2023-10-09 11:41:15,359 INFO [train.py:1099] (3/4) Training started
2023-10-09 11:41:15,360 INFO [train.py:1109] (3/4) Device: cuda:3
2023-10-09 11:41:15,397 INFO [train.py:1121] (3/4) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 500, 'reset_interval': 2000, 'valid_interval': 20000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.23.4', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '0d7ef1a7867f70354ab5c59f2feb98c45558dcc7', 'k2-git-date': 'Sat Mar 18 12:59:04 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.0', 'torch-cuda-available': True, 'torch-cuda-version': '11.8', 'python-version': '3.1', 'icefall-git-branch': 'master', 'icefall-git-sha1': '9a94348-dirty', 'icefall-git-date': 'Wed Sep 20 16:11:36 2023', 'icefall-path': '/mnt/lustre/sjtu/home/yfy62/icefall-phone2', 'k2-path': '/home/yfy62/anaconda3/envs/icefall/lib/python3.10/site-packages/k2-1.23.4.dev20230319+cuda11.8.torch2.0.0-py3.10-linux-x86_64.egg/k2/__init__.py', 'lhotse-path': '/home/yfy62/anaconda3/envs/icefall/lib/python3.10/site-packages/lhotse/__init__.py', 'hostname': 'd3-hpc-sjtu-test-004', 'IP address': '10.11.11.11'}, 'world_size': 4, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 30, 'start_epoch': 1, 'start_batch': 0, 'exp_dir': PosixPath('zipformer/exp_XL_bpe'), 'bpe_model': 'data/lang_bpe_500/bpe.model', 'base_lr': 0.045, 'lr_batches': 7500, 'lr_epochs': 1.0, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'ctc_loss_scale': 0.2, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 8000, 'keep_last_k': 30, 'average_period': 200, 'use_fp16': True, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 700, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'input_strategy': 'PrecomputedFeatures', 'subset': 'XL', 'small_dev': False, 'blank_id': 0, 'vocab_size': 500}
2023-10-09 11:41:15,398 INFO [train.py:1123] (3/4) About to create model
2023-10-09 11:41:16,101 INFO [train.py:1127] (3/4) Number of model parameters: 65549011
2023-10-09 11:41:17,982 INFO [train.py:1142] (3/4) Using DDP
2023-10-09 11:41:18,769 INFO [asr_datamodule.py:396] (3/4) About to get train XL cuts
2023-10-09 11:41:18,773 INFO [asr_datamodule.py:405] (3/4) Loading GigaSpeech 1000 splits in lazy mode
2023-10-09 11:42:05,507 INFO [asr_datamodule.py:230] (3/4) Enable MUSAN
2023-10-09 11:42:05,507 INFO [asr_datamodule.py:231] (3/4) About to get Musan cuts
2023-10-09 11:42:07,981 INFO [asr_datamodule.py:255] (3/4) Enable SpecAugment
2023-10-09 11:42:07,982 INFO [asr_datamodule.py:256] (3/4) Time warp factor: 80
2023-10-09 11:42:07,982 INFO [asr_datamodule.py:266] (3/4) Num frame mask: 10
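The "Enable MUSAN" / "Enable SpecAugment" / "Time warp factor: 80" / "Num frame mask: 10" lines correspond to lhotse data-augmentation transforms set up by the datamodule. A minimal sketch, assuming lhotse 1.16 argument names; the MUSAN manifest path and the mask-size values not shown in the log are assumptions:

```python
from lhotse import CutSet
from lhotse.dataset import CutMix, SpecAugment

# Assumed manifest path under the configured manifest_dir (data/fbank).
musan_cuts = CutSet.from_file("data/fbank/musan_cuts.jsonl.gz")

cut_transforms = [
    # Mix MUSAN noise into training cuts ("Enable MUSAN" above).
    # Note: the probability argument was called `prob` in older lhotse releases.
    CutMix(cuts=musan_cuts, p=0.5, snr=(10, 20), preserve_id=True),
]

input_transforms = [
    # Matches the logged settings: time warp factor 80, 10 frame masks.
    SpecAugment(
        time_warp_factor=80,
        num_frame_masks=10,
        features_mask_size=27,   # assumed; not shown in the log
        num_feature_masks=2,     # assumed
        frames_mask_size=100,    # assumed
    ),
]
```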
11:42:07,982 INFO [asr_datamodule.py:266] (3/4) Num frame mask: 10 2023-10-09 11:42:07,982 INFO [asr_datamodule.py:279] (3/4) About to create train dataset 2023-10-09 11:42:07,982 INFO [asr_datamodule.py:306] (3/4) Using DynamicBucketingSampler. 2023-10-09 11:42:19,143 INFO [asr_datamodule.py:321] (3/4) About to create train dataloader 2023-10-09 11:42:19,144 INFO [asr_datamodule.py:420] (3/4) About to get dev cuts 2023-10-09 11:42:19,146 INFO [asr_datamodule.py:352] (3/4) About to create dev dataset 2023-10-09 11:42:19,400 INFO [asr_datamodule.py:366] (3/4) About to create dev dataloader 2023-10-09 11:42:47,039 INFO [train.py:1031] (3/4) Epoch 1, batch 0, loss[loss=7.806, simple_loss=7.109, pruned_loss=6.959, over 15933.00 frames. ], tot_loss[loss=7.806, simple_loss=7.109, pruned_loss=6.959, over 15933.00 frames. ], batch size: 43, lr: 2.25e-02, grad_scale: 1.0 2023-10-09 11:42:47,039 INFO [train.py:1054] (3/4) Computing validation loss 2023-10-09 11:42:54,840 INFO [train.py:1063] (3/4) Epoch 1, validation: loss=7.75, simple_loss=7.06, pruned_loss=6.883, over 1020973.00 frames. 2023-10-09 11:42:54,840 INFO [train.py:1064] (3/4) Maximum memory allocated so far is 13384MB 2023-10-09 11:42:57,804 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=0.0, ans=0.9 2023-10-09 11:43:07,805 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=46.666666666666664, ans=0.24953333333333333 2023-10-09 11:43:12,811 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=42.19 vs. limit=7.535 2023-10-09 11:43:20,338 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=71.24 vs. limit=4.037333333333334 2023-10-09 11:43:22,781 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=93.33333333333333, ans=7.535 2023-10-09 11:43:24,645 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=301.24 vs. limit=7.535 2023-10-09 11:43:30,856 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=86.92 vs. limit=7.5525 2023-10-09 11:43:38,285 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=262.92 vs. limit=7.5525 2023-10-09 11:43:51,306 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=186.66666666666666, ans=0.49125 2023-10-09 11:44:06,142 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=75.46 vs. limit=7.5875 2023-10-09 11:44:11,067 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=95.42 vs. limit=7.71 2023-10-09 11:44:11,207 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.92 vs. 
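The ScheduledFloat records (scaling.py:199) report float-valued hyperparameters that vary piecewise-linearly with the global batch count. A minimal sketch of what such a record reflects, simplified from the idea behind ScheduledFloat in zipformer's scaling.py (the second schedule point is an assumption for illustration):

```python
class ScheduledFloat:
    """A float whose value is piecewise-linear in a global batch_count."""

    def __init__(self, *points, default=0.0):
        # points: (batch_count, value) pairs, e.g. (0.0, 0.9), (20000.0, 1.0)
        self.points = sorted(points)
        self.batch_count = 0.0
        self.default = default

    def __float__(self):
        pts = self.points
        if not pts:
            return self.default
        x = self.batch_count
        if x <= pts[0][0]:
            return float(pts[0][1])
        if x >= pts[-1][0]:
            return float(pts[-1][1])
        for (x0, y0), (x1, y1) in zip(pts[:-1], pts[1:]):
            if x0 <= x <= x1:
                # Linear interpolation between the two bracketing points.
                return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

# First record above: bypass.scale_min evaluates to 0.9 at batch_count=0.0.
sched = ScheduledFloat((0.0, 0.9), (20000.0, 1.0))  # end point assumed
sched.batch_count = 0.0
assert abs(float(sched) - 0.9) < 1e-9
```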
2023-10-09 11:44:11,207 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.92 vs. limit=3.042
2023-10-09 11:44:12,331 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=280.0, ans=0.09825
2023-10-09 11:44:16,965 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=21.64 vs. limit=7.71
2023-10-09 11:44:26,207 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=16.52 vs. limit=5.081666666666667
2023-10-09 11:44:39,014 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=12.99 vs. limit=4.149333333333333
2023-10-09 11:44:39,951 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=373.3333333333333, ans=0.4825
2023-10-09 11:44:51,183 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.min_positive, batch_count=420.0, ans=0.2458
2023-10-09 11:44:52,351 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=420.0, ans=0.09055
2023-10-09 11:44:52,585 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=13.85 vs. limit=4.168
2023-10-09 11:44:58,158 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=26.43 vs. limit=7.6575
2023-10-09 11:45:02,197 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=26.00 vs. limit=7.85
2023-10-09 11:45:05,950 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 8.077e+01 1.311e+02 3.462e+02 3.055e+03 2.464e+04, threshold=6.925e+02, percent-clipped=0.0
2023-10-09 11:45:11,309 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=16.52 vs. limit=4.1866666666666665
2023-10-09 11:45:15,300 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=513.3333333333334, ans=0.04839583333333333
2023-10-09 11:45:20,618 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=23.68 vs. limit=5.256666666666667
2023-10-09 11:45:27,116 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=112.99 vs. limit=5.0
2023-10-09 11:45:29,082 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=22.30 vs. limit=5.28
2023-10-09 11:45:31,165 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=270.29 vs. limit=5.28
2023-10-09 11:45:37,730 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=59.32 vs. limit=5.303333333333334
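The Whitening records (scaling.py:979) compare a per-module "whitening metric" against a scheduled limit. One plausible definition, mirroring the idea behind the Whiten module in zipformer's scaling.py (not its exact code): a quantity that equals 1.0 when the per-group feature covariance is proportional to the identity and grows as the features become more anisotropic.

```python
import torch

def whitening_metric(x: torch.Tensor, num_groups: int) -> torch.Tensor:
    # x: (..., num_channels); channels are split into `num_groups` groups.
    num_channels = x.shape[-1]
    cpg = num_channels // num_groups  # channels per group
    x = x.reshape(-1, num_groups, cpg).transpose(0, 1)     # (groups, frames, cpg)
    cov = torch.matmul(x.transpose(1, 2), x) / x.shape[1]  # per-group covariance
    # trace(C^2) * d / trace(C)^2 == d * sum(eig^2) / (sum(eig))^2 >= 1,
    # with equality iff all eigenvalues are equal (i.e. whitened features).
    tr = cov.diagonal(dim1=-2, dim2=-1).sum(-1)
    tr_sq = torch.matmul(cov, cov).diagonal(dim1=-2, dim2=-1).sum(-1)
    return (tr_sq * cpg / tr.pow(2).clamp(min=1e-20)).mean()

x = torch.randn(1000, 128)
print(whitening_metric(x, num_groups=4))  # ~1.0 for white Gaussian features
```

When the logged metric exceeds the limit (e.g. metric=301.24 vs. limit=7.535 above), a penalty gradient pushes the features back toward whiteness.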
2023-10-09 11:45:40,641 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=120.76 vs. limit=7.955
2023-10-09 11:45:44,767 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.88 vs. limit=3.091
2023-10-09 11:45:47,047 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=12.29 vs. limit=4.242666666666667
2023-10-09 11:45:47,138 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=38.05 vs. limit=7.7275
2023-10-09 11:45:48,390 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=606.6666666666666, ans=0.4715625
2023-10-09 11:45:52,758 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=653.3333333333334, ans=0.469375
2023-10-09 11:45:54,313 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.84 vs. limit=5.163333333333333
2023-10-09 11:45:55,285 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=127.84 vs. limit=7.745
2023-10-09 11:46:06,135 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=176.19 vs. limit=5.35
2023-10-09 11:46:08,379 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=188.10 vs. limit=7.7625
2023-10-09 11:46:10,318 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=700.0, ans=7.7625
2023-10-09 11:46:17,782 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=18.15 vs. limit=8.06
2023-10-09 11:46:18,726 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=16.34 vs. limit=5.373333333333333
2023-10-09 11:46:25,708 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=39.50 vs. limit=7.78
2023-10-09 11:46:40,643 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=8.84 vs. limit=4.336
2023-10-09 11:46:43,255 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=52.75 vs. limit=7.815
2023-10-09 11:46:43,423 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=27.34 vs. limit=7.815
2023-10-09 11:46:53,539 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=886.6666666666666, ans=0.08005000000000001
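The balancer.* names in the ScheduledFloat lines (min_positive, max_abs, prob, min_abs) are schedules feeding activation balancers: constraints on per-channel statistics such as the fraction of positive values and the mean absolute value. A schematic of the constraint being measured, assuming a soft differentiable penalty formulation (the real Balancer in scaling.py acts on gradients directly and is considerably more elaborate):

```python
import torch

def balancer_penalty(x, min_positive=0.05, max_positive=0.95, max_abs=6.0):
    """Differentiable penalty that is ~0 when, per channel (last dim), the
    fraction of positive activations lies in [min_positive, max_positive]
    and the mean |x| stays below max_abs. Added to the loss with a small
    scale, it nudges channels back into range."""
    dims = tuple(range(x.dim() - 1))
    pos_frac = torch.sigmoid(4.0 * x).mean(dims)  # soft proxy for P(x > 0)
    amp = x.abs().mean(dims)
    return (torch.relu(min_positive - pos_frac)
            + torch.relu(pos_frac - max_positive)
            + torch.relu(amp - max_abs)).sum()

x = torch.randn(8, 100, 256)
print(balancer_penalty(x))  # ~0 for well-behaved activations
```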
2023-10-09 11:46:54,756 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=28.33 vs. limit=7.8325
2023-10-09 11:46:57,648 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=886.6666666666666, ans=5.554166666666666
2023-10-09 11:46:58,894 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=886.6666666666666, ans=0.2911333333333333
2023-10-09 11:47:09,197 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.523e+01 8.467e+01 9.386e+01 1.115e+02 3.085e+03, threshold=1.877e+02, percent-clipped=1.0
2023-10-09 11:47:10,518 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=933.3333333333334, ans=0.45625
2023-10-09 11:47:23,045 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=12.52 vs. limit=5.245
2023-10-09 11:47:24,244 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=77.53 vs. limit=7.8675
2023-10-09 11:47:29,400 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1026.6666666666667, ans=0.28973333333333334
2023-10-09 11:47:30,365 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1026.6666666666667, ans=0.8640666666666666
2023-10-09 11:47:32,908 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1.whitening_limit, batch_count=1026.6666666666667, ans=5.256666666666667
2023-10-09 11:47:41,931 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.09 vs. limit=8.305
2023-10-09 11:47:41,964 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=33.97 vs. limit=7.9025
2023-10-09 11:47:42,108 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=16.32 vs. limit=7.9025
2023-10-09 11:47:44,649 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=9.09 vs. limit=4.429333333333333
2023-10-09 11:47:52,306 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=49.32 vs. limit=7.9025
2023-10-09 11:48:01,744 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=175.04 vs. limit=5.5600000000000005
2023-10-09 11:48:03,369 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.32 vs. limit=5.28
2023-10-09 11:48:11,933 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=133.70 vs. limit=5.583333333333333
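The optim.py:471 records print five quantiles (min, 25%, median, 75%, max) of recent gradient norms plus a clipping threshold; in every record above the threshold equals Clipping_scale (2.0) times the logged median (e.g. 2.0 x 3.462e+02 = 6.925e+02 in the first one, 2.0 x 9.386e+01 = 1.877e+02 here). A sketch of that bookkeeping, assuming a sliding window of per-step gradient norms (in icefall this logic lives inside the optimizer):

```python
import collections
import torch

class GradNormClipper:
    def __init__(self, clipping_scale=2.0, window=128):
        self.scale = clipping_scale
        self.norms = collections.deque(maxlen=window)
        self.clipped = 0
        self.total = 0

    def __call__(self, params):
        params = [p for p in params if p.grad is not None]
        # Global gradient norm of this step.
        norm = torch.norm(torch.stack([p.grad.norm() for p in params])).item()
        self.norms.append(norm)
        q = torch.quantile(torch.tensor(list(self.norms)),
                           torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = self.scale * q[2].item()  # 2.0 x median, as logged
        self.total += 1
        if norm > threshold:
            self.clipped += 1
            for p in params:
                p.grad.mul_(threshold / norm)
        # Returns the logged quantities: quartiles, threshold, percent-clipped.
        return q, threshold, 100.0 * self.clipped / self.total
```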
2023-10-09 11:48:13,567 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.67 vs. limit=8.375
2023-10-09 11:48:14,598 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=57.24 vs. limit=7.9375
2023-10-09 11:48:23,769 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=8.08 vs. limit=4.466666666666667
2023-10-09 11:48:26,690 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1213.3333333333333, ans=0.28786666666666666
2023-10-09 11:48:29,789 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1213.3333333333333, ans=0.1545
2023-10-09 11:48:32,033 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.44 vs. limit=8.41
2023-10-09 11:48:41,173 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.51 vs. limit=8.445
2023-10-09 11:48:41,241 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=9.44 vs. limit=7.9725
2023-10-09 11:48:46,022 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1260.0, ans=0.5
2023-10-09 11:48:50,078 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=23.70 vs. limit=7.99
2023-10-09 11:49:04,557 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1353.3333333333333, ans=0.14925
2023-10-09 11:49:05,598 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1353.3333333333333, ans=0.28646666666666665
2023-10-09 11:49:11,389 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=1353.3333333333333, ans=0.009413333333333333
2023-10-09 11:49:12,572 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1353.3333333333333, ans=0.4365625
2023-10-09 11:49:18,211 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.74 vs. limit=5.7
2023-10-09 11:49:18,765 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 7.179e+01 9.899e+01 1.175e+02 1.431e+02 2.633e+02, threshold=2.350e+02, percent-clipped=8.0
2023-10-09 11:49:19,283 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1400.0, ans=0.434375
2023-10-09 11:49:21,704 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1400.0, ans=0.325
2023-10-09 11:49:29,488 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=16.64 vs. limit=8.0425
2023-10-09 11:49:35,103 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=6.28 vs. limit=4.578666666666667
2023-10-09 11:49:40,316 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=13.97 vs. limit=8.06
2023-10-09 11:49:50,490 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1493.3333333333333, ans=0.43
2023-10-09 11:49:50,734 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.19 vs. limit=8.620000000000001
2023-10-09 11:50:05,819 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=13.36 vs. limit=8.095
2023-10-09 11:50:16,518 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.60 vs. limit=8.725
2023-10-09 11:50:19,032 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.99 vs. limit=5.408333333333333
2023-10-09 11:50:22,130 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.85 vs. limit=8.1125
2023-10-09 11:50:25,368 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.95 vs. limit=8.1125
2023-10-09 11:50:26,555 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1633.3333333333333, ans=0.4234375
2023-10-09 11:50:27,794 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.33 vs. limit=5.84
2023-10-09 11:50:29,306 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=6.35 vs. limit=4.672
2023-10-09 11:50:30,178 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1680.0, ans=0.137
2023-10-09 11:50:32,646 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1680.0, ans=0.42125
2023-10-09 11:50:44,036 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1726.6666666666667, ans=6.079166666666667
2023-10-09 11:50:54,352 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=39.60 vs. limit=8.165
2023-10-09 11:50:54,493 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=12.19 vs. limit=8.165
2023-10-09 11:50:56,238 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.84 vs. limit=8.165
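Names like bypass.skip_rate, conv_skip_rate, attention_skip_rate and pos_emb_skip_rate in the ScheduledFloat lines are scheduled probabilities of stochastically skipping a sub-module during training, a regularizer whose rate decays as batch_count grows. A minimal sketch of the pattern, assuming a residual layout:

```python
import torch
import torch.nn as nn

class StochasticBypass(nn.Module):
    """Wraps a sub-module and skips it with probability skip_rate while
    training; at eval time the module always runs."""

    def __init__(self, module: nn.Module, skip_rate: float = 0.5):
        super().__init__()
        self.module = module
        self.skip_rate = skip_rate  # in training this would be a ScheduledFloat

    def forward(self, x):
        if self.training and torch.rand(()) < self.skip_rate:
            return x                # skip the sub-module entirely this step
        return x + self.module(x)   # residual application otherwise

layer = StochasticBypass(nn.Linear(256, 256), skip_rate=0.5)
```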
2023-10-09 11:51:02,543 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.52 vs. limit=8.83
2023-10-09 11:51:03,164 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1820.0, ans=0.26134999999999997
2023-10-09 11:51:07,941 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=12.89 vs. limit=8.865
2023-10-09 11:51:18,698 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 8.514e+01 1.131e+02 1.381e+02 1.799e+02 3.070e+02, threshold=2.763e+02, percent-clipped=7.0
2023-10-09 11:51:31,714 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1913.3333333333333, ans=0.05695
2023-10-09 11:51:40,506 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.93 vs. limit=8.97
2023-10-09 11:51:46,627 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1960.0, ans=0.408125
2023-10-09 11:51:47,142 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.68 vs. limit=8.235
2023-10-09 11:51:50,198 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=25.48 vs. limit=8.235
2023-10-09 11:51:56,371 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2006.6666666666667, ans=0.8297666666666667
2023-10-09 11:52:00,349 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.19 vs. limit=8.2525
2023-10-09 11:52:19,580 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=11.45 vs. limit=9.040000000000001
2023-10-09 11:52:20,391 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2100.0, ans=0.2375
2023-10-09 11:52:20,598 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=11.59 vs. limit=9.075
2023-10-09 11:52:24,142 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.89 vs. limit=8.2875
2023-10-09 11:52:28,826 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.04 vs. limit=8.2875
2023-10-09 11:52:29,630 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2100.0, ans=0.27899999999999997
2023-10-09 11:52:29,862 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.84 vs. limit=9.075
2023-10-09 11:52:37,202 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=2146.6666666666665, ans=0.2785333333333333
2023-10-09 11:52:45,970 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2193.3333333333335, ans=0.3971875
2023-10-09 11:52:54,490 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=19.32 vs. limit=8.3225
2023-10-09 11:53:01,438 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=2240.0, ans=0.0496
2023-10-09 11:53:05,144 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.25 vs. limit=9.18
2023-10-09 11:53:16,963 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2286.6666666666665, ans=0.21416666666666667
2023-10-09 11:53:23,638 INFO [train.py:1031] (3/4) Epoch 1, batch 500, loss[loss=0.8931, simple_loss=0.7623, pruned_loss=0.6954, over 16849.00 frames. ], tot_loss[loss=1.277, simple_loss=1.111, pruned_loss=1.15, over 7281188.37 frames. ], batch size: 130, lr: 4.49e-02, grad_scale: 8.0
2023-10-09 11:53:26,466 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.36 vs. limit=8.375
2023-10-09 11:53:26,984 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.209e+02 1.842e+02 2.459e+02 3.238e+02 5.482e+02, threshold=4.917e+02, percent-clipped=35.0
2023-10-09 11:53:36,200 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2380.0, ans=0.11075
2023-10-09 11:53:41,728 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=18.33 vs. limit=8.3925
2023-10-09 11:53:43,606 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=2380.0, ans=0.3884375
2023-10-09 11:53:48,785 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=9.98 vs. limit=9.32
2023-10-09 11:53:50,538 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=2426.6666666666665, ans=0.109
2023-10-09 11:53:50,961 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=15.68 vs. limit=8.41
2023-10-09 11:53:52,695 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=2426.6666666666665, ans=0.8150666666666667
2023-10-09 11:53:55,162 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.44 vs. limit=8.41
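The lr values in the batch logs (2.25e-02 at batch 0, 4.49e-02 at batch 500, 4.48e-02 at batch 1000) are consistent with icefall's Eden schedule at base_lr=0.045 and the configured lr_batches=7500, lr_epochs=1.0: the rate decays in both batch and epoch, with a warm-up factor that starts at 0.5. A sketch, assuming the usual warmup_batches=500 default:

```python
def eden_lr(base_lr, batch, epoch, lr_batches=7500.0, lr_epochs=1.0,
            warmup_batches=500.0):
    batch_factor = ((batch**2 + lr_batches**2) / lr_batches**2) ** -0.25
    epoch_factor = ((epoch**2 + lr_epochs**2) / lr_epochs**2) ** -0.25
    warmup = 1.0 if batch >= warmup_batches else 0.5 + 0.5 * batch / warmup_batches
    return base_lr * batch_factor * epoch_factor * warmup

print(eden_lr(0.045, batch=0, epoch=0))     # 0.0225  -> "lr: 2.25e-02"
print(eden_lr(0.045, batch=500, epoch=0))   # ~0.0449 -> "lr: 4.49e-02"
print(eden_lr(0.045, batch=1500, epoch=0))  # ~0.0446 -> "lr: 4.46e-02"
```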
2023-10-09 11:53:57,208 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.33 vs. limit=4.970666666666666
2023-10-09 11:53:58,813 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.66 vs. limit=5.618333333333333
2023-10-09 11:54:06,572 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2473.3333333333335, ans=0.3840625
2023-10-09 11:54:19,773 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.85 vs. limit=5.008
2023-10-09 11:54:21,856 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.30 vs. limit=8.4625
2023-10-09 11:54:44,778 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2613.3333333333335, ans=0.102
2023-10-09 11:54:51,910 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=10.24 vs. limit=9.495000000000001
2023-10-09 11:54:53,722 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2660.0, ans=0.2734
2023-10-09 11:54:55,800 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2660.0, ans=0.3753125
2023-10-09 11:54:57,702 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.82 vs. limit=8.4975
2023-10-09 11:55:12,144 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.83 vs. limit=8.5325
2023-10-09 11:55:14,148 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=2753.3333333333335, ans=0.37093750000000003
2023-10-09 11:55:18,766 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=5.02 vs. limit=5.101333333333334
2023-10-09 11:55:27,735 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.411e+02 2.382e+02 3.435e+02 4.686e+02 1.303e+03, threshold=6.869e+02, percent-clipped=20.0
2023-10-09 11:55:30,812 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2800.0, ans=0.27199999999999996
2023-10-09 11:55:46,105 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2846.6666666666665, ans=0.3665625
2023-10-09 11:55:57,893 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=2893.3333333333335, ans=0.364375
2023-10-09 11:55:58,803 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=2893.3333333333335, ans=0.2710666666666667
2023-10-09 11:56:07,012 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.14 vs. limit=9.705
2023-10-09 11:56:16,774 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=9.78 vs. limit=9.74
2023-10-09 11:56:29,616 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=3033.3333333333335, ans=0.03174999999999999
2023-10-09 11:56:52,151 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=3126.6666666666665, ans=0.07
2023-10-09 11:56:53,855 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=3126.6666666666665, ans=0.35343749999999996
2023-10-09 11:56:57,865 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=3173.3333333333335, ans=0.35125
2023-10-09 11:57:03,946 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.55 vs. limit=8.69
2023-10-09 11:57:11,590 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=3220.0, ans=0.3490625
2023-10-09 11:57:14,951 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=2.576e+00
2023-10-09 11:57:23,635 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.562e+02 2.664e+02 4.137e+02 6.547e+02 1.066e+03, threshold=8.274e+02, percent-clipped=23.0
2023-10-09 11:57:39,968 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=3313.3333333333335, ans=0.26686666666666664
2023-10-09 11:57:44,392 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=3360.0, ans=7.1
2023-10-09 11:57:51,893 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.99 vs. limit=6.68
2023-10-09 11:57:59,525 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=3406.6666666666665, ans=0.3403125
2023-10-09 11:58:04,656 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=3406.6666666666665, ans=0.3403125
2023-10-09 11:58:11,510 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=3453.3333333333335, ans=0.338125
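The WithLoss records (scaling.py:1069) report the running sum of an auxiliary loss attached to a tensor, here attention weights. A hedged sketch of the generic pattern such a hook suggests: an identity op whose backward also injects the gradient of a side loss, so the penalty shapes the module without changing its forward output (the illustrative entropy-style penalty below is an assumption, not icefall's actual term):

```python
import torch

class WithAuxLoss(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, aux_loss_fn):
        ctx.aux_loss_fn = aux_loss_fn
        ctx.save_for_backward(x)
        return x  # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        with torch.enable_grad():
            xd = x.detach().requires_grad_(True)
            loss = ctx.aux_loss_fn(xd)      # the logged "loss-sum" quantity
            (g,) = torch.autograd.grad(loss, xd)
        return grad_out + g, None

# Illustrative usage: lightly penalize over-peaked attention distributions.
attn = torch.softmax(torch.randn(4, 8, 100, 100), dim=-1).requires_grad_(True)
out = WithAuxLoss.apply(
    attn, lambda w: 1e-4 * (w * w.clamp(min=1e-9).log()).sum())
out.sum().backward()  # attn.grad now includes the side-loss gradient
```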
2023-10-09 11:58:29,428 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.84 vs. limit=8.8125
2023-10-09 11:58:32,135 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=3546.6666666666665, ans=0.7758666666666667
2023-10-09 11:58:52,513 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=3593.3333333333335, ans=0.3315625
2023-10-09 11:59:01,488 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=3640.0, ans=0.7726
2023-10-09 11:59:03,907 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=3640.0, ans=0.044999999999999984
2023-10-09 11:59:14,083 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=10.29 vs. limit=10.265
2023-10-09 11:59:21,458 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.758e+02 2.771e+02 4.569e+02 6.440e+02 2.053e+03, threshold=9.137e+02, percent-clipped=16.0
2023-10-09 11:59:30,451 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=3780.0, ans=0.7878
2023-10-09 11:59:32,727 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=3780.0, ans=0.058249999999999996
2023-10-09 11:59:41,273 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=3826.6666666666665, ans=0.013899999999999996
2023-10-09 11:59:45,878 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=3826.6666666666665, ans=0.320625
2023-10-09 11:59:51,491 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.52 vs. limit=8.935
2023-10-09 12:00:28,135 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.94 vs. limit=8.9875
2023-10-09 12:00:33,019 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=5.10 vs. limit=5.586666666666667
2023-10-09 12:00:36,557 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=4013.3333333333335, ans=0.07491666666666667
2023-10-09 12:00:40,924 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.12 vs. limit=9.004999999999999
2023-10-09 12:00:55,845 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.68 vs. limit=7.029999999999999
2023-10-09 12:00:58,167 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.03 vs. limit=10.58
2023-10-09 12:01:02,658 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=4106.666666666667, ans=0.3075
2023-10-09 12:01:08,046 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=4153.333333333333, ans=0.3053125
2023-10-09 12:01:14,175 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=4153.333333333333, ans=0.3053125
2023-10-09 12:01:24,184 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.869e+02 2.785e+02 4.442e+02 8.351e+02 2.552e+03, threshold=8.884e+02, percent-clipped=21.0
2023-10-09 12:01:42,102 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=4293.333333333333, ans=0.04877777777777778
2023-10-09 12:01:47,794 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=4293.333333333333, ans=0.009936231884057971
2023-10-09 12:01:56,219 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=4340.0, ans=0.2566
2023-10-09 12:02:28,669 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=10.79 vs. limit=10.825
2023-10-09 12:02:36,781 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.71 vs. limit=3.672
2023-10-09 12:02:45,888 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=4526.666666666667, ans=0.2547333333333333
2023-10-09 12:02:46,838 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=4526.666666666667, ans=0.2878125
2023-10-09 12:02:47,002 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.97 vs. limit=6.131666666666667
2023-10-09 12:02:52,078 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=11.07 vs. limit=10.895
2023-10-09 12:03:12,533 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.34 vs. limit=6.155
2023-10-09 12:03:15,451 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=4666.666666666667, ans=0.7366666666666667
2023-10-09 12:03:16,433 INFO [train.py:1031] (3/4) Epoch 1, batch 1000, loss[loss=0.5628, simple_loss=0.5261, pruned_loss=0.3059, over 16185.00 frames. ], tot_loss[loss=0.9557, simple_loss=0.8378, pruned_loss=0.7627, over 12905453.14 frames. ], batch size: 50, lr: 4.48e-02, grad_scale: 8.0
2023-10-09 12:03:20,446 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.520e+02 3.290e+02 5.657e+02 8.681e+02 2.028e+03, threshold=1.131e+03, percent-clipped=23.0
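The loss[...] fields in the train.py:1031 records are pruned-transducer quantities: a cheap "simple" joiner loss produces pruning bounds, then the full joiner is evaluated only inside a band of prune_range=5 symbols, and the two are combined with warm-up-dependent weights. A sketch assuming k2's pruned RNN-T API (argument names per recent k2 releases; `joiner` is the model's joiner network and is assumed here). The scales mirror the config: lm_scale=0.25, am_scale=0.0, simple_loss_scale=0.5, warm_step=2000.

```python
import k2

def transducer_loss(am, lm, symbols, boundary, joiner, blank_id=0, prune_range=5):
    # Simple loss over decoder (lm) and encoder (am) projections; its gradients
    # px_grad/py_grad are used to choose the pruning band.
    simple_loss, (px_grad, py_grad) = k2.rnnt_loss_smoothed(
        lm=lm, am=am, symbols=symbols, termination_symbol=blank_id,
        lm_only_scale=0.25, am_only_scale=0.0,
        boundary=boundary, reduction="sum", return_grad=True,
    )
    ranges = k2.get_rnnt_prune_ranges(
        px_grad=px_grad, py_grad=py_grad, boundary=boundary, s_range=prune_range,
    )
    am_pruned, lm_pruned = k2.do_rnnt_pruning(am=am, lm=lm, ranges=ranges)
    logits = joiner(am_pruned, lm_pruned)  # full joiner, only inside the band
    pruned_loss = k2.rnnt_loss_pruned(
        logits=logits, symbols=symbols, ranges=ranges,
        termination_symbol=blank_id, boundary=boundary, reduction="sum",
    )
    return simple_loss, pruned_loss

def loss_scales(batch_idx, warm_step=2000, simple_loss_scale=0.5):
    # Warm-up weighting consistent with the logged totals:
    #   batch 0:    (1.0, 0.1):    1.0*7.109   + 0.1*6.959   = 7.805  (logged 7.806)
    #   batch 1000: (0.75, 0.55):  0.75*0.5261 + 0.55*0.3059 = 0.5628 (logged 0.5628)
    s = simple_loss_scale
    if batch_idx >= warm_step:
        return s, 1.0
    frac = batch_idx / warm_step
    return 1.0 - frac * (1.0 - s), 0.1 + 0.9 * frac
```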
2023-10-09 12:03:21,064 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=15.55 vs. limit=11.0
2023-10-09 12:03:24,552 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=4666.666666666667, ans=0.28125
2023-10-09 12:03:30,615 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=4713.333333333333, ans=0.009844927536231885
2023-10-09 12:03:41,340 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=4760.0, ans=0.009834782608695653
2023-10-09 12:03:41,473 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.30 vs. limit=5.904
2023-10-09 12:03:46,665 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.00 vs. limit=9.285
2023-10-09 12:03:50,033 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=4806.666666666667, ans=0.04663888888888889
2023-10-09 12:03:57,593 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.63 vs. limit=9.3025
2023-10-09 12:04:00,251 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=4853.333333333333, ans=0.27249999999999996
2023-10-09 12:04:09,296 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=4853.333333333333, ans=0.25146666666666667
2023-10-09 12:04:12,440 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=4900.0, ans=0.04625
2023-10-09 12:04:14,672 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=4900.0, ans=0.7285
2023-10-09 12:04:24,829 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff3.min_abs, batch_count=4946.666666666667, ans=0.2
2023-10-09 12:04:29,186 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=15.40 vs. limit=11.21
2023-10-09 12:04:29,239 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.74 vs. limit=3.742
2023-10-09 12:04:34,788 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=4993.333333333333, ans=0.06879166666666667
2023-10-09 12:04:39,938 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=4993.333333333333, ans=0.045861111111111116
2023-10-09 12:04:48,211 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.28 vs. limit=9.39
2023-10-09 12:04:48,787 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=5040.0, ans=0.26375000000000004
2023-10-09 12:04:53,415 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=5086.666666666667, ans=0.045472222222222226
2023-10-09 12:04:54,480 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=5086.666666666667, ans=0.26156250000000003
2023-10-09 12:05:02,006 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=5086.666666666667, ans=0.045472222222222226
2023-10-09 12:05:09,685 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.781e+02 2.896e+02 4.699e+02 7.921e+02 2.032e+03, threshold=9.398e+02, percent-clipped=10.0
2023-10-09 12:05:21,967 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.24 vs. limit=9.442499999999999
2023-10-09 12:05:52,705 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=5273.333333333333, ans=0.0
2023-10-09 12:05:53,855 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=5273.333333333333, ans=0.009723188405797101
2023-10-09 12:05:54,828 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=5273.333333333333, ans=0.2528125
2023-10-09 12:06:07,436 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=5320.0, ans=0.2468
2023-10-09 12:06:11,925 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=5366.666666666667, ans=0.24843749999999998
2023-10-09 12:06:17,966 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.28 vs. limit=9.5125
2023-10-09 12:06:18,849 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten.whitening_limit, batch_count=5366.666666666667, ans=11.525
2023-10-09 12:06:34,593 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=5413.333333333333, ans=0.044111111111111115
2023-10-09 12:06:34,726 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten.whitening_limit, batch_count=5413.333333333333, ans=11.559999999999999
2023-10-09 12:06:46,106 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.11 vs. limit=9.5475
2023-10-09 12:06:54,310 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=5506.666666666667, ans=0.043722222222222225
2023-10-09 12:06:54,535 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.50 vs. limit=6.376666666666667
2023-10-09 12:06:56,858 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.min_positive, batch_count=5506.666666666667, ans=0.03279166666666666
2023-10-09 12:07:04,966 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.45 vs. limit=3.833
2023-10-09 12:07:06,306 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.62 vs. limit=3.833
2023-10-09 12:07:08,995 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-09 12:07:16,464 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.864e+02 2.917e+02 4.746e+02 7.001e+02 2.391e+03, threshold=9.492e+02, percent-clipped=17.0
2023-10-09 12:07:17,051 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.44 vs. limit=6.24
2023-10-09 12:07:18,781 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=2.539e-03
2023-10-09 12:07:23,417 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.83 vs. limit=9.6175
2023-10-09 12:07:26,317 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.28 vs. limit=11.735
2023-10-09 12:07:27,078 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.25 vs. limit=11.735
2023-10-09 12:07:54,906 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.07 vs. limit=9.6525
2023-10-09 12:08:22,977 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.78 vs. limit=7.9399999999999995
2023-10-09 12:08:48,858 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=5973.333333333333, ans=0.04177777777777778
2023-10-09 12:08:56,459 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=6020.0, ans=0.009560869565217392
2023-10-09 12:09:10,827 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.145e+02 3.626e+02 5.347e+02 8.264e+02 2.974e+03, threshold=1.069e+03, percent-clipped=18.0
2023-10-09 12:09:28,877 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=6160.0, ans=0.21125
2023-10-09 12:09:33,109 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=6160.0, ans=0.0
2023-10-09 12:09:34,200 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=6160.0, ans=0.21125
2023-10-09 12:09:35,838 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=6160.0, ans=0.6844
2023-10-09 12:09:40,604 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.77 vs. limit=9.8275
2023-10-09 12:09:41,572 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.60 vs. limit=12.155000000000001
2023-10-09 12:09:51,244 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=4.71 vs. limit=6.501333333333333
2023-10-09 12:09:56,006 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=6253.333333333333, ans=0.20687499999999998
2023-10-09 12:09:56,030 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff2.min_abs, batch_count=6253.333333333333, ans=0.1
2023-10-09 12:10:11,521 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=6346.666666666667, ans=0.6778666666666666
2023-10-09 12:10:12,579 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=16.64 vs. limit=9.879999999999999
2023-10-09 12:10:12,695 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.48 vs. limit=6.586666666666667
2023-10-09 12:10:13,367 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=6346.666666666667, ans=0.04022222222222222
2023-10-09 12:10:17,569 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=6346.666666666667, ans=0.23653333333333332
2023-10-09 12:10:23,044 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.75 vs. limit=12.295
2023-10-09 12:10:27,467 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=6393.333333333333, ans=0.23606666666666665
2023-10-09 12:10:50,023 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=6486.666666666667, ans=0.2973
2023-10-09 12:11:05,604 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.924e+02 2.894e+02 4.335e+02 6.370e+02 1.607e+03, threshold=8.670e+02, percent-clipped=8.0
2023-10-09 12:11:10,957 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=6580.0, ans=0.03925000000000001
2023-10-09 12:11:14,696 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=6580.0, ans=0.2987
2023-10-09 12:11:15,753 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=6580.0, ans=0.19156250000000002
2023-10-09 12:11:24,561 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=6626.666666666667, ans=0.03905555555555556
2023-10-09 12:11:38,595 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=14.60 vs. limit=12.504999999999999
2023-10-09 12:11:39,540 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.05 vs. limit=12.504999999999999
2023-10-09 12:11:43,583 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=6673.333333333333, ans=0.009418840579710146
2023-10-09 12:11:49,251 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.00 vs. limit=10.02
2023-10-09 12:11:57,138 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.22 vs. limit=10.0375
2023-10-09 12:12:02,013 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=6766.666666666667, ans=0.035
2023-10-09 12:12:18,407 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.95 vs. limit=4.022
2023-10-09 12:12:28,304 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=6860.0, ans=0.17843750000000003
2023-10-09 12:12:44,186 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.96 vs. limit=12.715
], batch size: 266, lr: 4.46e-02, grad_scale: 8.0 2023-10-09 12:13:01,880 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.071e+02 3.328e+02 5.079e+02 8.132e+02 1.447e+03, threshold=1.016e+03, percent-clipped=21.0 2023-10-09 12:13:18,258 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=7093.333333333333, ans=0.16749999999999998 2023-10-09 12:13:40,505 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=7140.0, ans=0.16531249999999997 2023-10-09 12:13:57,509 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=7233.333333333333, ans=0.22766666666666668 2023-10-09 12:14:02,181 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=7233.333333333333, ans=0.1609375 2023-10-09 12:14:09,743 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=7280.0, ans=0.15875 2023-10-09 12:14:21,452 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.74 vs. limit=10.2475 2023-10-09 12:14:35,217 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=7373.333333333333, ans=0.15437499999999998 2023-10-09 12:14:51,510 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=7466.666666666667, ans=10.3 2023-10-09 12:14:56,280 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.029e+02 3.125e+02 4.733e+02 7.250e+02 1.707e+03, threshold=9.465e+02, percent-clipped=8.0 2023-10-09 12:15:02,423 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=7513.333333333333, ans=0.1478125 2023-10-09 12:15:07,810 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=7513.333333333333, ans=0.22486666666666666 2023-10-09 12:15:07,880 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=7513.333333333333, ans=0.22486666666666666 2023-10-09 12:15:30,815 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=7606.666666666667, ans=0.034972222222222224 2023-10-09 12:15:33,505 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.87 vs. limit=6.901666666666667 2023-10-09 12:15:42,476 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.22 vs. limit=13.24 2023-10-09 12:16:05,859 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=7746.666666666667, ans=0.6288666666666667 2023-10-09 12:16:08,192 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=4.19 vs. 
limit=7.0986666666666665 2023-10-09 12:16:15,284 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=7793.333333333333, ans=0.13468750000000002 2023-10-09 12:16:20,012 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=7793.333333333333, ans=0.17206666666666665 2023-10-09 12:16:43,569 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=7886.666666666667, ans=0.1303125 2023-10-09 12:16:51,218 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=7933.333333333333, ans=0.128125 2023-10-09 12:16:51,887 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.082e+02 3.612e+02 5.663e+02 8.176e+02 1.459e+03, threshold=1.133e+03, percent-clipped=16.0 2023-10-09 12:16:53,160 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=7933.333333333333, ans=0.128125 2023-10-09 12:16:59,325 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=7980.0, ans=0.12593749999999998 2023-10-09 12:17:07,052 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=8026.666666666667, ans=0.125 2023-10-09 12:17:10,244 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=8026.666666666667, ans=0.00912463768115942 2023-10-09 12:17:37,283 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=8120.0, ans=0.6158 2023-10-09 12:17:44,459 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=8166.666666666667, ans=0.125 2023-10-09 12:17:55,108 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=8213.333333333334, ans=0.125 2023-10-09 12:18:29,248 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=8353.333333333334, ans=0.009053623188405796 2023-10-09 12:18:34,801 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.04 vs. limit=7.088333333333333 2023-10-09 12:18:36,543 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=8400.0, ans=0.07 2023-10-09 12:18:42,516 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.961e+02 3.379e+02 4.819e+02 7.326e+02 1.735e+03, threshold=9.639e+02, percent-clipped=7.0 2023-10-09 12:18:43,173 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=4.93 vs. limit=7.359999999999999 2023-10-09 12:18:54,512 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.86 vs. 
limit=9.223333333333333 2023-10-09 12:19:04,538 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=8493.333333333334, ans=0.21506666666666666 2023-10-09 12:19:07,472 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_positive, batch_count=8493.333333333334, ans=0.05 2023-10-09 12:19:19,132 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=8540.0, ans=0.2146 2023-10-09 12:19:32,438 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=8633.333333333334, ans=0.07 2023-10-09 12:19:53,082 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=8726.666666666666, ans=0.125 2023-10-09 12:20:02,834 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.44 vs. limit=10.772499999999999 2023-10-09 12:20:21,888 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=8820.0, ans=0.125 2023-10-09 12:20:32,416 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.146e+02 3.144e+02 4.479e+02 6.776e+02 1.167e+03, threshold=8.957e+02, percent-clipped=4.0 2023-10-09 12:20:32,624 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=8866.666666666666, ans=0.125 2023-10-09 12:20:37,872 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.84 vs. limit=10.8425 2023-10-09 12:20:42,332 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=8913.333333333334, ans=0.5880333333333334 2023-10-09 12:20:43,394 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=8913.333333333334, ans=0.125 2023-10-09 12:20:45,061 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=8913.333333333334, ans=0.029527777777777778 2023-10-09 12:20:46,120 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=8960.0, ans=0.008921739130434782 2023-10-09 12:20:51,377 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.43 vs. limit=14.219999999999999 2023-10-09 12:21:34,319 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.12 vs. limit=14.325 2023-10-09 12:21:58,366 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.93 vs. 
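
Each [scaling.py:979] line compares a per-module whitening metric against a scheduled limit (note the limits rising across this span, e.g. from ~9.8 toward ~14). The metric behaves like a measure of how far the activations' channel covariance is from a multiple of the identity: near 1 for well-spread ("white") features, approaching channels-per-group when the energy collapses onto one direction. One plausible estimator with exactly that range, offered as an illustrative sketch rather than the exact scaling.py computation:

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> torch.Tensor:
        # x: (..., num_channels). Returns ~1.0 for "white" features and up
        # to channels-per-group for rank-1 features; the real estimator in
        # scaling.py may differ in detail.
        x = x.reshape(-1, x.shape[-1])
        metrics = []
        for g in x.chunk(num_groups, dim=-1):
            g = g - g.mean(dim=0, keepdim=True)
            cov = (g.t() @ g) / g.shape[0]      # (d, d) channel covariance
            d = cov.shape[0]
            # d * trace(cov^2) / trace(cov)^2 equals 1 exactly when all
            # eigenvalues of cov are equal, i.e. cov is a multiple of I.
            metrics.append(d * (cov @ cov).diagonal().sum()
                           / cov.diagonal().sum().clamp(min=1e-20) ** 2)
        return torch.stack(metrics).mean()

A "metric=X vs. limit=Y" line with X > Y indicates the module exceeded its scheduled limit at that step, which is presumably when a corrective penalty would apply.
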
limit=10.9475 2023-10-09 12:22:00,524 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=9193.333333333334, ans=0.02836111111111111 2023-10-09 12:22:02,569 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=9193.333333333334, ans=0.125 2023-10-09 12:22:18,825 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.56 vs. limit=9.620000000000001 2023-10-09 12:22:23,111 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=9286.666666666666, ans=0.0 2023-10-09 12:22:24,039 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=9286.666666666666, ans=0.00885072463768116 2023-10-09 12:22:24,983 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=9286.666666666666, ans=0.00885072463768116 2023-10-09 12:22:32,367 INFO [train.py:1031] (3/4) Epoch 1, batch 2000, loss[loss=0.4749, simple_loss=0.4853, pruned_loss=0.2322, over 16851.00 frames. ], tot_loss[loss=0.673, simple_loss=0.6214, pruned_loss=0.4466, over 20772306.45 frames. ], batch size: 72, lr: 4.42e-02, grad_scale: 32.0 2023-10-09 12:22:38,667 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.852e+02 3.210e+02 4.167e+02 6.978e+02 1.336e+03, threshold=8.334e+02, percent-clipped=13.0 2023-10-09 12:22:45,536 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=9380.0, ans=0.027583333333333335 2023-10-09 12:22:48,268 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=9380.0, ans=0.2062 2023-10-09 12:23:21,661 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=9473.333333333334, ans=0.125 2023-10-09 12:23:31,877 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=9520.0, ans=0.125 2023-10-09 12:23:33,281 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=16.89 vs. 
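
The [scaling.py:199] entries are a schedule dump: each named hyper-parameter (skip rates, balancer probs, bypass scale_min, dropout_p, ...) is a ScheduledFloat whose current value ("ans") is recomputed from batch_count. The logged values are consistent with piecewise-linear interpolation between (batch_count, value) breakpoints; a sketch, with the breakpoints for conv_skip_rate inferred from the logged values rather than read from the recipe:

    def scheduled_float(batch_count, points):
        # Piecewise-linear interpolation over sorted (batch_count, value)
        # breakpoints; constant outside the first/last breakpoint.
        b0, v0 = points[0]
        if batch_count <= b0:
            return v0
        for b1, v1 in points[1:]:
            if batch_count <= b1:
                return v0 + (v1 - v0) * (batch_count - b0) / (b1 - b0)
            b0, v0 = b1, v1
        return v0

    # Reproduces the logged encoder.encoders.0.layers.0.conv_skip_rate
    # values, assuming breakpoints (0, 0.2), (4000, 0.05), (16000, 0.0):
    pts = [(0.0, 0.2), (4000.0, 0.05), (16000.0, 0.0)]
    assert abs(scheduled_float(8913.333333333334, pts)
               - 0.029527777777777778) < 1e-9
    assert abs(scheduled_float(9380.0, pts)
               - 0.027583333333333335) < 1e-9
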
limit=14.64 2023-10-09 12:23:47,008 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 12:24:05,695 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=9660.0, ans=0.2034 2023-10-09 12:24:08,551 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=9660.0, ans=0.125 2023-10-09 12:24:13,619 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=9706.666666666666, ans=0.5602666666666667 2023-10-09 12:24:17,060 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=9706.666666666666, ans=0.125 2023-10-09 12:24:48,570 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=9800.0, ans=0.0 2023-10-09 12:24:48,665 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=9800.0, ans=0.008739130434782609 2023-10-09 12:24:49,457 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.161e+02 3.735e+02 5.977e+02 7.499e+02 1.921e+03, threshold=1.195e+03, percent-clipped=15.0 2023-10-09 12:25:01,021 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.43 vs. limit=11.192499999999999 2023-10-09 12:25:01,034 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.44 vs. limit=4.477 2023-10-09 12:25:10,041 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.41 vs. limit=4.477 2023-10-09 12:25:19,481 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=9893.333333333334, ans=0.125 2023-10-09 12:25:25,673 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.13 vs. limit=14.92 2023-10-09 12:25:50,313 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.29 vs. limit=4.498 2023-10-09 12:25:58,244 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=10033.333333333334, ans=0.5488333333333334 2023-10-09 12:25:59,513 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.65 vs. limit=15.025 2023-10-09 12:26:12,505 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.29 vs. 
limit=15.059999999999999 2023-10-09 12:26:24,502 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=10126.666666666666, ans=0.125 2023-10-09 12:26:30,514 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=10126.666666666666, ans=0.024472222222222225 2023-10-09 12:26:31,583 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=10126.666666666666, ans=0.125 2023-10-09 12:26:39,512 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=10173.333333333334, ans=0.024277777777777773 2023-10-09 12:26:48,116 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=10220.0, ans=0.125 2023-10-09 12:26:52,770 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.59 vs. limit=11.3325 2023-10-09 12:26:56,426 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-09 12:27:02,866 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-09 12:27:04,359 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.993e+02 3.243e+02 4.017e+02 4.955e+02 1.108e+03, threshold=8.035e+02, percent-clipped=0.0 2023-10-09 12:27:04,611 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=10266.666666666666, ans=0.025 2023-10-09 12:27:04,652 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=10266.666666666666, ans=0.19733333333333333 2023-10-09 12:27:07,916 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=10266.666666666666, ans=0.19733333333333333 2023-10-09 12:27:20,092 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=10360.0, ans=0.5374000000000001 2023-10-09 12:27:55,859 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=10500.0, ans=0.5325000000000001 2023-10-09 12:28:12,397 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=10546.666666666666, ans=0.125 2023-10-09 12:28:12,499 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=10546.666666666666, ans=0.022722222222222227 2023-10-09 12:28:17,884 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=10593.333333333334, ans=0.19406666666666667 2023-10-09 12:28:24,867 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=16.18 vs. 
limit=15.48 2023-10-09 12:28:51,379 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.009e+02 2.870e+02 3.696e+02 5.862e+02 1.279e+03, threshold=7.393e+02, percent-clipped=10.0 2023-10-09 12:29:07,481 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=10826.666666666666, ans=0.02155555555555556 2023-10-09 12:29:10,061 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=10826.666666666666, ans=0.125 2023-10-09 12:29:32,382 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.39 vs. limit=15.69 2023-10-09 12:29:39,220 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.09 vs. limit=11.594999999999999 2023-10-09 12:30:17,823 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.78 vs. limit=11.665 2023-10-09 12:30:23,062 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.83 vs. limit=11.665 2023-10-09 12:30:44,796 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.190e+02 3.492e+02 4.927e+02 6.350e+02 1.075e+03, threshold=9.853e+02, percent-clipped=18.0 2023-10-09 12:30:49,718 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.98 vs. limit=11.717500000000001 2023-10-09 12:30:54,189 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=11246.666666666666, ans=0.125 2023-10-09 12:31:07,943 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.34 vs. limit=4.694 2023-10-09 12:31:39,439 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=11433.333333333334, ans=0.125 2023-10-09 12:31:50,392 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=11480.0, ans=0.125 2023-10-09 12:32:00,971 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=11526.666666666666, ans=0.125 2023-10-09 12:32:04,940 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=11526.666666666666, ans=0.125 2023-10-09 12:32:07,616 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.59 vs. limit=11.8225 2023-10-09 12:32:22,533 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.41 vs. 
limit=16.18 2023-10-09 12:32:24,958 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=11620.0, ans=0.125 2023-10-09 12:32:27,879 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=11620.0, ans=0.125 2023-10-09 12:32:33,893 INFO [train.py:1031] (3/4) Epoch 1, batch 2500, loss[loss=0.4768, simple_loss=0.4795, pruned_loss=0.2371, over 16577.00 frames. ], tot_loss[loss=0.6012, simple_loss=0.5715, pruned_loss=0.3737, over 23459418.90 frames. ], batch size: 241, lr: 4.38e-02, grad_scale: 32.0 2023-10-09 12:32:40,239 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.945e+02 3.216e+02 4.121e+02 4.962e+02 1.193e+03, threshold=8.242e+02, percent-clipped=2.0 2023-10-09 12:32:44,393 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.87 vs. limit=16.25 2023-10-09 12:32:49,870 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=11713.333333333334, ans=0.18286666666666668 2023-10-09 12:32:55,024 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=11713.333333333334, ans=0.125 2023-10-09 12:33:02,716 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=11760.0, ans=0.01766666666666667 2023-10-09 12:33:23,324 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.93 vs. limit=10.926666666666666 2023-10-09 12:33:32,769 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.85 vs. limit=4.785 2023-10-09 12:33:33,237 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=11900.0, ans=0.181 2023-10-09 12:33:43,750 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=11946.666666666666, ans=0.125 2023-10-09 12:34:02,233 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.61 vs. 
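
The "tot_loss ... over N frames" statistics are not plain cumulative sums: the frame counts grow by smaller and smaller increments (17329082.55 -> 20772306.45 -> 23459418.90 at batches 1500/2000/2500) even though per-batch sizes stay similar. That is consistent with running totals decayed by (1 - 1/2000) each batch before the new batch is added, since successive 500-batch increments shrink by almost exactly (1 - 1/2000)**500. The decay constant here is inferred from the numbers, not read from train.py:

    def update_running(tot_loss_sum, tot_frames, batch_loss_sum, batch_frames,
                       decay=1.0 - 1.0 / 2000.0):
        # Exponentially decayed running totals; the tot_loss printed in the
        # log would then be tot_loss_sum / tot_frames.
        return (tot_loss_sum * decay + batch_loss_sum,
                tot_frames * decay + batch_frames)

    # Check the decay against the logged frame counts (approximate, since
    # per-batch frame counts fluctuate):
    inc1 = 20772306.45 - 17329082.55
    inc2 = 23459418.90 - 20772306.45
    assert abs(inc2 / inc1 - (1.0 - 1.0 / 2000.0) ** 500) < 5e-3
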
limit=12.015 2023-10-09 12:34:07,008 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=12040.0, ans=0.00825217391304348 2023-10-09 12:34:14,930 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=12086.666666666666, ans=0.125 2023-10-09 12:34:30,840 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.127e+02 2.833e+02 3.824e+02 4.828e+02 1.178e+03, threshold=7.649e+02, percent-clipped=4.0 2023-10-09 12:34:38,312 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=12180.0, ans=0.01591666666666667 2023-10-09 12:34:53,790 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=12226.666666666666, ans=0.125 2023-10-09 12:35:11,244 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=12320.0, ans=0.008191304347826087 2023-10-09 12:35:19,491 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=12320.0, ans=0.0 2023-10-09 12:35:21,663 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=12366.666666666666, ans=0.035 2023-10-09 12:35:29,849 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=12366.666666666666, ans=0.125 2023-10-09 12:35:36,748 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=12413.333333333334, ans=0.0 2023-10-09 12:35:38,706 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=12413.333333333334, ans=0.125 2023-10-09 12:35:49,771 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=12460.0, ans=0.46390000000000003 2023-10-09 12:36:01,623 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=12506.666666666666, ans=0.014555555555555558 2023-10-09 12:36:24,747 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=12600.0, ans=0.014166666666666668 2023-10-09 12:36:32,113 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.993e+02 2.893e+02 3.711e+02 4.618e+02 8.918e+02, threshold=7.422e+02, percent-clipped=4.0 2023-10-09 12:36:44,006 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.78 vs. limit=16.985 2023-10-09 12:36:47,235 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=12646.666666666666, ans=0.125 2023-10-09 12:37:01,368 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.81 vs. limit=17.055 2023-10-09 12:37:02,246 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.46 vs. 
limit=4.911 2023-10-09 12:37:02,937 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=12740.0, ans=0.013583333333333336 2023-10-09 12:37:05,877 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=12740.0, ans=0.125 2023-10-09 12:37:07,717 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=16.78 vs. limit=17.055 2023-10-09 12:37:22,439 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.70 vs. limit=12.295 2023-10-09 12:37:26,361 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=12833.333333333334, ans=0.125 2023-10-09 12:37:40,021 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=12880.0, ans=0.125 2023-10-09 12:37:55,912 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=12926.666666666666, ans=0.008059420289855073 2023-10-09 12:38:21,705 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.82 vs. limit=12.3825 2023-10-09 12:38:28,773 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.43 vs. limit=8.266666666666666 2023-10-09 12:38:29,193 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=13066.666666666666, ans=0.4426666666666667 2023-10-09 12:38:31,014 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.875e+02 2.848e+02 3.625e+02 4.845e+02 1.076e+03, threshold=7.249e+02, percent-clipped=7.0 2023-10-09 12:38:37,312 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=13113.333333333334, ans=0.0 2023-10-09 12:38:48,531 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=13113.333333333334, ans=0.125 2023-10-09 12:39:11,147 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.31 vs. limit=17.405 2023-10-09 12:39:14,656 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=13206.666666666666, ans=0.16793333333333335 2023-10-09 12:39:22,680 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=13253.333333333334, ans=0.125 2023-10-09 12:39:28,900 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=13253.333333333334, ans=0.16746666666666665 2023-10-09 12:39:36,922 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.42 vs. limit=11.65 2023-10-09 12:39:50,718 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.93 vs. 
limit=17.509999999999998 2023-10-09 12:40:15,886 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=13440.0, ans=0.010666666666666672 2023-10-09 12:40:41,340 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.080e+02 2.784e+02 3.175e+02 4.080e+02 9.135e+02, threshold=6.351e+02, percent-clipped=2.0 2023-10-09 12:40:41,641 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=13533.333333333334, ans=0.007927536231884058 2023-10-09 12:40:47,897 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=13580.0, ans=0.007917391304347826 2023-10-09 12:41:00,639 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=13626.666666666666, ans=0.125 2023-10-09 12:41:01,703 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=13626.666666666666, ans=0.125 2023-10-09 12:41:22,534 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=13720.0, ans=0.009500000000000001 2023-10-09 12:41:26,342 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=13720.0, ans=0.009500000000000001 2023-10-09 12:41:27,554 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=13720.0, ans=0.125 2023-10-09 12:41:29,657 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=13720.0, ans=0.0 2023-10-09 12:41:30,818 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.34 vs. limit=11.86 2023-10-09 12:41:42,798 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.23 vs. limit=17.825 2023-10-09 12:41:49,394 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=13813.333333333334, ans=0.4165333333333333 2023-10-09 12:41:51,738 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=13813.333333333334, ans=0.125 2023-10-09 12:42:22,859 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=13953.333333333334, ans=0.008527777777777773 2023-10-09 12:42:28,770 INFO [train.py:1031] (3/4) Epoch 1, batch 3000, loss[loss=0.4, simple_loss=0.3978, pruned_loss=0.2011, over 12682.00 frames. ], tot_loss[loss=0.548, simple_loss=0.5346, pruned_loss=0.3222, over 25541751.41 frames. 
], batch size: 440, lr: 4.34e-02, grad_scale: 32.0 2023-10-09 12:42:29,158 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=14000.0, ans=0.125 2023-10-09 12:42:35,040 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.825e+02 2.881e+02 3.551e+02 4.499e+02 9.501e+02, threshold=7.102e+02, percent-clipped=6.0 2023-10-09 12:42:45,577 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=14046.666666666666, ans=0.05 2023-10-09 12:43:05,534 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=14140.0, ans=0.4051 2023-10-09 12:43:06,457 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=14140.0, ans=0.00775 2023-10-09 12:43:14,650 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=14186.666666666666, ans=0.007555555555555558 2023-10-09 12:43:21,956 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.66 vs. limit=12.82 2023-10-09 12:43:39,069 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=14233.333333333334, ans=0.125 2023-10-09 12:43:54,590 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.77 vs. limit=5.149 2023-10-09 12:43:55,705 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=14326.666666666666, ans=0.125 2023-10-09 12:44:30,077 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=14466.666666666666, ans=10.0 2023-10-09 12:44:35,034 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=14466.666666666666, ans=0.125 2023-10-09 12:44:35,703 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.108e+02 2.862e+02 3.437e+02 4.321e+02 9.893e+02, threshold=6.875e+02, percent-clipped=7.0 2023-10-09 12:44:57,387 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=14560.0, ans=0.4184 2023-10-09 12:45:05,626 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=14560.0, ans=0.05 2023-10-09 12:45:13,287 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=14606.666666666666, ans=0.125 2023-10-09 12:45:23,074 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.25 vs. 
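
The learning rate printed with each batch summary decays smoothly within the epoch (4.46e-02 at batch 1500 down to 4.34e-02 at batch 3000). These values are reproduced by the batch-dependent factor of an Eden-style schedule with base_lr=0.045 and lr_batches=7500; those constants, and the omission of Eden's epoch-dependent factor (which is close to 1 this early), are assumptions that happen to fit the logged numbers:

    def eden_lr(batch, base_lr=0.045, lr_batches=7500.0):
        # Batch-dependent part of an Eden-style schedule; the epoch factor
        # is ~1.0 this early in epoch 1 and is omitted here.
        return base_lr * ((batch ** 2 + lr_batches ** 2)
                          / lr_batches ** 2) ** -0.25

    for batch, logged in [(1500, 4.46e-2), (2000, 4.42e-2),
                          (2500, 4.38e-2), (3000, 4.34e-2)]:
        assert abs(eden_lr(batch) - logged) < 5e-5
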
limit=8.663333333333334 2023-10-09 12:45:27,276 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=14653.333333333334, ans=0.005611111111111108 2023-10-09 12:45:29,192 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=14653.333333333334, ans=0.005611111111111108 2023-10-09 12:45:32,324 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=14700.0, ans=0.125 2023-10-09 12:45:58,735 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_positive, batch_count=14793.333333333334, ans=0.05 2023-10-09 12:45:59,538 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=14793.333333333334, ans=0.005027777777777777 2023-10-09 12:46:07,690 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=14840.0, ans=0.004833333333333335 2023-10-09 12:46:15,508 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=14840.0, ans=0.035 2023-10-09 12:46:16,419 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=14886.666666666666, ans=0.15113333333333334 2023-10-09 12:46:28,339 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.26 vs. limit=5.24 2023-10-09 12:46:31,319 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=14933.333333333334, ans=0.004444444444444438 2023-10-09 12:46:35,249 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.879e+02 2.754e+02 3.405e+02 4.364e+02 7.679e+02, threshold=6.811e+02, percent-clipped=2.0 2023-10-09 12:47:01,326 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=15026.666666666666, ans=0.007602898550724638 2023-10-09 12:47:02,636 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=15026.666666666666, ans=0.004055555555555555 2023-10-09 12:47:06,574 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=15026.666666666666, ans=0.125 2023-10-09 12:47:07,824 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.43 vs. 
limit=18.77 2023-10-09 12:47:16,600 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=15073.333333333334, ans=0.14926666666666666 2023-10-09 12:47:17,646 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=15073.333333333334, ans=0.125 2023-10-09 12:47:18,595 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=15073.333333333334, ans=0.125 2023-10-09 12:47:50,721 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=15213.333333333334, ans=0.125 2023-10-09 12:47:57,209 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.79 vs. limit=5.282 2023-10-09 12:48:07,037 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=15260.0, ans=0.125 2023-10-09 12:48:11,032 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=15260.0, ans=0.0030833333333333338 2023-10-09 12:48:28,027 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.40 vs. limit=13.2575 2023-10-09 12:48:32,567 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=15353.333333333334, ans=0.14646666666666666 2023-10-09 12:48:41,520 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.108e+02 2.718e+02 3.206e+02 4.092e+02 6.646e+02, threshold=6.413e+02, percent-clipped=0.0 2023-10-09 12:48:42,111 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.81 vs. 
limit=19.05 2023-10-09 12:48:52,489 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=15446.666666666666, ans=0.14553333333333335 2023-10-09 12:48:55,953 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=15446.666666666666, ans=0.0023055555555555537 2023-10-09 12:49:09,330 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=15540.0, ans=0.0019166666666666707 2023-10-09 12:49:23,460 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=15586.666666666666, ans=0.4338 2023-10-09 12:49:30,973 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=15633.333333333334, ans=0.35283333333333333 2023-10-09 12:49:43,101 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=15680.0, ans=0.0013333333333333391 2023-10-09 12:49:43,940 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=15680.0, ans=0.125 2023-10-09 12:49:58,539 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=15726.666666666666, ans=0.125 2023-10-09 12:50:01,024 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=15726.666666666666, ans=0.00745072463768116 2023-10-09 12:50:11,086 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=15773.333333333334, ans=0.3479333333333333 2023-10-09 12:50:19,837 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=15773.333333333334, ans=0.007440579710144927 2023-10-09 12:50:26,561 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=15820.0, ans=0.1418 2023-10-09 12:50:29,488 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.87 vs. limit=13.432500000000001 2023-10-09 12:50:35,880 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.70 vs. 
limit=19.4 2023-10-09 12:50:38,224 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.838e+02 2.754e+02 3.493e+02 4.424e+02 8.379e+02, threshold=6.986e+02, percent-clipped=8.0 2023-10-09 12:50:40,080 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=15866.666666666666, ans=0.125 2023-10-09 12:51:42,818 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=16146.666666666666, ans=0.13853333333333334 2023-10-09 12:51:44,624 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=16146.666666666666, ans=0.13853333333333334 2023-10-09 12:51:45,645 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=16146.666666666666, ans=0.0 2023-10-09 12:51:59,605 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=16193.333333333334, ans=0.0 2023-10-09 12:52:20,191 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten.whitening_limit, batch_count=16286.666666666666, ans=19.715 2023-10-09 12:52:32,076 INFO [train.py:1031] (3/4) Epoch 1, batch 3500, loss[loss=0.3944, simple_loss=0.432, pruned_loss=0.1784, over 16231.00 frames. ], tot_loss[loss=0.5099, simple_loss=0.5088, pruned_loss=0.286, over 27179559.68 frames. ], batch size: 50, lr: 4.28e-02, grad_scale: 64.0 2023-10-09 12:52:36,995 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.920e+02 2.762e+02 3.443e+02 4.370e+02 9.215e+02, threshold=6.885e+02, percent-clipped=8.0 2023-10-09 12:52:41,651 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=16333.333333333334, ans=0.32833333333333337 2023-10-09 12:52:48,294 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=16380.0, ans=0.125 2023-10-09 12:52:49,333 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=16380.0, ans=0.3267 2023-10-09 12:52:49,460 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.32 vs. limit=9.094999999999999 2023-10-09 12:52:54,224 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=16380.0, ans=0.13620000000000002 2023-10-09 12:53:06,373 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.83 vs. limit=13.677499999999998 2023-10-09 12:53:14,304 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=16473.333333333332, ans=0.04949747468305833 2023-10-09 12:53:30,074 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=16566.666666666668, ans=0.3201666666666667 2023-10-09 12:53:31,047 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.84 vs. 
limit=13.7125 2023-10-09 12:53:33,167 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=16566.666666666668, ans=0.007268115942028985 2023-10-09 12:53:41,651 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=16613.333333333332, ans=0.125 2023-10-09 12:54:09,178 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.75 vs. limit=5.4990000000000006 2023-10-09 12:54:17,031 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=16706.666666666668, ans=0.0 2023-10-09 12:54:23,378 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.76 vs. limit=20.064999999999998 2023-10-09 12:54:33,328 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=16800.0, ans=0.0 2023-10-09 12:54:41,212 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.710e+02 2.550e+02 3.099e+02 3.752e+02 6.234e+02, threshold=6.198e+02, percent-clipped=0.0 2023-10-09 12:54:43,369 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=16800.0, ans=0.125 2023-10-09 12:55:00,328 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=16893.333333333332, ans=0.04949747468305833 2023-10-09 12:55:15,101 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=16940.0, ans=0.30710000000000004 2023-10-09 12:55:56,844 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=17080.0, ans=0.125 2023-10-09 12:56:04,689 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=17126.666666666668, ans=0.0 2023-10-09 12:56:19,450 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=17173.333333333332, ans=0.125 2023-10-09 12:56:39,628 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=17266.666666666668, ans=0.12733333333333333 2023-10-09 12:56:40,229 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.814e+02 2.558e+02 3.187e+02 4.054e+02 9.096e+02, threshold=6.373e+02, percent-clipped=3.0 2023-10-09 12:56:47,840 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.13 vs. limit=20.485 2023-10-09 12:56:52,658 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.24 vs. limit=13.9925 2023-10-09 12:57:17,210 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=18.91 vs. 
limit=20.555 2023-10-09 12:57:23,421 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=17406.666666666668, ans=0.0 2023-10-09 12:57:37,780 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.86 vs. limit=20.625 2023-10-09 12:57:56,707 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=17546.666666666668, ans=0.1245333333333333 2023-10-09 12:58:00,123 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=17546.666666666668, ans=0.1245333333333333 2023-10-09 12:58:03,965 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=17593.333333333332, ans=0.0 2023-10-09 12:58:16,555 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=17593.333333333332, ans=0.0 2023-10-09 12:58:16,587 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=17593.333333333332, ans=0.125 2023-10-09 12:58:29,515 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=17686.666666666668, ans=0.125 2023-10-09 12:58:29,531 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=17686.666666666668, ans=0.007024637681159421 2023-10-09 12:58:33,931 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=17686.666666666668, ans=0.125 2023-10-09 12:58:36,102 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=17686.666666666668, ans=0.0 2023-10-09 12:58:39,381 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=22.68 vs. limit=14.1325 2023-10-09 12:58:42,345 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=17733.333333333332, ans=0.46599999999999997 2023-10-09 12:58:43,511 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=17733.333333333332, ans=0.0 2023-10-09 12:58:48,634 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=17733.333333333332, ans=0.125 2023-10-09 12:58:50,289 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.619e+02 2.714e+02 3.272e+02 4.158e+02 7.795e+02, threshold=6.544e+02, percent-clipped=4.0 2023-10-09 12:59:00,616 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=17780.0, ans=0.125 2023-10-09 12:59:10,149 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=17826.666666666668, ans=0.1217333333333333 2023-10-09 12:59:26,254 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.39 vs. 
limit=20.905 2023-10-09 13:00:07,738 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=18060.0, ans=0.0 2023-10-09 13:00:11,878 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=18060.0, ans=0.006943478260869565 2023-10-09 13:00:15,405 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten.whitening_limit, batch_count=18060.0, ans=14.2725 2023-10-09 13:00:40,173 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=18153.333333333332, ans=0.125 2023-10-09 13:00:50,259 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.940e+02 2.714e+02 3.196e+02 4.125e+02 7.091e+02, threshold=6.391e+02, percent-clipped=5.0 2023-10-09 13:00:58,028 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=18246.666666666668, ans=0.125 2023-10-09 13:00:59,762 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=18246.666666666668, ans=0.11753333333333332 2023-10-09 13:01:03,798 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=18246.666666666668, ans=0.125 2023-10-09 13:01:43,313 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.27 vs. limit=14.216666666666665 2023-10-09 13:01:44,859 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=18433.333333333332, ans=0.125 2023-10-09 13:01:47,971 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=18433.333333333332, ans=0.0 2023-10-09 13:01:52,420 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=2.66 vs. limit=14.43 2023-10-09 13:01:52,521 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=17.26 vs. limit=14.43 2023-10-09 13:02:03,228 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=18526.666666666668, ans=0.0 2023-10-09 13:02:11,389 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=18526.666666666668, ans=0.025 2023-10-09 13:02:15,132 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=20.33 vs. limit=21.43 2023-10-09 13:02:36,829 INFO [train.py:1031] (3/4) Epoch 1, batch 4000, loss[loss=0.3809, simple_loss=0.4208, pruned_loss=0.1705, over 15854.00 frames. ], tot_loss[loss=0.4787, simple_loss=0.4876, pruned_loss=0.2575, over 28424728.45 frames. 
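
From batch 2000 onward, every per-batch "loss[...]" entry satisfies loss = 0.5 * simple_loss + pruned_loss to within rounding (e.g. 0.5 * 0.4208 + 0.1705 = 0.3809 for the batch-4000 entry just above). This is consistent with a pruned-transducer objective whose simple (full-summation) term is down-weighted by 0.5 once warm-up ends; the batch-1500 entry does not fit the same relation, suggesting a warm-up-dependent weighting. The relation is an inference from the logged numbers, not read out of train.py:

    # loss = 0.5 * simple_loss + pruned_loss, checked against the logged
    # per-batch entries (batch: (loss, simple_loss, pruned_loss)):
    logged = {
        2000: (0.4749, 0.4853, 0.2322),
        2500: (0.4768, 0.4795, 0.2371),
        3000: (0.4000, 0.3978, 0.2011),
        3500: (0.3944, 0.4320, 0.1784),
        4000: (0.3809, 0.4208, 0.1705),
    }
    for batch, (loss, simple, pruned) in logged.items():
        assert abs(0.5 * simple + pruned - loss) < 1e-4, batch
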
], batch size: 43, lr: 4.23e-02, grad_scale: 32.0 2023-10-09 13:02:42,328 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=18666.666666666668, ans=0.2466666666666667 2023-10-09 13:02:44,600 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.836e+02 2.666e+02 3.323e+02 4.176e+02 7.295e+02, threshold=6.645e+02, percent-clipped=3.0 2023-10-09 13:02:56,054 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=18713.333333333332, ans=0.05 2023-10-09 13:03:20,493 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=18806.666666666668, ans=0.1119333333333333 2023-10-09 13:03:28,061 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=18853.333333333332, ans=0.0 2023-10-09 13:03:28,120 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=18853.333333333332, ans=0.125 2023-10-09 13:03:35,509 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=25.72 vs. limit=21.675 2023-10-09 13:04:04,418 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=18993.333333333332, ans=0.2352333333333334 2023-10-09 13:04:09,077 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=19040.0, ans=0.1096 2023-10-09 13:04:11,963 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=19040.0, ans=0.23360000000000003 2023-10-09 13:04:13,140 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.54 vs. 
limit=14.64 2023-10-09 13:04:21,514 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=19086.666666666668, ans=0.1091333333333333 2023-10-09 13:04:28,323 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=19086.666666666668, ans=0.09899494936611666 2023-10-09 13:04:33,608 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 13:04:36,291 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=19133.333333333332, ans=0.05866666666666667 2023-10-09 13:04:37,106 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.994e+02 2.715e+02 3.196e+02 3.648e+02 6.112e+02, threshold=6.393e+02, percent-clipped=0.0 2023-10-09 13:04:37,398 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=19133.333333333332, ans=0.07 2023-10-09 13:04:41,811 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=19180.0, ans=0.125 2023-10-09 13:04:53,942 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=19226.666666666668, ans=0.09899494936611666 2023-10-09 13:04:55,447 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.85 vs. limit=9.806666666666668 2023-10-09 13:04:57,887 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=19226.666666666668, ans=0.125 2023-10-09 13:05:00,391 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=19226.666666666668, ans=0.006689855072463767 2023-10-09 13:05:05,757 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=19273.333333333332, ans=0.125 2023-10-09 13:05:19,912 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.02 vs. limit=5.898 2023-10-09 13:05:20,564 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-09 13:05:33,637 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=19366.666666666668, ans=0.22216666666666673 2023-10-09 13:05:47,503 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=19413.333333333332, ans=0.125 2023-10-09 13:06:09,622 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=19460.0, ans=0.055400000000000005 2023-10-09 13:06:18,189 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=19506.666666666668, ans=0.006628985507246376 2023-10-09 13:06:24,479 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.85 vs. 
limit=14.753333333333334 2023-10-09 13:06:32,984 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=19553.333333333332, ans=0.125 2023-10-09 13:06:36,471 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=19553.333333333332, ans=0.09899494936611666 2023-10-09 13:06:39,916 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=19600.0, ans=0.10400000000000001 2023-10-09 13:06:41,480 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.38 vs. limit=14.85 2023-10-09 13:06:46,813 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.009e+02 2.515e+02 2.991e+02 3.638e+02 5.967e+02, threshold=5.982e+02, percent-clipped=0.0 2023-10-09 13:06:48,255 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.26 vs. limit=11.84 2023-10-09 13:06:59,596 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.22 vs. limit=14.8675 2023-10-09 13:07:05,712 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=19693.333333333332, ans=0.0 2023-10-09 13:07:19,912 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=19740.0, ans=0.125 2023-10-09 13:08:10,868 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=4.40 vs. limit=14.99 2023-10-09 13:08:11,579 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.35 vs. limit=22.48 2023-10-09 13:08:30,967 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.49 vs. 
limit=22.5 2023-10-09 13:08:35,319 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.034e+02 2.584e+02 3.031e+02 3.766e+02 5.460e+02, threshold=6.061e+02, percent-clipped=0.0 2023-10-09 13:08:41,137 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=20113.333333333332, ans=0.02 2023-10-09 13:09:03,575 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=20206.666666666668, ans=0.125 2023-10-09 13:09:08,415 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=20206.666666666668, ans=0.125 2023-10-09 13:09:24,014 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=20300.0, ans=0.2 2023-10-09 13:09:34,515 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=20346.666666666668, ans=0.125 2023-10-09 13:09:34,518 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=20346.666666666668, ans=0.0 2023-10-09 13:09:36,811 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=20346.666666666668, ans=0.125 2023-10-09 13:09:42,678 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=20346.666666666668, ans=0.2 2023-10-09 13:09:50,733 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=20393.333333333332, ans=0.125 2023-10-09 13:09:55,491 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=20393.333333333332, ans=0.125 2023-10-09 13:10:10,071 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.29 vs. limit=15.0 2023-10-09 13:10:20,670 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=20486.666666666668, ans=0.125 2023-10-09 13:10:21,578 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=20533.333333333332, ans=0.125 2023-10-09 13:10:23,717 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=20533.333333333332, ans=0.125 2023-10-09 13:10:24,628 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=20533.333333333332, ans=0.5 2023-10-09 13:10:28,121 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=20533.333333333332, ans=0.125 2023-10-09 13:10:29,084 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.951e+02 2.759e+02 3.156e+02 3.836e+02 6.218e+02, threshold=6.312e+02, percent-clipped=1.0 2023-10-09 13:10:32,597 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.12 vs. 
limit=15.0 2023-10-09 13:10:38,235 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=20580.0, ans=0.02 2023-10-09 13:10:54,888 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=20626.666666666668, ans=0.0 2023-10-09 13:10:56,430 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.30 vs. limit=10.0 2023-10-09 13:11:06,049 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.59 vs. limit=15.0 2023-10-09 13:11:27,882 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=20766.666666666668, ans=0.125 2023-10-09 13:11:32,312 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=20766.666666666668, ans=0.0 2023-10-09 13:11:42,414 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=20813.333333333332, ans=0.125 2023-10-09 13:11:49,380 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=20813.333333333332, ans=0.1 2023-10-09 13:12:28,334 INFO [train.py:1031] (3/4) Epoch 1, batch 4500, loss[loss=0.3742, simple_loss=0.4239, pruned_loss=0.1622, over 16841.00 frames. ], tot_loss[loss=0.4561, simple_loss=0.4728, pruned_loss=0.2367, over 29388300.38 frames. ], batch size: 87, lr: 4.17e-02, grad_scale: 32.0 2023-10-09 13:12:31,410 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.67 vs. limit=15.0 2023-10-09 13:12:36,643 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.780e+02 2.417e+02 3.022e+02 3.679e+02 7.488e+02, threshold=6.044e+02, percent-clipped=5.0 2023-10-09 13:12:59,033 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=21093.333333333332, ans=0.006284057971014493 2023-10-09 13:13:30,278 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=21233.333333333332, ans=0.125 2023-10-09 13:13:30,621 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.13 vs. 
limit=15.0 2023-10-09 13:13:39,481 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=21280.0, ans=0.2 2023-10-09 13:13:45,752 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=21280.0, ans=0.125 2023-10-09 13:13:51,002 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=21326.666666666668, ans=0.125 2023-10-09 13:13:52,159 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=21326.666666666668, ans=0.0 2023-10-09 13:13:54,855 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=21326.666666666668, ans=0.0 2023-10-09 13:14:11,129 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.54 vs. limit=22.5 2023-10-09 13:14:18,030 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=21420.0, ans=0.0 2023-10-09 13:14:20,381 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=21420.0, ans=0.035 2023-10-09 13:14:28,163 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=21466.666666666668, ans=0.2 2023-10-09 13:14:31,098 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=21466.666666666668, ans=0.125 2023-10-09 13:14:32,433 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.071e+02 2.468e+02 3.069e+02 3.788e+02 6.070e+02, threshold=6.138e+02, percent-clipped=1.0 2023-10-09 13:14:58,569 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=21606.666666666668, ans=0.2 2023-10-09 13:15:01,826 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=21606.666666666668, ans=0.125 2023-10-09 13:15:02,047 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.99 vs. limit=12.0 2023-10-09 13:15:31,645 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.39 vs. 
limit=15.0 2023-10-09 13:15:32,043 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=21746.666666666668, ans=0.0 2023-10-09 13:15:32,183 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=21746.666666666668, ans=0.1 2023-10-09 13:15:33,130 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=21746.666666666668, ans=0.006142028985507246 2023-10-09 13:15:35,781 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=21746.666666666668, ans=0.125 2023-10-09 13:15:47,574 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=21793.333333333332, ans=0.07 2023-10-09 13:16:16,707 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=21933.333333333332, ans=0.0 2023-10-09 13:16:18,802 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=21933.333333333332, ans=0.125 2023-10-09 13:16:24,627 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.812e+02 2.532e+02 2.894e+02 3.340e+02 5.628e+02, threshold=5.787e+02, percent-clipped=0.0 2023-10-09 13:16:45,268 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=22026.666666666668, ans=0.006081159420289855 2023-10-09 13:16:54,282 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=22073.333333333332, ans=0.125 2023-10-09 13:17:06,051 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=22120.0, ans=0.125 2023-10-09 13:17:10,709 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=22166.666666666668, ans=0.025 2023-10-09 13:17:32,928 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=22260.0, ans=0.125 2023-10-09 13:17:41,174 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=22260.0, ans=0.125 2023-10-09 13:18:00,983 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=22353.333333333332, ans=0.0 2023-10-09 13:18:13,784 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.627e+02 2.459e+02 2.800e+02 3.330e+02 5.526e+02, threshold=5.599e+02, percent-clipped=0.0 2023-10-09 13:18:39,790 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=22493.333333333332, ans=0.125 2023-10-09 13:18:56,104 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.90 vs. limit=15.0 2023-10-09 13:18:56,992 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.10 vs. 
limit=15.0 2023-10-09 13:19:09,710 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=22633.333333333332, ans=0.125 2023-10-09 13:19:09,820 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=22633.333333333332, ans=0.04949747468305833 2023-10-09 13:19:13,563 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=22633.333333333332, ans=0.125 2023-10-09 13:19:28,205 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=22726.666666666668, ans=0.125 2023-10-09 13:19:37,961 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.95 vs. limit=22.5 2023-10-09 13:19:44,259 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=22773.333333333332, ans=0.125 2023-10-09 13:20:07,493 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.759e+02 2.530e+02 2.781e+02 3.399e+02 7.715e+02, threshold=5.563e+02, percent-clipped=3.0 2023-10-09 13:20:17,262 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=22913.333333333332, ans=0.0 2023-10-09 13:20:20,415 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=22913.333333333332, ans=0.2 2023-10-09 13:20:37,709 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=23006.666666666668, ans=0.1 2023-10-09 13:20:47,760 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.51 vs. limit=15.0 2023-10-09 13:20:56,455 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.82 vs. limit=10.0 2023-10-09 13:20:57,071 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=23053.333333333332, ans=0.125 2023-10-09 13:21:14,062 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=23146.666666666668, ans=0.0 2023-10-09 13:22:01,749 INFO [train.py:1031] (3/4) Epoch 1, batch 5000, loss[loss=0.3952, simple_loss=0.4331, pruned_loss=0.1786, over 16845.00 frames. ], tot_loss[loss=0.4373, simple_loss=0.4601, pruned_loss=0.2202, over 30129987.56 frames. ], batch size: 188, lr: 4.10e-02, grad_scale: 32.0 2023-10-09 13:22:09,812 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.817e+02 2.559e+02 3.026e+02 3.665e+02 5.958e+02, threshold=6.051e+02, percent-clipped=2.0 2023-10-09 13:22:15,436 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.99 vs. 
limit=6.0 2023-10-09 13:22:58,297 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=23566.666666666668, ans=0.1 2023-10-09 13:23:06,805 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=23566.666666666668, ans=0.025 2023-10-09 13:23:27,644 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=6.02 vs. limit=6.0 2023-10-09 13:23:35,551 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=23706.666666666668, ans=0.0 2023-10-09 13:23:35,673 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.29 vs. limit=15.0 2023-10-09 13:23:37,828 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=18.99 vs. limit=22.5 2023-10-09 13:23:53,091 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.92 vs. limit=22.5 2023-10-09 13:24:02,341 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=23800.0, ans=0.015 2023-10-09 13:24:06,857 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.923e+02 2.522e+02 3.065e+02 3.596e+02 6.594e+02, threshold=6.131e+02, percent-clipped=1.0 2023-10-09 13:24:16,898 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=23846.666666666668, ans=0.0056855072463768115 2023-10-09 13:24:25,652 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.11 vs. 
limit=15.0 2023-10-09 13:24:27,087 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=23893.333333333332, ans=0.125 2023-10-09 13:25:01,097 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=24033.333333333332, ans=0.1 2023-10-09 13:25:01,358 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=24033.333333333332, ans=15.0 2023-10-09 13:25:03,411 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=24033.333333333332, ans=0.2 2023-10-09 13:25:09,227 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=24080.0, ans=0.1 2023-10-09 13:25:28,672 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=24126.666666666668, ans=0.1 2023-10-09 13:26:03,496 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.940e+02 2.584e+02 2.927e+02 3.538e+02 5.245e+02, threshold=5.854e+02, percent-clipped=0.0 2023-10-09 13:26:09,636 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=24313.333333333332, ans=0.05 2023-10-09 13:26:12,899 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.97 vs. limit=15.0 2023-10-09 13:26:15,449 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.42 vs. limit=10.0 2023-10-09 13:26:20,887 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=17.29 vs. limit=15.0 2023-10-09 13:26:32,292 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=24406.666666666668, ans=0.125 2023-10-09 13:26:32,508 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=2.92 vs. 
limit=15.0 2023-10-09 13:26:37,245 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=24406.666666666668, ans=0.125 2023-10-09 13:26:44,915 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=24453.333333333332, ans=10.0 2023-10-09 13:27:01,746 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=24500.0, ans=0.125 2023-10-09 13:27:59,743 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.920e+02 2.391e+02 2.671e+02 3.164e+02 6.020e+02, threshold=5.342e+02, percent-clipped=1.0 2023-10-09 13:28:13,448 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=24780.0, ans=0.0 2023-10-09 13:28:31,102 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=24873.333333333332, ans=0.2 2023-10-09 13:28:32,875 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=24873.333333333332, ans=0.125 2023-10-09 13:28:42,109 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=24873.333333333332, ans=0.0 2023-10-09 13:29:01,994 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=24966.666666666668, ans=0.125 2023-10-09 13:29:15,443 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=25013.333333333332, ans=0.0 2023-10-09 13:29:31,779 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.whiten.whitening_limit, batch_count=25106.666666666668, ans=15.0 2023-10-09 13:29:46,762 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=5.12 vs. limit=5.0 2023-10-09 13:29:49,457 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.85 vs. limit=15.0 2023-10-09 13:29:57,166 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=25200.0, ans=0.125 2023-10-09 13:30:06,102 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.741e+02 2.489e+02 2.857e+02 3.222e+02 5.031e+02, threshold=5.715e+02, percent-clipped=0.0 2023-10-09 13:30:09,443 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.69 vs. limit=15.0 2023-10-09 13:30:10,280 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=25246.666666666668, ans=0.2 2023-10-09 13:30:10,451 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.24 vs. limit=12.0 2023-10-09 13:30:12,385 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=4.44 vs. 
limit=12.0 2023-10-09 13:30:17,850 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=25246.666666666668, ans=0.0 2023-10-09 13:31:07,150 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=25433.333333333332, ans=0.125 2023-10-09 13:31:14,316 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=25480.0, ans=0.005330434782608696 2023-10-09 13:31:26,317 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=25526.666666666668, ans=0.005320289855072464 2023-10-09 13:31:28,056 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=25526.666666666668, ans=0.1 2023-10-09 13:31:29,064 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=25526.666666666668, ans=0.0 2023-10-09 13:31:39,824 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.85 vs. limit=15.0 2023-10-09 13:31:40,482 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=25573.333333333332, ans=0.0 2023-10-09 13:31:42,822 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=25620.0, ans=0.1 2023-10-09 13:31:43,734 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=25620.0, ans=0.1 2023-10-09 13:31:46,564 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.54 vs. limit=15.0 2023-10-09 13:31:50,883 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=25620.0, ans=0.0 2023-10-09 13:31:50,959 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=25620.0, ans=0.1 2023-10-09 13:31:54,426 INFO [train.py:1031] (3/4) Epoch 1, batch 5500, loss[loss=0.4339, simple_loss=0.4578, pruned_loss=0.205, over 16004.00 frames. ], tot_loss[loss=0.4223, simple_loss=0.4501, pruned_loss=0.2071, over 30717926.21 frames. ], batch size: 296, lr: 4.04e-02, grad_scale: 32.0 2023-10-09 13:31:57,353 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.75 vs. limit=15.0 2023-10-09 13:32:01,547 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.846e+02 2.632e+02 2.940e+02 3.489e+02 6.640e+02, threshold=5.880e+02, percent-clipped=1.0 2023-10-09 13:32:10,333 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=25713.333333333332, ans=0.1 2023-10-09 13:32:23,718 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.42 vs. 
limit=15.0 2023-10-09 13:32:30,028 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=25806.666666666668, ans=0.125 2023-10-09 13:32:33,594 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=25806.666666666668, ans=0.1 2023-10-09 13:32:39,304 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=25853.333333333332, ans=0.1 2023-10-09 13:33:22,593 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=26040.0, ans=0.125 2023-10-09 13:33:27,132 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.23 vs. limit=15.0 2023-10-09 13:33:43,956 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=26133.333333333332, ans=0.0 2023-10-09 13:33:47,446 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=26133.333333333332, ans=0.1 2023-10-09 13:33:47,463 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-09 13:33:51,316 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.889e+02 2.435e+02 2.773e+02 3.403e+02 4.938e+02, threshold=5.546e+02, percent-clipped=0.0 2023-10-09 13:33:53,876 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.82 vs. limit=6.0 2023-10-09 13:34:03,651 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_abs, batch_count=26180.0, ans=0.5 2023-10-09 13:34:08,851 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=26226.666666666668, ans=0.125 2023-10-09 13:34:42,152 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=26366.666666666668, ans=0.0 2023-10-09 13:34:42,228 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=26366.666666666668, ans=0.125 2023-10-09 13:34:54,426 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=26413.333333333332, ans=0.1 2023-10-09 13:34:54,440 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=26413.333333333332, ans=0.125 2023-10-09 13:35:07,232 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=26460.0, ans=0.0 2023-10-09 13:35:08,432 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=26460.0, ans=0.1 2023-10-09 13:35:16,122 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.66 vs. 
limit=15.0 2023-10-09 13:35:22,296 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=26506.666666666668, ans=0.04949747468305833 2023-10-09 13:35:25,776 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=26553.333333333332, ans=0.125 2023-10-09 13:35:26,916 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=26553.333333333332, ans=0.0 2023-10-09 13:35:34,376 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=24.17 vs. limit=22.5 2023-10-09 13:35:40,099 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=26600.0, ans=0.0 2023-10-09 13:35:44,628 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.910e+02 2.429e+02 2.842e+02 3.278e+02 4.829e+02, threshold=5.683e+02, percent-clipped=0.0 2023-10-09 13:36:06,719 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.23 vs. limit=15.0 2023-10-09 13:36:07,530 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=26693.333333333332, ans=0.1 2023-10-09 13:36:54,210 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=26880.0, ans=0.0 2023-10-09 13:36:58,815 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=26926.666666666668, ans=0.2 2023-10-09 13:37:03,321 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=26926.666666666668, ans=0.125 2023-10-09 13:37:11,589 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.85 vs. limit=15.0 2023-10-09 13:37:29,198 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=27020.0, ans=0.1 2023-10-09 13:37:41,777 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.006e+02 2.506e+02 2.905e+02 3.814e+02 6.231e+02, threshold=5.809e+02, percent-clipped=4.0 2023-10-09 13:37:42,615 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.45 vs. limit=22.5 2023-10-09 13:37:43,535 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=27066.666666666668, ans=0.125 2023-10-09 13:38:05,733 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.38 vs. 
limit=22.5 2023-10-09 13:38:10,130 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=27206.666666666668, ans=0.2 2023-10-09 13:38:12,084 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=27206.666666666668, ans=0.1 2023-10-09 13:38:34,299 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=27300.0, ans=0.125 2023-10-09 13:38:35,420 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.13 vs. limit=22.5 2023-10-09 13:38:55,496 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=27393.333333333332, ans=0.125 2023-10-09 13:39:04,665 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=27440.0, ans=0.1 2023-10-09 13:39:11,064 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff2.min_abs, batch_count=27440.0, ans=0.1 2023-10-09 13:39:30,366 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=27533.333333333332, ans=0.125 2023-10-09 13:39:32,473 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=27533.333333333332, ans=0.125 2023-10-09 13:39:33,571 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.967e+02 2.674e+02 3.191e+02 3.498e+02 4.882e+02, threshold=6.383e+02, percent-clipped=0.0 2023-10-09 13:39:37,637 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 13:39:52,573 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=27626.666666666668, ans=0.2 2023-10-09 13:39:54,359 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=27626.666666666668, ans=0.125 2023-10-09 13:40:07,971 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_na.min_abs, batch_count=27673.333333333332, ans=0.02 2023-10-09 13:40:08,094 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.23 vs. limit=12.0 2023-10-09 13:40:14,482 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=27720.0, ans=0.1 2023-10-09 13:40:33,097 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=27766.666666666668, ans=0.004833333333333333 2023-10-09 13:40:57,741 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.97 vs. limit=15.0 2023-10-09 13:41:18,227 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.94 vs. limit=15.0 2023-10-09 13:41:21,081 INFO [train.py:1031] (3/4) Epoch 1, batch 6000, loss[loss=0.3667, simple_loss=0.4082, pruned_loss=0.1626, over 15943.00 frames. ], tot_loss[loss=0.4108, simple_loss=0.4425, pruned_loss=0.1972, over 31177124.64 frames. 
], batch size: 43, lr: 3.98e-02, grad_scale: 32.0 2023-10-09 13:41:21,671 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=28000.0, ans=0.125 2023-10-09 13:41:27,473 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=28000.0, ans=0.0 2023-10-09 13:41:27,477 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=28000.0, ans=0.125 2023-10-09 13:41:28,909 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.863e+02 2.439e+02 2.949e+02 3.440e+02 5.726e+02, threshold=5.899e+02, percent-clipped=0.0 2023-10-09 13:41:31,166 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=28046.666666666668, ans=0.0 2023-10-09 13:41:53,000 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=28093.333333333332, ans=0.2 2023-10-09 13:41:54,442 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.98 vs. limit=12.0 2023-10-09 13:41:58,387 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=28140.0, ans=0.004752173913043479 2023-10-09 13:42:06,070 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=28186.666666666668, ans=0.125 2023-10-09 13:42:20,680 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=28233.333333333332, ans=0.0 2023-10-09 13:42:23,405 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 13:42:29,376 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=28280.0, ans=0.1 2023-10-09 13:42:42,156 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=28326.666666666668, ans=0.1 2023-10-09 13:43:20,051 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.786e+02 2.429e+02 2.787e+02 3.328e+02 4.921e+02, threshold=5.573e+02, percent-clipped=0.0 2023-10-09 13:43:20,390 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=28466.666666666668, ans=0.125 2023-10-09 13:43:24,709 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=28513.333333333332, ans=0.125 2023-10-09 13:43:40,008 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=28560.0, ans=0.125 2023-10-09 13:44:17,964 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=28700.0, ans=0.1 2023-10-09 13:44:44,811 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=28793.333333333332, ans=0.125 2023-10-09 13:44:47,354 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=28793.333333333332, ans=0.125 2023-10-09 13:44:59,246 INFO [scaling.py:979] (3/4) Whitening: 
name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=24.71 vs. limit=22.5 2023-10-09 13:45:16,420 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=28933.333333333332, ans=0.004579710144927537 2023-10-09 13:45:18,857 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=28933.333333333332, ans=0.05 2023-10-09 13:45:21,929 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.727e+02 2.517e+02 2.954e+02 3.598e+02 5.663e+02, threshold=5.908e+02, percent-clipped=1.0 2023-10-09 13:45:32,027 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=28980.0, ans=0.125 2023-10-09 13:45:42,840 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.92 vs. limit=22.5 2023-10-09 13:45:51,135 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=29073.333333333332, ans=0.04949747468305833 2023-10-09 13:45:55,844 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=29073.333333333332, ans=0.0 2023-10-09 13:46:17,681 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=29166.666666666668, ans=0.2 2023-10-09 13:46:22,363 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=29213.333333333332, ans=10.0 2023-10-09 13:46:24,893 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=29213.333333333332, ans=0.07 2023-10-09 13:46:26,799 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=29213.333333333332, ans=0.004518840579710145 2023-10-09 13:46:28,995 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 13:46:38,355 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.24 vs. 
limit=15.0 2023-10-09 13:47:15,950 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.872e+02 2.458e+02 2.665e+02 3.145e+02 5.633e+02, threshold=5.329e+02, percent-clipped=0.0 2023-10-09 13:47:19,104 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=29446.666666666668, ans=0.125 2023-10-09 13:47:49,581 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=29540.0, ans=0.125 2023-10-09 13:48:09,997 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=29633.333333333332, ans=0.0 2023-10-09 13:48:15,932 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=29633.333333333332, ans=0.125 2023-10-09 13:48:19,315 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_positive, batch_count=29680.0, ans=0.05 2023-10-09 13:48:39,084 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.86 vs. limit=15.0 2023-10-09 13:48:54,823 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=17.00 vs. limit=15.0 2023-10-09 13:49:01,616 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.93 vs. limit=22.5 2023-10-09 13:49:18,673 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=29866.666666666668, ans=0.125 2023-10-09 13:49:22,190 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.677e+02 2.316e+02 2.629e+02 2.979e+02 6.107e+02, threshold=5.258e+02, percent-clipped=1.0 2023-10-09 13:49:42,627 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.40 vs. limit=22.5 2023-10-09 13:49:49,844 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=29960.0, ans=0.1 2023-10-09 13:49:50,174 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.78 vs. limit=15.0 2023-10-09 13:49:54,345 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=30006.666666666668, ans=0.1 2023-10-09 13:50:04,979 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=30053.333333333332, ans=0.125 2023-10-09 13:50:29,276 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=30146.666666666668, ans=0.125 2023-10-09 13:50:44,658 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.27 vs. 
limit=22.5 2023-10-09 13:50:53,797 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=30240.0, ans=0.0 2023-10-09 13:51:08,150 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=30286.666666666668, ans=0.125 2023-10-09 13:51:13,428 INFO [train.py:1031] (3/4) Epoch 1, batch 6500, loss[loss=0.3721, simple_loss=0.4278, pruned_loss=0.1582, over 16850.00 frames. ], tot_loss[loss=0.4022, simple_loss=0.4372, pruned_loss=0.1895, over 31573477.21 frames. ], batch size: 87, lr: 3.91e-02, grad_scale: 32.0 2023-10-09 13:51:21,833 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.745e+02 2.425e+02 2.754e+02 3.269e+02 4.851e+02, threshold=5.508e+02, percent-clipped=0.0 2023-10-09 13:51:43,359 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.12 vs. limit=15.0 2023-10-09 13:52:04,952 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=30473.333333333332, ans=0.125 2023-10-09 13:52:21,880 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=30566.666666666668, ans=22.5 2023-10-09 13:52:26,764 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.78 vs. limit=15.0 2023-10-09 13:52:34,543 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.06 vs. limit=22.5 2023-10-09 13:52:37,316 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=30613.333333333332, ans=10.0 2023-10-09 13:52:50,596 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=30660.0, ans=0.125 2023-10-09 13:52:52,847 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=30660.0, ans=0.1 2023-10-09 13:52:52,857 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 13:52:53,945 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=30706.666666666668, ans=0.125 2023-10-09 13:53:03,897 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=30706.666666666668, ans=0.0 2023-10-09 13:53:28,085 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.580e+02 2.544e+02 2.969e+02 3.595e+02 5.241e+02, threshold=5.937e+02, percent-clipped=0.0 2023-10-09 13:53:52,801 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=30940.0, ans=0.2 2023-10-09 13:53:58,565 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=30940.0, ans=0.004143478260869566 2023-10-09 13:54:15,657 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=30986.666666666668, ans=0.2 2023-10-09 13:54:41,743 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, 
batch_count=31080.0, ans=0.125 2023-10-09 13:55:01,928 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=31173.333333333332, ans=0.05 2023-10-09 13:55:04,167 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.38 vs. limit=15.0 2023-10-09 13:55:19,633 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=31266.666666666668, ans=0.125 2023-10-09 13:55:23,985 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.903e+02 2.591e+02 2.941e+02 3.296e+02 5.536e+02, threshold=5.882e+02, percent-clipped=0.0 2023-10-09 13:55:31,999 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=2.88 vs. limit=15.0 2023-10-09 13:55:34,841 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=31313.333333333332, ans=0.125 2023-10-09 13:55:36,791 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=31360.0, ans=0.1 2023-10-09 13:56:21,703 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=31500.0, ans=0.0 2023-10-09 13:56:24,212 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 13:56:33,827 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.89 vs. limit=6.0 2023-10-09 13:56:50,929 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=31640.0, ans=0.1 2023-10-09 13:57:03,431 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.72 vs. limit=10.0 2023-10-09 13:57:12,380 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=31686.666666666668, ans=0.125 2023-10-09 13:57:29,742 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.718e+02 2.466e+02 2.912e+02 3.541e+02 6.118e+02, threshold=5.825e+02, percent-clipped=1.0 2023-10-09 13:57:30,388 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.31 vs. limit=12.0 2023-10-09 13:57:50,747 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=31826.666666666668, ans=0.1 2023-10-09 13:57:52,701 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=31826.666666666668, ans=0.1 2023-10-09 13:58:00,724 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.54 vs. 
limit=6.0 2023-10-09 13:58:14,624 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=31873.333333333332, ans=0.003940579710144928 2023-10-09 13:58:33,528 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=31966.666666666668, ans=0.1 2023-10-09 13:58:34,667 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=31966.666666666668, ans=0.125 2023-10-09 13:58:37,428 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=31966.666666666668, ans=0.1 2023-10-09 13:58:39,541 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=31966.666666666668, ans=0.2 2023-10-09 13:59:42,317 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=32200.0, ans=0.0 2023-10-09 13:59:43,983 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.736e+02 2.377e+02 2.677e+02 3.003e+02 4.420e+02, threshold=5.354e+02, percent-clipped=0.0 2023-10-09 13:59:51,905 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=32246.666666666668, ans=0.07 2023-10-09 14:00:58,761 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=32526.666666666668, ans=0.125 2023-10-09 14:01:03,065 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=32526.666666666668, ans=0.125 2023-10-09 14:01:18,629 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.39 vs. limit=22.5 2023-10-09 14:01:25,676 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=32620.0, ans=0.1 2023-10-09 14:01:35,234 INFO [train.py:1031] (3/4) Epoch 1, batch 7000, loss[loss=0.3438, simple_loss=0.4084, pruned_loss=0.1396, over 16894.00 frames. ], tot_loss[loss=0.394, simple_loss=0.432, pruned_loss=0.1825, over 31852905.50 frames. ], batch size: 104, lr: 3.85e-02, grad_scale: 32.0 2023-10-09 14:01:44,828 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.962e+02 2.723e+02 2.964e+02 3.517e+02 5.853e+02, threshold=5.928e+02, percent-clipped=1.0 2023-10-09 14:01:48,362 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=32713.333333333332, ans=0.125 2023-10-09 14:02:03,696 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.47 vs. limit=6.0 2023-10-09 14:02:15,416 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=32806.666666666664, ans=0.125 2023-10-09 14:02:20,339 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.98 vs. 
limit=15.0 2023-10-09 14:03:03,270 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=32993.333333333336, ans=0.0 2023-10-09 14:03:15,090 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=33040.0, ans=0.1 2023-10-09 14:03:34,273 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=33086.666666666664, ans=0.0 2023-10-09 14:03:42,126 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.02 vs. limit=15.0 2023-10-09 14:03:42,394 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.886e+02 2.378e+02 2.771e+02 3.167e+02 5.382e+02, threshold=5.541e+02, percent-clipped=0.0 2023-10-09 14:04:20,181 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=33320.0, ans=0.0 2023-10-09 14:04:25,727 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=33320.0, ans=0.125 2023-10-09 14:04:55,395 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.22 vs. limit=15.0 2023-10-09 14:05:00,292 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=33460.0, ans=0.2 2023-10-09 14:05:02,729 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=33460.0, ans=0.125 2023-10-09 14:05:06,903 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-09 14:05:12,792 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=33506.666666666664, ans=0.003585507246376812 2023-10-09 14:05:18,722 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=33506.666666666664, ans=0.125 2023-10-09 14:05:31,524 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=33553.333333333336, ans=0.125 2023-10-09 14:05:52,061 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.826e+02 2.465e+02 2.887e+02 3.467e+02 4.858e+02, threshold=5.773e+02, percent-clipped=0.0 2023-10-09 14:05:58,298 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=33646.666666666664, ans=0.1 2023-10-09 14:06:15,865 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=33693.333333333336, ans=0.1 2023-10-09 14:06:18,181 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.38 vs. limit=22.5 2023-10-09 14:06:37,349 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=33740.0, ans=0.125 2023-10-09 14:06:45,464 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.92 vs. 
limit=12.0 2023-10-09 14:06:55,188 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=33786.666666666664, ans=0.125 2023-10-09 14:06:57,472 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=33833.333333333336, ans=0.125 2023-10-09 14:07:04,893 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=33833.333333333336, ans=0.0035144927536231874 2023-10-09 14:07:28,857 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=33926.666666666664, ans=0.125 2023-10-09 14:07:40,135 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=33973.333333333336, ans=0.1 2023-10-09 14:07:42,095 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=33973.333333333336, ans=0.2 2023-10-09 14:07:43,977 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.00 vs. limit=15.0 2023-10-09 14:08:06,795 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.964e+02 2.420e+02 2.746e+02 3.159e+02 5.758e+02, threshold=5.493e+02, percent-clipped=0.0 2023-10-09 14:08:07,232 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=34066.666666666664, ans=0.125 2023-10-09 14:08:08,633 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=34066.666666666664, ans=0.0 2023-10-09 14:08:22,894 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=34160.0, ans=0.125 2023-10-09 14:08:34,882 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_na.min_abs, batch_count=34160.0, ans=0.02 2023-10-09 14:08:38,800 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.68 vs. limit=6.0 2023-10-09 14:08:45,171 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.51 vs. limit=15.0 2023-10-09 14:08:54,812 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=34253.333333333336, ans=0.125 2023-10-09 14:08:57,713 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.32 vs. limit=12.0 2023-10-09 14:08:59,862 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.42 vs. limit=22.5 2023-10-09 14:09:14,225 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=34346.666666666664, ans=0.1 2023-10-09 14:09:19,203 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=34346.666666666664, ans=0.0 2023-10-09 14:09:19,517 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.30 vs. 
limit=15.0
2023-10-09 14:09:52,991 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.98 vs. limit=15.0
2023-10-09 14:10:08,196 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.670e+02 2.266e+02 2.571e+02 2.962e+02 5.200e+02, threshold=5.142e+02, percent-clipped=0.0
2023-10-09 14:10:12,223 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=34580.0, ans=0.125
2023-10-09 14:10:24,372 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=34626.666666666664, ans=0.125
2023-10-09 14:10:24,378 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=34626.666666666664, ans=0.035
2023-10-09 14:10:35,508 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=34673.333333333336, ans=0.0
2023-10-09 14:10:45,087 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=34673.333333333336, ans=0.1
2023-10-09 14:11:09,681 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=34813.333333333336, ans=0.95
2023-10-09 14:11:12,463 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=34813.333333333336, ans=0.125
2023-10-09 14:11:14,763 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.31 vs. limit=15.0
2023-10-09 14:11:36,795 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.90 vs. limit=15.0
2023-10-09 14:11:49,646 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=34953.333333333336, ans=0.2
2023-10-09 14:11:52,982 INFO [train.py:1031] (3/4) Epoch 1, batch 7500, loss[loss=0.4287, simple_loss=0.4501, pruned_loss=0.2037, over 16523.00 frames. ], tot_loss[loss=0.3871, simple_loss=0.4271, pruned_loss=0.1771, over 32024703.66 frames. ], batch size: 266, lr: 3.78e-02, grad_scale: 32.0
2023-10-09 14:12:01,562 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.900e+02 2.617e+02 2.954e+02 3.488e+02 4.638e+02, threshold=5.909e+02, percent-clipped=0.0
2023-10-09 14:12:12,494 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.46 vs. limit=22.5
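In the optim.py lines, the five numbers after "grad-norm quartiles" read as (min, 25%, median, 75%, max) over a recent window of gradient norms, and the logged threshold tracks Clipping_scale times the median to within rounding (just above: 2.0 * 2.571e+02 = 5.142e+02), with percent-clipped the share of the window above that threshold. A hedged sketch of that bookkeeping, not the recipe's actual code:

import numpy as np

# Sketch: reconstruct the clipping statistics the way the logged numbers
# suggest (threshold = clipping_scale * median of recent gradient norms).
def clipping_stats(recent_norms, clipping_scale: float = 2.0):
    norms = np.asarray(recent_norms, dtype=float)
    q = np.quantile(norms, [0.0, 0.25, 0.5, 0.75, 1.0])  # the five logged values
    threshold = clipping_scale * q[2]                    # e.g. 2.0 * 2.571e+02
    percent_clipped = 100.0 * np.mean(norms > threshold)
    return q, threshold, percent_clipped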
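The ScheduledFloat lines, meanwhile, dump regularizer constants that vary as piecewise-linear functions of batch_count: skip rates such as conv_skip_rate have decayed to ans=0.0 by this point, most balancer prob values sit at a floor of 0.125, and entries like bypass.skip_rate=0.035 are still mid-schedule. A minimal sketch of such a schedule; the breakpoints below are hypothetical, chosen only to illustrate the mechanism:

# Sketch of a piecewise-linear schedule over batch_count; the
# (batch_count, value) breakpoints here are hypothetical.
def scheduled_float(batch_count: float,
                    schedule=((0.0, 0.5), (20000.0, 0.05), (30000.0, 0.0))):
    x0, y0 = schedule[0]
    if batch_count <= x0:
        return y0
    for x1, y1 in schedule[1:]:
        if batch_count <= x1:
            return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)
        x0, y0 = x1, y1
    return y0  # past the last breakpoint, hold the final value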
2023-10-09 14:12:15,267 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=35093.333333333336, ans=0.0032405797101449276
2023-10-09 14:12:15,361 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=35093.333333333336, ans=0.125
2023-10-09 14:12:40,343 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=35186.666666666664, ans=0.1
2023-10-09 14:13:26,094 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=35373.333333333336, ans=0.125
2023-10-09 14:13:46,463 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=35466.666666666664, ans=0.0
2023-10-09 14:13:51,824 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.793e+02 2.430e+02 2.756e+02 3.175e+02 5.846e+02, threshold=5.511e+02, percent-clipped=0.0
2023-10-09 14:14:00,969 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=35513.333333333336, ans=0.0
2023-10-09 14:14:13,057 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=35560.0, ans=0.1
2023-10-09 14:14:13,077 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=35560.0, ans=0.0031391304347826087
2023-10-09 14:14:32,410 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=35606.666666666664, ans=0.0031289855072463768
2023-10-09 14:14:37,818 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=35653.333333333336, ans=0.125
2023-10-09 14:14:52,246 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.22 vs. limit=15.0
2023-10-09 14:15:03,888 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=35700.0, ans=0.2
2023-10-09 14:15:19,026 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=35793.333333333336, ans=0.125
2023-10-09 14:16:02,305 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=35933.333333333336, ans=0.003057971014492753
2023-10-09 14:16:04,222 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.696e+02 2.407e+02 2.784e+02 3.305e+02 6.434e+02, threshold=5.568e+02, percent-clipped=1.0
2023-10-09 14:16:08,809 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.55 vs. limit=15.0
2023-10-09 14:16:15,512 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.12 vs. limit=22.5
2023-10-09 14:16:18,431 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.50 vs.
limit=15.0 2023-10-09 14:16:47,732 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=36120.0, ans=0.0 2023-10-09 14:16:53,733 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=36166.666666666664, ans=0.125 2023-10-09 14:16:56,424 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=36166.666666666664, ans=0.003007246376811595 2023-10-09 14:16:58,302 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=36166.666666666664, ans=0.0 2023-10-09 14:17:07,232 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=36213.333333333336, ans=0.5 2023-10-09 14:17:15,618 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=36260.0, ans=0.125 2023-10-09 14:17:22,560 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=36260.0, ans=0.125 2023-10-09 14:17:40,362 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=36353.333333333336, ans=0.125 2023-10-09 14:17:47,116 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=36353.333333333336, ans=0.0 2023-10-09 14:17:48,198 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=36353.333333333336, ans=0.125 2023-10-09 14:17:59,923 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.898e+02 2.500e+02 2.917e+02 3.377e+02 6.020e+02, threshold=5.835e+02, percent-clipped=2.0 2023-10-09 14:18:00,101 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=36400.0, ans=0.125 2023-10-09 14:18:05,502 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=36446.666666666664, ans=0.125 2023-10-09 14:18:06,381 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=36446.666666666664, ans=0.0 2023-10-09 14:18:08,093 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=36446.666666666664, ans=0.125 2023-10-09 14:18:17,968 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=36493.333333333336, ans=0.125 2023-10-09 14:18:39,293 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.85 vs. limit=15.0 2023-10-09 14:18:58,212 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=36633.333333333336, ans=0.125 2023-10-09 14:19:10,176 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.63 vs. 
limit=15.0 2023-10-09 14:19:11,950 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.62 vs. limit=15.0 2023-10-09 14:19:14,754 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=36726.666666666664, ans=0.09899494936611666 2023-10-09 14:19:15,709 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.90 vs. limit=10.0 2023-10-09 14:19:38,783 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=36820.0, ans=0.125 2023-10-09 14:19:49,288 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=36820.0, ans=0.2 2023-10-09 14:20:03,103 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.939e+02 2.447e+02 2.750e+02 3.178e+02 4.670e+02, threshold=5.500e+02, percent-clipped=0.0 2023-10-09 14:20:16,677 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.07 vs. limit=10.0 2023-10-09 14:20:53,971 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=37053.333333333336, ans=0.125 2023-10-09 14:21:14,891 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=37146.666666666664, ans=0.5 2023-10-09 14:22:01,645 INFO [train.py:1031] (3/4) Epoch 1, batch 8000, loss[loss=0.4053, simple_loss=0.4341, pruned_loss=0.1883, over 16091.00 frames. ], tot_loss[loss=0.38, simple_loss=0.4223, pruned_loss=0.1716, over 32193893.63 frames. ], batch size: 296, lr: 3.72e-02, grad_scale: 32.0 2023-10-09 14:22:02,126 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.39 vs. limit=15.0 2023-10-09 14:22:03,610 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-09 14:22:09,327 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.686e+02 2.324e+02 2.653e+02 3.381e+02 4.972e+02, threshold=5.305e+02, percent-clipped=0.0 2023-10-09 14:22:16,983 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.37 vs. limit=6.0 2023-10-09 14:22:22,716 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=37426.666666666664, ans=0.125 2023-10-09 14:23:00,595 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.38 vs. 
limit=10.0 2023-10-09 14:23:06,647 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=37566.666666666664, ans=0.0 2023-10-09 14:23:11,287 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=37613.333333333336, ans=0.2 2023-10-09 14:23:15,090 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=37613.333333333336, ans=0.125 2023-10-09 14:23:15,136 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-09 14:23:17,154 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.44 vs. limit=6.0 2023-10-09 14:23:30,638 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=37706.666666666664, ans=0.125 2023-10-09 14:23:30,680 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=37706.666666666664, ans=0.09899494936611666 2023-10-09 14:23:30,887 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.37 vs. limit=10.0 2023-10-09 14:23:34,134 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=37706.666666666664, ans=0.025 2023-10-09 14:23:54,101 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=37800.0, ans=0.125 2023-10-09 14:23:57,202 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=37800.0, ans=0.1 2023-10-09 14:23:57,442 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.03 vs. limit=6.0 2023-10-09 14:24:01,997 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.964e+02 2.589e+02 2.920e+02 3.234e+02 4.423e+02, threshold=5.840e+02, percent-clipped=0.0 2023-10-09 14:24:06,489 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.61 vs. limit=15.0 2023-10-09 14:24:08,181 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.16 vs. 
limit=22.5 2023-10-09 14:24:25,503 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=37940.0, ans=0.0 2023-10-09 14:24:53,877 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=38033.333333333336, ans=0.125 2023-10-09 14:25:05,106 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=38033.333333333336, ans=0.125 2023-10-09 14:25:06,949 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.whiten.whitening_limit, batch_count=38033.333333333336, ans=12.0 2023-10-09 14:25:37,670 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=38126.666666666664, ans=0.1 2023-10-09 14:25:45,858 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=38173.333333333336, ans=0.125 2023-10-09 14:25:53,207 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.31 vs. limit=15.0 2023-10-09 14:26:01,510 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=38220.0, ans=0.1 2023-10-09 14:26:18,329 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.883e+02 2.304e+02 2.730e+02 3.039e+02 3.935e+02, threshold=5.461e+02, percent-clipped=0.0 2023-10-09 14:26:20,410 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff2.min_abs, batch_count=38313.333333333336, ans=0.1 2023-10-09 14:26:23,756 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=38313.333333333336, ans=0.125 2023-10-09 14:26:30,290 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=38313.333333333336, ans=0.0 2023-10-09 14:27:14,019 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=38500.0, ans=0.1 2023-10-09 14:28:10,821 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=38733.333333333336, ans=0.125 2023-10-09 14:28:14,749 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.977e+02 2.320e+02 2.691e+02 3.231e+02 4.759e+02, threshold=5.382e+02, percent-clipped=0.0 2023-10-09 14:28:37,861 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.66 vs. 
limit=15.0 2023-10-09 14:28:45,381 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=38873.333333333336, ans=0.125 2023-10-09 14:28:47,237 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=38873.333333333336, ans=0.125 2023-10-09 14:29:15,239 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 14:29:15,386 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=39013.333333333336, ans=0.1 2023-10-09 14:29:20,727 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.57 vs. limit=15.0 2023-10-09 14:29:30,708 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=39060.0, ans=0.2 2023-10-09 14:29:30,748 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=39060.0, ans=0.002378260869565217 2023-10-09 14:29:44,283 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=39106.666666666664, ans=0.125 2023-10-09 14:29:59,345 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=39153.333333333336, ans=0.125 2023-10-09 14:30:14,521 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.881e+02 2.488e+02 2.819e+02 3.413e+02 5.214e+02, threshold=5.637e+02, percent-clipped=0.0 2023-10-09 14:30:20,805 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=39246.666666666664, ans=0.0 2023-10-09 14:30:38,747 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=39293.333333333336, ans=0.0 2023-10-09 14:30:45,526 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=39340.0, ans=0.002317391304347826 2023-10-09 14:30:47,430 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=39340.0, ans=0.1 2023-10-09 14:30:56,937 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=39386.666666666664, ans=0.1 2023-10-09 14:31:10,024 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=39433.333333333336, ans=0.125 2023-10-09 14:31:28,902 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=39526.666666666664, ans=0.002276811594202899 2023-10-09 14:31:33,013 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=39526.666666666664, ans=0.1 2023-10-09 14:31:41,858 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=39573.333333333336, ans=0.0 2023-10-09 14:31:52,292 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.88 vs. 
limit=10.0 2023-10-09 14:31:56,414 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.50 vs. limit=22.5 2023-10-09 14:32:08,191 INFO [train.py:1031] (3/4) Epoch 1, batch 8500, loss[loss=0.3795, simple_loss=0.4253, pruned_loss=0.1668, over 16858.00 frames. ], tot_loss[loss=0.3752, simple_loss=0.4193, pruned_loss=0.1676, over 32343518.06 frames. ], batch size: 175, lr: 3.66e-02, grad_scale: 32.0 2023-10-09 14:32:17,794 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.984e+02 2.643e+02 2.917e+02 3.410e+02 6.077e+02, threshold=5.834e+02, percent-clipped=2.0 2023-10-09 14:32:42,569 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 14:32:53,297 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.51 vs. limit=15.0 2023-10-09 14:32:54,949 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=39853.333333333336, ans=0.0 2023-10-09 14:33:10,199 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=39900.0, ans=0.05 2023-10-09 14:33:21,921 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=39946.666666666664, ans=0.125 2023-10-09 14:33:30,206 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=39993.333333333336, ans=0.09899494936611666 2023-10-09 14:34:19,281 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=40133.333333333336, ans=0.1 2023-10-09 14:34:22,161 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=40133.333333333336, ans=0.1 2023-10-09 14:34:27,124 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.16 vs. limit=15.0 2023-10-09 14:34:33,188 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.883e+02 2.506e+02 2.837e+02 3.411e+02 4.932e+02, threshold=5.674e+02, percent-clipped=0.0 2023-10-09 14:34:57,658 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.51 vs. limit=10.0 2023-10-09 14:35:11,537 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=40273.333333333336, ans=0.0 2023-10-09 14:35:20,127 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.63 vs. 
limit=15.0 2023-10-09 14:35:20,671 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=40320.0, ans=0.125 2023-10-09 14:35:21,762 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=40320.0, ans=0.125 2023-10-09 14:35:37,228 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=40366.666666666664, ans=0.125 2023-10-09 14:35:44,178 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.93 vs. limit=15.0 2023-10-09 14:36:12,606 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=40506.666666666664, ans=0.125 2023-10-09 14:36:44,938 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.73 vs. limit=10.0 2023-10-09 14:36:45,495 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.628e+02 2.105e+02 2.424e+02 2.837e+02 5.498e+02, threshold=4.848e+02, percent-clipped=0.0 2023-10-09 14:36:57,433 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=40646.666666666664, ans=0.0 2023-10-09 14:36:59,434 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=11.27 vs. limit=15.0 2023-10-09 14:37:18,945 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=40740.0, ans=0.0 2023-10-09 14:37:19,236 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.11 vs. limit=15.0 2023-10-09 14:37:49,811 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.92 vs. limit=15.0 2023-10-09 14:37:51,637 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=40880.0, ans=0.125 2023-10-09 14:37:55,318 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.27 vs. limit=22.5 2023-10-09 14:38:07,420 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=40926.666666666664, ans=0.125 2023-10-09 14:38:15,588 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=40926.666666666664, ans=0.125 2023-10-09 14:38:53,457 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=41066.666666666664, ans=0.0019420289855072471 2023-10-09 14:38:57,980 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.717e+02 2.334e+02 2.630e+02 3.066e+02 4.793e+02, threshold=5.261e+02, percent-clipped=0.0 2023-10-09 14:39:00,967 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.09 vs. 
limit=15.0
2023-10-09 14:39:36,902 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.83 vs. limit=12.0
2023-10-09 14:39:39,584 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=41253.333333333336, ans=0.125
2023-10-09 14:39:42,991 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=41253.333333333336, ans=0.2
2023-10-09 14:40:08,753 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=41393.333333333336, ans=0.125
2023-10-09 14:40:14,910 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=41393.333333333336, ans=0.0
2023-10-09 14:40:35,765 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=41486.666666666664, ans=0.125
2023-10-09 14:40:47,487 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=41533.333333333336, ans=0.0
2023-10-09 14:40:53,556 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=41533.333333333336, ans=0.125
2023-10-09 14:40:56,082 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.885e+02 2.621e+02 2.989e+02 3.508e+02 5.740e+02, threshold=5.978e+02, percent-clipped=1.0
2023-10-09 14:41:29,897 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.98 vs. limit=15.0
2023-10-09 14:41:36,295 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=41720.0, ans=0.125
2023-10-09 14:41:42,680 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=41720.0, ans=0.0
2023-10-09 14:42:24,178 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=41906.666666666664, ans=0.125
2023-10-09 14:42:36,091 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=41953.333333333336, ans=0.125
2023-10-09 14:42:37,974 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=41953.333333333336, ans=0.125
2023-10-09 14:42:40,761 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=41953.333333333336, ans=0.125
2023-10-09 14:42:42,415 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=41953.333333333336, ans=0.125
2023-10-09 14:42:44,025 INFO [train.py:1031] (3/4) Epoch 1, batch 9000, loss[loss=0.3711, simple_loss=0.4268, pruned_loss=0.1577, over 16779.00 frames. ], tot_loss[loss=0.3701, simple_loss=0.4157, pruned_loss=0.1638, over 32457358.94 frames. ], batch size: 188, lr: 3.60e-02, grad_scale: 32.0
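Each Whitening line compares a module's activation covariance against an identity ("white") target: the metric is ~1.0 for perfectly white features and grows as variance concentrates in a few directions, and the associated penalty apparently engages only when the metric exceeds the logged limit (so entries like metric=6.98 vs. limit=15.0 above are merely being watched). One plausible formulation of such a metric, sketched here without any claim to match scaling.py exactly:

import torch

# Sketch (assumed formulation): 1.0 when the covariance is a multiple of
# identity, larger as it departs from whiteness.
def whitening_metric(x: torch.Tensor) -> torch.Tensor:
    x = x.reshape(-1, x.shape[-1]).float()   # (frames, num_channels)
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.t() @ x) / x.shape[0]           # channel covariance
    c = cov.shape[0]
    # By Cauchy-Schwarz this ratio is >= 1, with equality iff cov ~ identity.
    return c * (cov ** 2).sum() / cov.diagonal().sum() ** 2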
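The grad_scale: 32.0 in the batch-9000 summary above is the automatic-mixed-precision loss scale in effect for this fp16 run; the scaler raises or lowers it while probing for the largest scale that avoids inf/nan gradients. A generic torch.cuda.amp step of the kind implied, with placeholder names rather than the recipe's actual code:

import torch

# Sketch of a generic fp16 step; `model`, `optimizer`, `compute_loss` are
# placeholders, not objects from this recipe.
scaler = torch.cuda.amp.GradScaler()

def training_step(model, optimizer, batch, compute_loss):
    with torch.cuda.amp.autocast():
        loss = compute_loss(model, batch)
    optimizer.zero_grad()
    scaler.scale(loss).backward()   # backward on the scaled loss
    scaler.step(optimizer)          # unscales grads; skips the step on inf/nan
    scaler.update()                 # adjusts the scale, e.g. the 32.0 logged here
    return loss.detach()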
2023-10-09 14:42:52,824 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.741e+02 2.488e+02 2.853e+02 3.353e+02 4.545e+02, threshold=5.705e+02, percent-clipped=0.0
2023-10-09 14:43:17,350 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.21 vs. limit=15.0
2023-10-09 14:43:19,309 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.35 vs. limit=15.0
2023-10-09 14:43:20,417 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-09 14:43:22,515 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=42140.0, ans=0.0017086956521739118
2023-10-09 14:43:31,090 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.87 vs. limit=12.0
2023-10-09 14:43:31,743 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=42186.666666666664, ans=0.2
2023-10-09 14:43:36,382 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=42186.666666666664, ans=0.1
2023-10-09 14:43:40,912 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=42233.333333333336, ans=0.0
2023-10-09 14:43:56,792 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=42280.0, ans=0.0
2023-10-09 14:44:07,183 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=42326.666666666664, ans=0.125
2023-10-09 14:44:10,815 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-09 14:44:31,406 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.01 vs. limit=15.0
2023-10-09 14:44:40,478 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.766e+02 2.328e+02 2.634e+02 3.216e+02 4.598e+02, threshold=5.268e+02, percent-clipped=0.0
2023-10-09 14:44:52,205 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=15.15 vs. limit=15.0
2023-10-09 14:45:08,931 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=42606.666666666664, ans=0.0
2023-10-09 14:45:14,497 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=42606.666666666664, ans=0.0016072463768115938
2023-10-09 14:45:19,773 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=42653.333333333336, ans=0.125
2023-10-09 14:45:28,985 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.36 vs. limit=15.0
2023-10-09 14:45:30,726 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.21 vs.
limit=15.0 2023-10-09 14:45:56,767 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=42793.333333333336, ans=0.1 2023-10-09 14:46:05,396 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=42840.0, ans=0.125 2023-10-09 14:46:05,539 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.13 vs. limit=15.0 2023-10-09 14:46:30,983 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.84 vs. limit=22.5 2023-10-09 14:46:32,232 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.754e+02 2.256e+02 2.626e+02 3.016e+02 5.567e+02, threshold=5.252e+02, percent-clipped=1.0 2023-10-09 14:46:34,661 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=42980.0, ans=0.0 2023-10-09 14:46:35,730 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=42980.0, ans=0.1 2023-10-09 14:46:39,203 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=42980.0, ans=0.0 2023-10-09 14:46:41,056 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=42980.0, ans=0.125 2023-10-09 14:47:08,212 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=43073.333333333336, ans=0.0015057971014492758 2023-10-09 14:47:15,714 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=43120.0, ans=0.2 2023-10-09 14:47:26,236 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=43166.666666666664, ans=10.0 2023-10-09 14:47:26,914 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=43166.666666666664, ans=0.125 2023-10-09 14:47:31,951 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=43166.666666666664, ans=0.125 2023-10-09 14:47:36,107 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=43213.333333333336, ans=0.125 2023-10-09 14:47:36,155 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=43213.333333333336, ans=0.125 2023-10-09 14:47:41,907 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=43213.333333333336, ans=0.1 2023-10-09 14:48:07,179 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=43306.666666666664, ans=0.125 2023-10-09 14:48:12,359 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.08 vs. 
limit=15.0 2023-10-09 14:48:13,307 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=43353.333333333336, ans=0.1 2023-10-09 14:48:13,337 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=43353.333333333336, ans=0.0 2023-10-09 14:48:24,730 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=43400.0, ans=0.1 2023-10-09 14:48:28,213 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.875e+02 2.326e+02 2.606e+02 2.984e+02 6.740e+02, threshold=5.211e+02, percent-clipped=2.0 2023-10-09 14:48:40,337 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.00 vs. limit=15.0 2023-10-09 14:48:50,595 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=43493.333333333336, ans=0.125 2023-10-09 14:49:15,349 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=43586.666666666664, ans=0.2 2023-10-09 14:49:20,018 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=43633.333333333336, ans=0.125 2023-10-09 14:49:28,166 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=43633.333333333336, ans=0.015 2023-10-09 14:49:32,314 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=43680.0, ans=0.00137391304347826 2023-10-09 14:49:35,631 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=43680.0, ans=0.1 2023-10-09 14:49:37,275 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=43680.0, ans=0.125 2023-10-09 14:49:57,331 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.98 vs. limit=15.0 2023-10-09 14:50:02,369 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=43773.333333333336, ans=0.04949747468305833 2023-10-09 14:50:04,354 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=43773.333333333336, ans=0.2 2023-10-09 14:50:14,722 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=43820.0, ans=0.1 2023-10-09 14:50:24,533 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=43866.666666666664, ans=0.2 2023-10-09 14:50:30,321 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.98 vs. 
limit=15.0 2023-10-09 14:50:33,050 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.664e+02 2.309e+02 2.707e+02 3.237e+02 5.607e+02, threshold=5.413e+02, percent-clipped=2.0 2023-10-09 14:50:42,235 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=43913.333333333336, ans=0.2 2023-10-09 14:50:50,057 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=43960.0, ans=0.125 2023-10-09 14:50:55,107 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=43960.0, ans=0.05 2023-10-09 14:51:08,249 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=44006.666666666664, ans=0.2 2023-10-09 14:51:08,378 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=44006.666666666664, ans=0.125 2023-10-09 14:51:18,208 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=44053.333333333336, ans=0.125 2023-10-09 14:51:19,114 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=44053.333333333336, ans=0.125 2023-10-09 14:51:28,228 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=44100.0, ans=0.125 2023-10-09 14:51:29,100 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=44100.0, ans=0.001282608695652174 2023-10-09 14:51:43,800 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=44146.666666666664, ans=0.1 2023-10-09 14:51:47,669 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=44193.333333333336, ans=0.05 2023-10-09 14:52:06,385 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=44240.0, ans=0.035 2023-10-09 14:52:11,231 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=44240.0, ans=0.1 2023-10-09 14:52:12,561 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.06 vs. limit=15.0 2023-10-09 14:52:23,940 INFO [train.py:1031] (3/4) Epoch 1, batch 9500, loss[loss=0.3348, simple_loss=0.3997, pruned_loss=0.1349, over 16900.00 frames. ], tot_loss[loss=0.3668, simple_loss=0.4138, pruned_loss=0.1612, over 32523068.10 frames. 
], batch size: 93, lr: 3.54e-02, grad_scale: 32.0 2023-10-09 14:52:30,608 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=44333.333333333336, ans=0.125 2023-10-09 14:52:32,766 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.684e+02 2.425e+02 2.907e+02 3.525e+02 5.224e+02, threshold=5.814e+02, percent-clipped=0.0 2023-10-09 14:52:34,179 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=44380.0, ans=0.0 2023-10-09 14:52:49,645 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=44426.666666666664, ans=0.0 2023-10-09 14:52:56,340 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.05 vs. limit=10.0 2023-10-09 14:53:02,635 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=44473.333333333336, ans=0.125 2023-10-09 14:53:05,810 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=44473.333333333336, ans=0.2 2023-10-09 14:53:51,904 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.15 vs. limit=12.0 2023-10-09 14:54:54,902 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.821e+02 2.395e+02 2.754e+02 3.361e+02 6.650e+02, threshold=5.508e+02, percent-clipped=2.0 2023-10-09 14:54:58,862 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=44846.666666666664, ans=0.2 2023-10-09 14:55:18,374 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=44893.333333333336, ans=0.04949747468305833 2023-10-09 14:55:19,357 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-09 14:55:31,873 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=44940.0, ans=10.0 2023-10-09 14:55:45,725 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=44986.666666666664, ans=0.04949747468305833 2023-10-09 14:56:09,085 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-09 14:56:20,270 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=45126.666666666664, ans=0.1 2023-10-09 14:56:20,391 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=45126.666666666664, ans=0.125 2023-10-09 14:56:22,117 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=45173.333333333336, ans=0.1 2023-10-09 14:56:32,306 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=45220.0, ans=0.0 2023-10-09 14:56:32,665 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=24.79 vs. 
limit=15.0 2023-10-09 14:56:34,817 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=45220.0, ans=0.0 2023-10-09 14:56:37,427 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=45220.0, ans=0.0 2023-10-09 14:56:39,167 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=45220.0, ans=0.025 2023-10-09 14:56:39,301 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.76 vs. limit=15.0 2023-10-09 14:56:47,064 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=45266.666666666664, ans=0.0 2023-10-09 14:56:54,208 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.698e+02 2.264e+02 2.576e+02 2.952e+02 4.704e+02, threshold=5.152e+02, percent-clipped=0.0 2023-10-09 14:57:04,926 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=45313.333333333336, ans=0.125 2023-10-09 14:57:06,297 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=45360.0, ans=0.125 2023-10-09 14:57:12,687 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=45360.0, ans=0.0 2023-10-09 14:57:16,642 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 14:57:36,377 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=45453.333333333336, ans=0.125 2023-10-09 14:57:53,441 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=45546.666666666664, ans=0.125 2023-10-09 14:58:19,154 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.77 vs. limit=22.5 2023-10-09 14:58:34,499 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=45686.666666666664, ans=0.125 2023-10-09 14:58:49,302 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.718e+02 2.431e+02 2.945e+02 3.236e+02 4.641e+02, threshold=5.891e+02, percent-clipped=0.0 2023-10-09 14:58:49,594 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=45780.0, ans=0.2 2023-10-09 14:59:07,611 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=45826.666666666664, ans=0.125 2023-10-09 14:59:21,927 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.41 vs. limit=22.5 2023-10-09 14:59:32,781 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.29 vs. 
limit=15.0 2023-10-09 14:59:36,080 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=45920.0, ans=0.125 2023-10-09 14:59:58,514 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 15:00:00,373 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=46060.0, ans=0.125 2023-10-09 15:00:03,316 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.28 vs. limit=22.5 2023-10-09 15:00:11,037 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.13 vs. limit=22.5 2023-10-09 15:00:47,042 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.876e+02 2.469e+02 2.886e+02 3.459e+02 7.037e+02, threshold=5.772e+02, percent-clipped=1.0 2023-10-09 15:00:54,767 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=46246.666666666664, ans=0.05 2023-10-09 15:01:05,893 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=46293.333333333336, ans=0.5 2023-10-09 15:01:18,838 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=46386.666666666664, ans=0.0007855072463768108 2023-10-09 15:01:25,297 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=46386.666666666664, ans=0.1 2023-10-09 15:01:29,428 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=46433.333333333336, ans=0.125 2023-10-09 15:01:38,469 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.43 vs. limit=15.0 2023-10-09 15:01:44,518 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=46480.0, ans=0.125 2023-10-09 15:01:46,442 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=46480.0, ans=0.125 2023-10-09 15:01:49,547 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=46526.666666666664, ans=0.125 2023-10-09 15:01:49,558 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=46526.666666666664, ans=0.07 2023-10-09 15:02:23,097 INFO [train.py:1031] (3/4) Epoch 1, batch 10000, loss[loss=0.3402, simple_loss=0.3964, pruned_loss=0.142, over 16871.00 frames. ], tot_loss[loss=0.3618, simple_loss=0.4101, pruned_loss=0.1578, over 32607087.65 frames. 
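[Editor's note: in the optim.py entries, the reported threshold tracks Clipping_scale times the middle quartile of recent gradient norms — e.g. above, 2.0 x 2.689e+02 = 5.378e+02 matches threshold=5.378e+02 exactly. A sketch of that reporting scheme follows: keep a window of recent norms, print min/25%/50%/75%/max, and clip to clipping_scale * median. This is an illustrative reconstruction consistent with the logged numbers, not icefall's exact ScaledAdam code; the window size is an assumption.]

import torch

def clip_by_median(grad: torch.Tensor, recent_norms: list[float],
                   clipping_scale: float = 2.0, window: int = 128) -> torch.Tensor:
    norm = grad.norm().item()
    recent_norms.append(norm)
    del recent_norms[:-window]  # retain only the most recent norms (assumed window)

    t = torch.tensor(recent_norms)
    quartiles = [t.quantile(q).item() for q in (0.0, 0.25, 0.5, 0.75, 1.0)]
    threshold = clipping_scale * quartiles[2]  # 2.0 * median, matching the log
    clipped = norm > threshold
    print(f"grad-norm quartiles {quartiles}, threshold={threshold:.4g}, clipped={clipped}")

    # Scale the gradient down to the threshold when it exceeds it.
    return grad * (threshold / norm) if clipped else grad

[The percent-clipped figure in the log is then just the fraction of batches in the reporting interval for which clipped was true.]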
], batch size: 98, lr: 3.49e-02, grad_scale: 32.0 2023-10-09 15:02:28,471 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1.whitening_limit, batch_count=46666.666666666664, ans=10.0 2023-10-09 15:02:34,142 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.815e+02 2.382e+02 2.689e+02 3.468e+02 5.379e+02, threshold=5.378e+02, percent-clipped=0.0 2023-10-09 15:02:40,525 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=46713.333333333336, ans=0.2 2023-10-09 15:02:50,196 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=46760.0, ans=0.125 2023-10-09 15:02:52,563 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=46760.0, ans=0.125 2023-10-09 15:02:55,636 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.64 vs. limit=22.5 2023-10-09 15:02:57,727 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=46806.666666666664, ans=0.125 2023-10-09 15:03:03,875 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=46806.666666666664, ans=0.125 2023-10-09 15:03:12,560 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=46853.333333333336, ans=0.125 2023-10-09 15:03:15,796 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=46853.333333333336, ans=0.2 2023-10-09 15:03:18,832 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.74 vs. limit=10.0 2023-10-09 15:03:36,759 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=2.590e-03 2023-10-09 15:03:40,380 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=46946.666666666664, ans=0.125 2023-10-09 15:04:18,582 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=47086.666666666664, ans=0.2 2023-10-09 15:04:42,098 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.606e+02 2.373e+02 2.713e+02 3.157e+02 6.043e+02, threshold=5.425e+02, percent-clipped=1.0 2023-10-09 15:04:58,993 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.30 vs. limit=15.0 2023-10-09 15:04:59,619 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=47226.666666666664, ans=0.0 2023-10-09 15:05:06,662 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.67 vs. limit=6.0 2023-10-09 15:05:26,095 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.72 vs. 
limit=15.0 2023-10-09 15:06:02,400 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=47460.0, ans=0.125 2023-10-09 15:06:07,589 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=47506.666666666664, ans=0.95 2023-10-09 15:06:25,054 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=47553.333333333336, ans=0.5 2023-10-09 15:06:33,126 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=47600.0, ans=0.125 2023-10-09 15:06:36,580 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=47600.0, ans=0.125 2023-10-09 15:06:39,067 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.26 vs. limit=22.5 2023-10-09 15:06:39,793 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=47600.0, ans=0.125 2023-10-09 15:06:44,326 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.732e+02 2.304e+02 2.627e+02 3.209e+02 6.157e+02, threshold=5.254e+02, percent-clipped=2.0 2023-10-09 15:06:57,243 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.67 vs. limit=6.0 2023-10-09 15:07:16,866 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 15:07:55,800 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=47833.333333333336, ans=0.04949747468305833 2023-10-09 15:07:59,511 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=47833.333333333336, ans=0.125 2023-10-09 15:08:08,263 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=47880.0, ans=0.125 2023-10-09 15:08:16,638 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=47926.666666666664, ans=0.125 2023-10-09 15:08:19,691 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=47926.666666666664, ans=0.1 2023-10-09 15:08:30,540 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=47973.333333333336, ans=0.1 2023-10-09 15:08:33,805 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=47973.333333333336, ans=0.125 2023-10-09 15:08:52,606 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-09 15:09:04,739 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=48066.666666666664, ans=0.125 2023-10-09 15:09:09,016 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=48066.666666666664, ans=0.0 2023-10-09 15:09:09,560 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm 
quartiles 1.752e+02 2.269e+02 2.617e+02 2.997e+02 4.172e+02, threshold=5.235e+02, percent-clipped=0.0 2023-10-09 15:09:10,886 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=48113.333333333336, ans=0.125 2023-10-09 15:09:22,761 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=48160.0, ans=0.0 2023-10-09 15:09:43,427 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=48206.666666666664, ans=0.1 2023-10-09 15:09:43,438 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=48206.666666666664, ans=0.125 2023-10-09 15:10:05,543 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=48300.0, ans=0.125 2023-10-09 15:10:10,060 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.23 vs. limit=22.5 2023-10-09 15:10:16,279 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=48346.666666666664, ans=0.2 2023-10-09 15:10:18,056 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=48346.666666666664, ans=0.125 2023-10-09 15:10:27,641 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=48393.333333333336, ans=0.0 2023-10-09 15:10:33,755 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=48393.333333333336, ans=0.00034927536231883945 2023-10-09 15:10:33,835 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=48393.333333333336, ans=0.125 2023-10-09 15:11:23,712 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=48533.333333333336, ans=15.0 2023-10-09 15:11:24,762 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.781e+02 2.304e+02 2.627e+02 3.071e+02 4.931e+02, threshold=5.255e+02, percent-clipped=0.0 2023-10-09 15:11:31,003 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 15:11:33,371 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.80 vs. 
limit=15.0 2023-10-09 15:12:03,005 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=48720.0, ans=0.2 2023-10-09 15:12:08,986 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=48720.0, ans=0.1 2023-10-09 15:12:34,641 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=48813.333333333336, ans=0.125 2023-10-09 15:12:49,238 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=48906.666666666664, ans=0.0 2023-10-09 15:12:51,305 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=48906.666666666664, ans=0.1 2023-10-09 15:12:56,115 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=48906.666666666664, ans=0.0 2023-10-09 15:12:57,858 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=48906.666666666664, ans=10.0 2023-10-09 15:13:06,135 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=48953.333333333336, ans=0.0002275362318840575 2023-10-09 15:13:07,015 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=48953.333333333336, ans=0.0 2023-10-09 15:13:13,391 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.25 vs. limit=12.0 2023-10-09 15:13:15,192 INFO [train.py:1031] (3/4) Epoch 1, batch 10500, loss[loss=0.3404, simple_loss=0.4034, pruned_loss=0.1387, over 16872.00 frames. ], tot_loss[loss=0.3587, simple_loss=0.4081, pruned_loss=0.1555, over 32643776.44 frames. ], batch size: 98, lr: 3.43e-02, grad_scale: 32.0 2023-10-09 15:13:24,471 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.654e+02 2.384e+02 2.770e+02 3.489e+02 5.638e+02, threshold=5.540e+02, percent-clipped=1.0 2023-10-09 15:13:31,107 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=49046.666666666664, ans=0.125 2023-10-09 15:13:34,089 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=49046.666666666664, ans=0.0 2023-10-09 15:13:34,278 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.76 vs. limit=15.0 2023-10-09 15:14:02,951 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=49140.0, ans=0.2 2023-10-09 15:14:08,204 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=49140.0, ans=0.0 2023-10-09 15:14:09,875 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.94 vs. 
limit=10.0 2023-10-09 15:14:11,062 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=49140.0, ans=0.2 2023-10-09 15:14:11,714 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=49140.0, ans=0.1 2023-10-09 15:14:17,039 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=49186.666666666664, ans=0.1 2023-10-09 15:14:19,503 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=49186.666666666664, ans=0.0 2023-10-09 15:14:25,086 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=49186.666666666664, ans=0.125 2023-10-09 15:14:39,789 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=49280.0, ans=0.125 2023-10-09 15:15:01,079 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=49326.666666666664, ans=0.125 2023-10-09 15:15:01,092 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=49326.666666666664, ans=0.125 2023-10-09 15:15:08,610 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.24 vs. limit=15.0 2023-10-09 15:15:23,696 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=49420.0, ans=0.0 2023-10-09 15:15:35,372 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=49420.0, ans=0.125 2023-10-09 15:15:42,211 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=49466.666666666664, ans=0.0 2023-10-09 15:15:46,345 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.86 vs. 
limit=15.0 2023-10-09 15:15:47,695 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.676e+02 2.388e+02 2.777e+02 3.279e+02 5.433e+02, threshold=5.553e+02, percent-clipped=0.0 2023-10-09 15:15:49,727 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=49513.333333333336, ans=0.07 2023-10-09 15:16:06,366 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=49560.0, ans=0.04949747468305833 2023-10-09 15:16:27,463 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=49653.333333333336, ans=0.2 2023-10-09 15:16:30,270 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=49653.333333333336, ans=0.0 2023-10-09 15:16:46,914 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=49700.0, ans=0.125 2023-10-09 15:17:14,961 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=49793.333333333336, ans=0.125 2023-10-09 15:17:17,075 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=49793.333333333336, ans=0.0 2023-10-09 15:17:17,087 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=49793.333333333336, ans=0.125 2023-10-09 15:17:23,853 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=49840.0, ans=0.125 2023-10-09 15:17:48,733 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=49886.666666666664, ans=0.125 2023-10-09 15:17:56,853 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=49933.333333333336, ans=0.125 2023-10-09 15:18:01,841 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.743e+02 2.307e+02 2.723e+02 3.167e+02 4.341e+02, threshold=5.446e+02, percent-clipped=0.0 2023-10-09 15:18:03,464 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.01 vs. limit=15.0 2023-10-09 15:18:04,298 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=15.60 vs. 
limit=15.0 2023-10-09 15:18:23,764 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=50026.666666666664, ans=0.125 2023-10-09 15:18:42,524 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=50120.0, ans=0.025 2023-10-09 15:18:45,412 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=50120.0, ans=0.1 2023-10-09 15:18:48,271 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=50120.0, ans=0.125 2023-10-09 15:18:54,018 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=50166.666666666664, ans=0.0 2023-10-09 15:19:53,973 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.50 vs. limit=22.5 2023-10-09 15:19:56,148 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=50400.0, ans=0.1 2023-10-09 15:19:59,418 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.040e+02 2.704e+02 2.998e+02 3.579e+02 5.582e+02, threshold=5.996e+02, percent-clipped=1.0 2023-10-09 15:20:07,321 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=50446.666666666664, ans=0.125 2023-10-09 15:20:08,356 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=50446.666666666664, ans=0.125 2023-10-09 15:20:43,269 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=50586.666666666664, ans=0.0 2023-10-09 15:20:43,312 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=50586.666666666664, ans=0.025 2023-10-09 15:20:50,067 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=50633.333333333336, ans=0.125 2023-10-09 15:21:39,074 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=50726.666666666664, ans=0.125 2023-10-09 15:21:59,503 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=50820.0, ans=0.125 2023-10-09 15:22:01,125 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=50820.0, ans=0.0 2023-10-09 15:22:15,048 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.737e+02 2.190e+02 2.491e+02 3.011e+02 5.477e+02, threshold=4.981e+02, percent-clipped=0.0 2023-10-09 15:22:33,218 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=50960.0, ans=0.1 2023-10-09 15:23:35,503 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=51146.666666666664, ans=0.125 2023-10-09 15:23:47,846 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=51193.333333333336, ans=0.5 2023-10-09 
15:24:04,212 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=51240.0, ans=0.0 2023-10-09 15:24:05,407 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=5.07 vs. limit=12.0 2023-10-09 15:24:19,566 INFO [train.py:1031] (3/4) Epoch 1, batch 11000, loss[loss=0.3063, simple_loss=0.363, pruned_loss=0.1248, over 15694.00 frames. ], tot_loss[loss=0.3566, simple_loss=0.4064, pruned_loss=0.154, over 32663021.55 frames. ], batch size: 35, lr: 3.38e-02, grad_scale: 16.0 2023-10-09 15:24:24,204 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.98 vs. limit=6.0 2023-10-09 15:24:30,110 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.63 vs. limit=15.0 2023-10-09 15:24:30,467 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.830e+02 2.405e+02 2.793e+02 3.441e+02 6.201e+02, threshold=5.586e+02, percent-clipped=7.0 2023-10-09 15:24:37,342 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=51380.0, ans=0.125 2023-10-09 15:24:37,749 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.12 vs. limit=15.0 2023-10-09 15:24:55,947 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=51473.333333333336, ans=0.1 2023-10-09 15:25:10,263 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=51520.0, ans=0.0 2023-10-09 15:25:10,331 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=51520.0, ans=0.1 2023-10-09 15:25:22,057 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=51566.666666666664, ans=0.0 2023-10-09 15:25:45,784 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.62 vs. limit=15.0 2023-10-09 15:25:49,622 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.94 vs. limit=10.0 2023-10-09 15:25:57,273 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.00 vs. 
limit=22.5 2023-10-09 15:26:18,069 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=51800.0, ans=0.125 2023-10-09 15:26:18,094 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=51800.0, ans=0.0 2023-10-09 15:26:25,454 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=51800.0, ans=0.1 2023-10-09 15:26:33,916 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.661e+02 2.204e+02 2.697e+02 3.012e+02 5.561e+02, threshold=5.394e+02, percent-clipped=0.0 2023-10-09 15:26:38,741 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=51846.666666666664, ans=0.125 2023-10-09 15:26:39,981 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=51846.666666666664, ans=0.0 2023-10-09 15:27:35,331 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=52033.333333333336, ans=0.125 2023-10-09 15:27:42,501 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=52033.333333333336, ans=0.2 2023-10-09 15:27:51,034 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=52080.0, ans=0.125 2023-10-09 15:27:53,625 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 15:28:55,695 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=52173.333333333336, ans=0.0 2023-10-09 15:29:05,386 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=52173.333333333336, ans=0.125 2023-10-09 15:29:21,266 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=52220.0, ans=0.125 2023-10-09 15:29:37,590 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=52266.666666666664, ans=0.125 2023-10-09 15:29:38,499 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=52313.333333333336, ans=0.125 2023-10-09 15:29:39,191 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.543e+02 2.312e+02 2.746e+02 3.341e+02 5.183e+02, threshold=5.492e+02, percent-clipped=0.0 2023-10-09 15:29:41,400 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=52313.333333333336, ans=0.0 2023-10-09 15:29:42,336 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=52313.333333333336, ans=0.1 2023-10-09 15:30:00,389 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=52406.666666666664, ans=0.2 2023-10-09 15:30:34,150 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=52546.666666666664, ans=0.1 2023-10-09 15:30:37,956 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=52546.666666666664, ans=0.125 2023-10-09 15:30:44,487 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.12 vs. limit=15.0 2023-10-09 15:30:44,914 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 15:30:45,933 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=52593.333333333336, ans=0.125 2023-10-09 15:31:04,778 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=52640.0, ans=0.0 2023-10-09 15:31:09,929 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.whiten.whitening_limit, batch_count=52640.0, ans=12.0 2023-10-09 15:31:18,450 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.04 vs. limit=15.0 2023-10-09 15:31:42,134 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=52686.666666666664, ans=0.0 2023-10-09 15:32:06,039 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.799e+02 2.306e+02 2.674e+02 3.067e+02 4.251e+02, threshold=5.347e+02, percent-clipped=0.0 2023-10-09 15:32:06,441 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=52780.0, ans=0.0 2023-10-09 15:32:09,886 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=52780.0, ans=0.125 2023-10-09 15:32:21,766 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=52826.666666666664, ans=15.0 2023-10-09 15:32:29,200 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=52873.333333333336, ans=0.125 2023-10-09 15:32:29,991 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=52873.333333333336, ans=0.125 2023-10-09 15:32:35,731 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=52873.333333333336, ans=0.125 2023-10-09 15:32:56,738 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=52966.666666666664, ans=0.0 2023-10-09 15:33:08,009 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.42 vs. limit=10.0 2023-10-09 15:33:21,863 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=53060.0, ans=0.125 2023-10-09 15:33:31,296 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=53060.0, ans=0.125 2023-10-09 15:33:38,305 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.84 vs. 
limit=6.0 2023-10-09 15:33:39,038 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=53106.666666666664, ans=0.125 2023-10-09 15:33:41,294 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=53106.666666666664, ans=0.2 2023-10-09 15:34:06,367 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.756e+02 2.296e+02 2.667e+02 3.253e+02 5.370e+02, threshold=5.335e+02, percent-clipped=1.0 2023-10-09 15:34:08,559 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=53246.666666666664, ans=0.125 2023-10-09 15:34:08,646 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=53246.666666666664, ans=0.125 2023-10-09 15:34:53,343 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=53433.333333333336, ans=0.125 2023-10-09 15:34:55,267 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=53433.333333333336, ans=0.2 2023-10-09 15:35:17,019 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.98 vs. limit=15.0 2023-10-09 15:35:19,983 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=53526.666666666664, ans=0.125 2023-10-09 15:35:24,575 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=53526.666666666664, ans=0.125 2023-10-09 15:35:33,641 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=53573.333333333336, ans=0.125 2023-10-09 15:35:51,472 INFO [train.py:1031] (3/4) Epoch 1, batch 11500, loss[loss=0.3748, simple_loss=0.4191, pruned_loss=0.1653, over 16154.00 frames. ], tot_loss[loss=0.353, simple_loss=0.4038, pruned_loss=0.1515, over 32686529.21 frames. ], batch size: 296, lr: 3.33e-02, grad_scale: 32.0 2023-10-09 15:36:02,860 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.55 vs. limit=10.0 2023-10-09 15:36:04,090 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.760e+02 2.420e+02 2.806e+02 3.329e+02 6.853e+02, threshold=5.613e+02, percent-clipped=1.0 2023-10-09 15:36:18,577 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.51 vs. limit=6.0 2023-10-09 15:36:26,472 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=53806.666666666664, ans=0.0 2023-10-09 15:37:10,400 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=53946.666666666664, ans=0.2 2023-10-09 15:37:21,763 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.68 vs. limit=15.0 2023-10-09 15:37:32,754 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.09 vs. 
limit=15.0 2023-10-09 15:37:37,635 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.14 vs. limit=15.0 2023-10-09 15:38:06,447 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.46 vs. limit=15.0 2023-10-09 15:38:06,739 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.641e+02 2.295e+02 2.608e+02 3.083e+02 5.065e+02, threshold=5.215e+02, percent-clipped=0.0 2023-10-09 15:38:06,935 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=54180.0, ans=0.035 2023-10-09 15:38:14,598 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=54180.0, ans=0.125 2023-10-09 15:38:26,751 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=54226.666666666664, ans=0.125 2023-10-09 15:38:39,807 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=54273.333333333336, ans=0.125 2023-10-09 15:39:01,628 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=54366.666666666664, ans=0.125 2023-10-09 15:39:25,698 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.49 vs. limit=12.0 2023-10-09 15:39:26,933 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=54460.0, ans=0.1 2023-10-09 15:39:43,699 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=54553.333333333336, ans=0.0 2023-10-09 15:39:58,806 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.741e+02 2.292e+02 2.658e+02 3.262e+02 6.196e+02, threshold=5.317e+02, percent-clipped=1.0 2023-10-09 15:40:00,514 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.87 vs. limit=15.0 2023-10-09 15:40:03,098 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=54646.666666666664, ans=0.2 2023-10-09 15:40:20,110 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=54693.333333333336, ans=0.0 2023-10-09 15:40:48,373 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=54833.333333333336, ans=0.5 2023-10-09 15:40:57,164 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.30 vs. 
limit=22.5 2023-10-09 15:41:08,001 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.max_abs, batch_count=54880.0, ans=10.0 2023-10-09 15:41:16,763 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=54880.0, ans=0.125 2023-10-09 15:41:42,342 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=55020.0, ans=0.125 2023-10-09 15:41:47,517 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=55020.0, ans=0.05 2023-10-09 15:41:51,289 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=55020.0, ans=0.0 2023-10-09 15:42:08,922 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.785e+02 2.279e+02 2.660e+02 3.339e+02 5.764e+02, threshold=5.319e+02, percent-clipped=2.0 2023-10-09 15:42:45,342 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=55253.333333333336, ans=0.125 2023-10-09 15:43:06,012 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.89 vs. limit=15.0 2023-10-09 15:43:09,639 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=55346.666666666664, ans=0.0 2023-10-09 15:43:16,093 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 15:43:20,344 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=55393.333333333336, ans=0.125 2023-10-09 15:43:39,311 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.73 vs. limit=15.0 2023-10-09 15:43:45,028 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.93 vs. 
limit=22.5 2023-10-09 15:43:46,281 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=55486.666666666664, ans=0.125 2023-10-09 15:43:50,655 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=55486.666666666664, ans=0.1 2023-10-09 15:44:02,350 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=55533.333333333336, ans=0.125 2023-10-09 15:44:11,925 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.912e+02 2.365e+02 2.721e+02 3.221e+02 4.932e+02, threshold=5.442e+02, percent-clipped=0.0 2023-10-09 15:44:23,899 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=55580.0, ans=0.0 2023-10-09 15:44:24,745 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=55580.0, ans=0.0 2023-10-09 15:44:45,826 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=55673.333333333336, ans=0.125 2023-10-09 15:44:57,450 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=55720.0, ans=0.125 2023-10-09 15:45:28,012 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.27 vs. limit=22.5 2023-10-09 15:45:37,285 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=55813.333333333336, ans=0.125 2023-10-09 15:45:46,296 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.44 vs. limit=22.5 2023-10-09 15:45:47,077 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 15:45:48,408 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.47 vs. limit=5.0 2023-10-09 15:45:51,605 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=55906.666666666664, ans=0.125 2023-10-09 15:46:12,806 INFO [train.py:1031] (3/4) Epoch 1, batch 12000, loss[loss=0.3792, simple_loss=0.4289, pruned_loss=0.1648, over 16581.00 frames. ], tot_loss[loss=0.3494, simple_loss=0.4016, pruned_loss=0.149, over 32703932.49 frames. 
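[Editor's note: the "Whitening: ... metric=X vs. limit=Y" entries compare a per-module whiteness measure of the activations against a scheduled limit; a penalty is applied when the metric exceeds the limit. One standard such measure, sketched below under assumed shapes, is the ratio of the mean squared eigenvalue of the channel covariance to the squared mean eigenvalue: it equals 1.0 for perfectly white features and grows as variance concentrates in a few directions. The function is illustrative, not icefall's scaling.py verbatim.]

import torch

def whitening_metric(x: torch.Tensor) -> float:
    """x: (num_frames, num_channels) activations; returns a value >= 1.0."""
    x = x - x.mean(dim=0)                  # zero-mean per channel
    cov = (x.T @ x) / x.shape[0]           # (C, C) channel covariance
    d = cov.shape[0]
    mean_eig = torch.diagonal(cov).mean()  # trace(C)/d = mean eigenvalue
    mean_eig_sq = (cov * cov).sum() / d    # trace(C@C)/d = mean squared eigenvalue
    return (mean_eig_sq / mean_eig.clamp(min=1e-20) ** 2).item()

x = torch.randn(1000, 384)   # roughly white features -> metric modestly above 1.0
print(whitening_metric(x))
x[:, 0] *= 20.0              # one dominant direction -> metric far above the limit
print(whitening_metric(x))

[This explains why freshly initialized layers early in the log showed metrics like 301.24 vs. limit=7.535, while by this point most modules report metrics near or below their limits.]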
], batch size: 241, lr: 3.28e-02, grad_scale: 32.0 2023-10-09 15:46:30,329 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.848e+02 2.448e+02 2.864e+02 3.433e+02 4.684e+02, threshold=5.727e+02, percent-clipped=0.0 2023-10-09 15:46:31,261 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=56046.666666666664, ans=0.125 2023-10-09 15:46:34,997 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 15:46:35,151 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=56046.666666666664, ans=0.0 2023-10-09 15:46:53,787 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=56140.0, ans=0.125 2023-10-09 15:47:16,126 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=56186.666666666664, ans=0.125 2023-10-09 15:47:20,077 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=56233.333333333336, ans=0.0 2023-10-09 15:47:28,315 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=56280.0, ans=0.0 2023-10-09 15:47:48,611 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.35 vs. limit=15.0 2023-10-09 15:47:50,068 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=56326.666666666664, ans=0.0 2023-10-09 15:47:57,884 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.61 vs. limit=22.5 2023-10-09 15:48:21,051 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=56466.666666666664, ans=0.125 2023-10-09 15:48:30,144 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.742e+02 2.328e+02 2.638e+02 3.008e+02 5.387e+02, threshold=5.275e+02, percent-clipped=0.0 2023-10-09 15:48:42,265 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.11 vs. limit=15.0 2023-10-09 15:48:42,863 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=56560.0, ans=0.1 2023-10-09 15:48:45,027 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=56560.0, ans=0.125 2023-10-09 15:48:52,468 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=6.85 vs. limit=15.0 2023-10-09 15:49:10,004 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.59 vs. 
limit=15.0 2023-10-09 15:49:11,640 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=56653.333333333336, ans=0.125 2023-10-09 15:50:27,961 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=56793.333333333336, ans=0.125 2023-10-09 15:50:51,967 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-09 15:51:05,133 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.784e+02 2.337e+02 2.722e+02 3.256e+02 4.867e+02, threshold=5.443e+02, percent-clipped=0.0 2023-10-09 15:51:22,721 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=57026.666666666664, ans=0.0 2023-10-09 15:51:32,402 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=57073.333333333336, ans=0.0 2023-10-09 15:51:33,947 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=57073.333333333336, ans=0.0 2023-10-09 15:51:53,075 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.41 vs. limit=15.0 2023-10-09 15:51:53,683 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=57166.666666666664, ans=0.07 2023-10-09 15:52:18,847 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=57260.0, ans=0.125 2023-10-09 15:52:21,629 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=57260.0, ans=0.125 2023-10-09 15:52:36,735 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=57306.666666666664, ans=0.1 2023-10-09 15:52:36,794 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=57306.666666666664, ans=0.0 2023-10-09 15:52:38,316 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=13.28 vs. limit=15.0 2023-10-09 15:53:01,468 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=57446.666666666664, ans=0.125 2023-10-09 15:53:02,146 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.798e+02 2.420e+02 2.824e+02 3.045e+02 4.744e+02, threshold=5.648e+02, percent-clipped=0.0 2023-10-09 15:53:11,239 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.33 vs. 
limit=15.0 2023-10-09 15:53:15,486 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=57493.333333333336, ans=0.125 2023-10-09 15:53:20,676 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=57493.333333333336, ans=0.125 2023-10-09 15:54:06,442 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=57680.0, ans=0.2 2023-10-09 15:54:08,864 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=57680.0, ans=0.0 2023-10-09 15:54:40,608 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=57820.0, ans=0.2 2023-10-09 15:55:01,428 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.777e+02 2.539e+02 2.866e+02 3.399e+02 5.145e+02, threshold=5.732e+02, percent-clipped=0.0 2023-10-09 15:55:08,764 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=57913.333333333336, ans=0.2 2023-10-09 15:55:21,669 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=58006.666666666664, ans=0.125 2023-10-09 15:55:21,945 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.65 vs. limit=6.0 2023-10-09 15:55:31,579 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.99 vs. limit=15.0 2023-10-09 15:55:43,893 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=58053.333333333336, ans=0.1 2023-10-09 15:56:20,363 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=58193.333333333336, ans=0.0 2023-10-09 15:56:21,418 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=58193.333333333336, ans=0.125 2023-10-09 15:56:46,361 INFO [train.py:1031] (3/4) Epoch 1, batch 12500, loss[loss=0.3479, simple_loss=0.4091, pruned_loss=0.1434, over 16860.00 frames. ], tot_loss[loss=0.3471, simple_loss=0.3998, pruned_loss=0.1475, over 32738090.03 frames. 
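[Editor's note: each train.py:1031 summary pairs a per-batch loss ("loss[... over N frames]") with a running aggregate ("tot_loss[... over ~32.7M frames]"). A sketch of frame-weighted aggregation with a mild decay follows, so that old batches gradually leave the average; the class name and decay constant are assumptions for illustration, not icefall's actual MetricsTracker bookkeeping.]

class FrameWeightedLoss:
    def __init__(self, decay: float = 0.9995) -> None:
        self.decay = decay
        self.loss_sum = 0.0  # decayed sum of (loss * frames)
        self.frames = 0.0    # decayed sum of frames

    def update(self, loss: float, num_frames: float) -> float:
        self.loss_sum = self.loss_sum * self.decay + loss * num_frames
        self.frames = self.frames * self.decay + num_frames
        return self.loss_sum / self.frames  # current tot_loss

tracker = FrameWeightedLoss()
print(tracker.update(0.3479, 16860.0))  # first batch: tot_loss equals the batch loss
print(tracker.update(0.3600, 15000.0))  # later batches: frame-weighted blend

[Weighting by frames rather than by batch keeps long utterances from being under-counted, consistent with the frame totals the log reports alongside each tot_loss.]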
], batch size: 188, lr: 3.23e-02, grad_scale: 32.0 2023-10-09 15:56:57,247 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=58380.0, ans=0.0 2023-10-09 15:56:57,792 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.770e+02 2.346e+02 2.659e+02 3.058e+02 4.471e+02, threshold=5.317e+02, percent-clipped=0.0 2023-10-09 15:57:09,308 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=58426.666666666664, ans=0.125 2023-10-09 15:57:12,681 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=58426.666666666664, ans=0.1 2023-10-09 15:57:18,310 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=58473.333333333336, ans=0.0 2023-10-09 15:57:44,909 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=58566.666666666664, ans=0.125 2023-10-09 15:57:50,634 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=58613.333333333336, ans=0.125 2023-10-09 15:58:02,343 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.31 vs. limit=15.0 2023-10-09 15:58:25,705 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=58753.333333333336, ans=0.0 2023-10-09 15:58:27,136 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=58753.333333333336, ans=0.125 2023-10-09 15:58:49,182 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=58846.666666666664, ans=0.2 2023-10-09 15:58:51,192 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.736e+02 2.294e+02 2.596e+02 2.990e+02 4.793e+02, threshold=5.193e+02, percent-clipped=0.0 2023-10-09 15:59:00,555 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=58893.333333333336, ans=0.125 2023-10-09 15:59:33,266 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=58986.666666666664, ans=15.0 2023-10-09 15:59:35,334 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=58986.666666666664, ans=0.1 2023-10-09 15:59:53,317 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=59080.0, ans=0.0 2023-10-09 16:00:10,185 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=59126.666666666664, ans=0.0 2023-10-09 16:00:11,415 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=59126.666666666664, ans=0.125 2023-10-09 16:00:21,131 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=59126.666666666664, ans=0.0 2023-10-09 16:00:38,589 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=59173.333333333336, 
ans=0.0 2023-10-09 16:01:15,828 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=59266.666666666664, ans=0.1 2023-10-09 16:01:25,552 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.722e+02 2.297e+02 2.794e+02 3.101e+02 4.689e+02, threshold=5.588e+02, percent-clipped=0.0 2023-10-09 16:01:30,756 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.88 vs. limit=10.0 2023-10-09 16:02:12,920 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.14 vs. limit=6.0 2023-10-09 16:02:17,856 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=59500.0, ans=0.1 2023-10-09 16:02:22,517 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.95 vs. limit=15.0 2023-10-09 16:02:23,995 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=59546.666666666664, ans=0.1 2023-10-09 16:02:33,513 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.58 vs. limit=15.0 2023-10-09 16:02:34,429 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.80 vs. limit=15.0 2023-10-09 16:02:37,117 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.88 vs. limit=15.0 2023-10-09 16:02:39,708 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=59593.333333333336, ans=0.1 2023-10-09 16:02:55,251 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=59686.666666666664, ans=0.125 2023-10-09 16:03:19,571 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.819e+02 2.339e+02 2.722e+02 3.043e+02 4.542e+02, threshold=5.444e+02, percent-clipped=0.0 2023-10-09 16:04:02,037 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=59920.0, ans=0.025 2023-10-09 16:04:43,320 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=60106.666666666664, ans=0.0 2023-10-09 16:04:43,389 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=60106.666666666664, ans=0.0 2023-10-09 16:04:57,836 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.81 vs. 
limit=22.5 2023-10-09 16:05:16,824 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.711e+02 2.361e+02 2.827e+02 3.291e+02 4.917e+02, threshold=5.654e+02, percent-clipped=0.0 2023-10-09 16:05:28,174 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=60293.333333333336, ans=0.2 2023-10-09 16:05:30,184 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.92 vs. limit=15.0 2023-10-09 16:05:47,325 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=60386.666666666664, ans=0.125 2023-10-09 16:05:55,341 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=3.90 vs. limit=12.0 2023-10-09 16:05:59,774 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=60433.333333333336, ans=0.125 2023-10-09 16:06:03,339 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=60433.333333333336, ans=0.1 2023-10-09 16:06:18,167 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.91 vs. limit=6.0 2023-10-09 16:06:24,407 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=60526.666666666664, ans=0.1 2023-10-09 16:06:32,890 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=60573.333333333336, ans=0.125 2023-10-09 16:06:55,239 INFO [train.py:1031] (3/4) Epoch 1, batch 13000, loss[loss=0.3254, simple_loss=0.3824, pruned_loss=0.1342, over 16865.00 frames. ], tot_loss[loss=0.3456, simple_loss=0.399, pruned_loss=0.1463, over 32719932.37 frames. ], batch size: 146, lr: 3.18e-02, grad_scale: 32.0 2023-10-09 16:07:22,304 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=60713.333333333336, ans=0.1 2023-10-09 16:07:23,473 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.805e+02 2.507e+02 3.122e+02 3.868e+02 6.282e+02, threshold=6.244e+02, percent-clipped=3.0 2023-10-09 16:07:30,257 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=60713.333333333336, ans=0.125 2023-10-09 16:07:46,282 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=60760.0, ans=0.125 2023-10-09 16:08:01,904 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=60806.666666666664, ans=0.125 2023-10-09 16:08:08,442 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.29 vs. 
limit=22.5 2023-10-09 16:08:17,244 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=60900.0, ans=0.125 2023-10-09 16:08:21,610 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=60900.0, ans=0.1 2023-10-09 16:08:21,919 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.18 vs. limit=22.5 2023-10-09 16:08:37,313 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.24 vs. limit=10.0 2023-10-09 16:09:12,222 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=61086.666666666664, ans=0.07 2023-10-09 16:09:15,612 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.36 vs. limit=15.0 2023-10-09 16:09:19,042 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=61133.333333333336, ans=0.2 2023-10-09 16:09:19,120 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=61133.333333333336, ans=0.2 2023-10-09 16:09:29,554 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=61133.333333333336, ans=0.125 2023-10-09 16:09:33,018 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=61180.0, ans=0.2 2023-10-09 16:09:33,481 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.757e+02 2.413e+02 2.867e+02 3.318e+02 4.884e+02, threshold=5.734e+02, percent-clipped=0.0 2023-10-09 16:09:42,263 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=5.05 vs. limit=12.0 2023-10-09 16:10:06,043 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.15 vs. limit=12.0 2023-10-09 16:10:20,419 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=61366.666666666664, ans=0.1 2023-10-09 16:10:31,300 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=61413.333333333336, ans=0.125 2023-10-09 16:10:44,722 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=61460.0, ans=0.125 2023-10-09 16:10:50,543 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=61506.666666666664, ans=0.0 2023-10-09 16:11:04,268 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.12 vs. 
limit=15.0 2023-10-09 16:11:26,240 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=61600.0, ans=0.09899494936611666 2023-10-09 16:11:31,934 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.720e+02 2.500e+02 2.860e+02 3.236e+02 4.527e+02, threshold=5.721e+02, percent-clipped=0.0 2023-10-09 16:11:32,345 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=61646.666666666664, ans=0.1 2023-10-09 16:11:51,458 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.41 vs. limit=22.5 2023-10-09 16:11:52,114 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=61740.0, ans=0.2 2023-10-09 16:11:59,387 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=61740.0, ans=0.125 2023-10-09 16:11:59,522 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=61740.0, ans=0.125 2023-10-09 16:12:04,452 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=61786.666666666664, ans=0.2 2023-10-09 16:12:16,103 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.09 vs. limit=22.5 2023-10-09 16:12:18,787 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.00 vs. limit=22.5 2023-10-09 16:12:24,513 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.21 vs. limit=15.0 2023-10-09 16:12:30,180 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=61880.0, ans=0.0 2023-10-09 16:12:47,136 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.44 vs. limit=15.0 2023-10-09 16:12:53,353 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.37 vs. limit=15.0 2023-10-09 16:12:54,160 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=61973.333333333336, ans=0.0 2023-10-09 16:12:59,004 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.79 vs. 
limit=15.0 2023-10-09 16:13:25,194 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.608e+02 2.308e+02 2.743e+02 3.164e+02 4.193e+02, threshold=5.485e+02, percent-clipped=0.0 2023-10-09 16:13:25,395 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=62113.333333333336, ans=0.125 2023-10-09 16:13:35,159 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=62160.0, ans=0.0 2023-10-09 16:13:41,359 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=62160.0, ans=0.125 2023-10-09 16:13:46,527 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=62206.666666666664, ans=0.125 2023-10-09 16:13:56,371 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=62206.666666666664, ans=22.5 2023-10-09 16:13:56,912 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=62253.333333333336, ans=0.0 2023-10-09 16:14:05,886 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.34 vs. limit=22.5 2023-10-09 16:14:06,617 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=62253.333333333336, ans=0.0 2023-10-09 16:14:17,038 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=62300.0, ans=0.125 2023-10-09 16:14:30,662 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 16:14:33,356 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.38 vs. 
limit=22.5 2023-10-09 16:15:05,964 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=62486.666666666664, ans=0.2 2023-10-09 16:15:14,067 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=62486.666666666664, ans=0.1 2023-10-09 16:15:30,917 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.862e+02 2.302e+02 2.597e+02 2.889e+02 4.663e+02, threshold=5.193e+02, percent-clipped=0.0 2023-10-09 16:15:36,618 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=62580.0, ans=0.125 2023-10-09 16:15:52,444 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=62673.333333333336, ans=0.0 2023-10-09 16:15:59,416 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=62673.333333333336, ans=0.0 2023-10-09 16:16:09,241 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 16:16:11,144 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=62720.0, ans=0.0 2023-10-09 16:16:29,661 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=62813.333333333336, ans=0.125 2023-10-09 16:17:01,114 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=62953.333333333336, ans=0.125 2023-10-09 16:17:06,477 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=62953.333333333336, ans=0.1 2023-10-09 16:17:11,397 INFO [train.py:1031] (3/4) Epoch 1, batch 13500, loss[loss=0.3078, simple_loss=0.3754, pruned_loss=0.1201, over 16982.00 frames. ], tot_loss[loss=0.3426, simple_loss=0.3966, pruned_loss=0.1444, over 32739267.49 frames. ], batch size: 93, lr: 3.14e-02, grad_scale: 32.0 2023-10-09 16:17:15,202 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=63000.0, ans=0.125 2023-10-09 16:17:22,861 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=63046.666666666664, ans=0.0 2023-10-09 16:17:23,879 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.708e+02 2.381e+02 2.740e+02 3.450e+02 5.521e+02, threshold=5.480e+02, percent-clipped=2.0 2023-10-09 16:17:34,987 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=63093.333333333336, ans=0.125 2023-10-09 16:17:44,673 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.07 vs. 
limit=15.0 2023-10-09 16:18:17,548 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=63233.333333333336, ans=0.2 2023-10-09 16:18:31,681 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=63280.0, ans=0.125 2023-10-09 16:18:34,529 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=63326.666666666664, ans=0.1 2023-10-09 16:18:40,677 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=63326.666666666664, ans=0.125 2023-10-09 16:18:40,689 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=63326.666666666664, ans=0.0 2023-10-09 16:18:56,915 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.63 vs. limit=22.5 2023-10-09 16:19:21,161 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.751e+02 2.355e+02 2.753e+02 3.090e+02 4.398e+02, threshold=5.506e+02, percent-clipped=0.0 2023-10-09 16:19:28,285 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=63513.333333333336, ans=0.125 2023-10-09 16:19:32,134 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=63560.0, ans=0.1 2023-10-09 16:19:37,481 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.60 vs. limit=22.5 2023-10-09 16:19:54,457 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=63653.333333333336, ans=0.125 2023-10-09 16:19:54,771 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.11 vs. limit=12.0 2023-10-09 16:20:40,210 INFO [train.py:1031] (3/4) Epoch 2, batch 0, loss[loss=0.3205, simple_loss=0.3796, pruned_loss=0.1307, over 16592.00 frames. ], tot_loss[loss=0.3205, simple_loss=0.3796, pruned_loss=0.1307, over 16592.00 frames. ], batch size: 61, lr: 2.63e-02, grad_scale: 32.0 2023-10-09 16:20:40,211 INFO [train.py:1054] (3/4) Computing validation loss 2023-10-09 16:20:46,721 INFO [zipformer.py:1853] (3/4) name=encoder.encoders.5.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([3.6584, 4.6970, 2.1411, 4.4273], device='cuda:3') 2023-10-09 16:20:48,149 INFO [train.py:1063] (3/4) Epoch 2, validation: loss=0.3074, simple_loss=0.3842, pruned_loss=0.1153, over 1020973.00 frames. 
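A note on the lr values in the train.py progress lines (3.23e-02 at epoch 1, batch 12500, falling to 2.52e-02 by epoch 2, batch 1500): they are consistent with an Eden-style inverse-power learning-rate schedule as used in icefall. The sketch below is illustrative rather than a copy of the training code; base_lr=0.045, lr_batches=7500, lr_epochs=1.0, the (epoch - 1) offset, and the ~13650 cumulative batches in epoch 1 are all assumptions, chosen because they reproduce the lr values printed in this excerpt, and any warmup behaviour near batch 0 is omitted.

```python
# Illustrative Eden-style LR schedule (assumptions noted above; not the
# training code itself). `batch` is the cumulative batch index across epochs.
base_lr, lr_batches, lr_epochs = 0.045, 7500.0, 1.0  # assumed values

def eden_lr(batch: float, epoch: int) -> float:
    # Inverse fourth-root decay in both batch count and completed epochs.
    batch_factor = ((batch**2 + lr_batches**2) / lr_batches**2) ** -0.25
    epoch_factor = (((epoch - 1) ** 2 + lr_epochs**2) / lr_epochs**2) ** -0.25
    return base_lr * batch_factor * epoch_factor

# Cross-checks against lr values printed in this log excerpt:
print(f"{eden_lr(12500, 1):.2e}")  # 3.23e-02 (epoch 1, batch 12500)
print(f"{eden_lr(13000, 1):.2e}")  # 3.18e-02 (epoch 1, batch 13000)
print(f"{eden_lr(13500, 1):.2e}")  # 3.14e-02 (epoch 1, batch 13500)
print(f"{eden_lr(13650, 2):.2e}")  # 2.63e-02 (epoch 2, batch 0; cumulative count assumed)
```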
2023-10-09 16:20:48,150 INFO [train.py:1064] (3/4) Maximum memory allocated so far is 16317MB 2023-10-09 16:21:07,185 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=63770.0, ans=0.0 2023-10-09 16:21:15,188 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=63816.666666666664, ans=0.125 2023-10-09 16:21:20,775 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=63816.666666666664, ans=0.125 2023-10-09 16:21:25,656 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.85 vs. limit=6.0 2023-10-09 16:21:27,905 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=63863.333333333336, ans=0.125 2023-10-09 16:21:33,113 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=63863.333333333336, ans=0.125 2023-10-09 16:21:33,582 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.18 vs. limit=6.0 2023-10-09 16:21:43,930 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=63910.0, ans=0.0 2023-10-09 16:21:53,995 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=63956.666666666664, ans=0.2 2023-10-09 16:21:57,047 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.661e+02 2.350e+02 2.672e+02 3.244e+02 4.748e+02, threshold=5.344e+02, percent-clipped=0.0 2023-10-09 16:22:13,976 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=64003.333333333336, ans=0.1 2023-10-09 16:22:13,998 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 16:22:24,129 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=64050.0, ans=0.0 2023-10-09 16:22:42,417 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 16:22:53,666 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=64190.0, ans=0.0 2023-10-09 16:23:01,448 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=64236.666666666664, ans=0.0 2023-10-09 16:23:01,466 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=64236.666666666664, ans=0.0 2023-10-09 16:23:09,650 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=64236.666666666664, ans=0.025 2023-10-09 16:23:11,181 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=64236.666666666664, ans=0.0 2023-10-09 16:23:36,678 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=64376.666666666664, ans=0.125 2023-10-09 16:23:39,913 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, 
num_groups=1, num_channels=512, metric=11.56 vs. limit=15.0 2023-10-09 16:23:53,844 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.19 vs. limit=15.0 2023-10-09 16:23:54,114 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.741e+02 2.175e+02 2.474e+02 2.891e+02 4.689e+02, threshold=4.948e+02, percent-clipped=0.0 2023-10-09 16:23:54,484 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=64423.333333333336, ans=0.0 2023-10-09 16:23:57,804 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=64470.0, ans=0.125 2023-10-09 16:24:00,395 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=64470.0, ans=0.2 2023-10-09 16:24:16,632 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=64516.666666666664, ans=0.1 2023-10-09 16:25:04,041 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=64750.0, ans=15.0 2023-10-09 16:25:22,927 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.19 vs. limit=15.0 2023-10-09 16:25:39,453 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=64843.333333333336, ans=0.1 2023-10-09 16:25:45,263 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=64890.0, ans=0.125 2023-10-09 16:25:48,406 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.813e+02 2.049e+02 2.283e+02 2.687e+02 4.170e+02, threshold=4.566e+02, percent-clipped=0.0 2023-10-09 16:25:51,588 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=64936.666666666664, ans=0.125 2023-10-09 16:26:07,005 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=64983.333333333336, ans=0.0 2023-10-09 16:26:20,815 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=65030.0, ans=0.125 2023-10-09 16:26:31,016 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=65076.666666666664, ans=0.125 2023-10-09 16:26:41,605 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=65123.333333333336, ans=0.0 2023-10-09 16:26:53,254 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=65170.0, ans=0.07 2023-10-09 16:27:11,721 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.58 vs. limit=15.0 2023-10-09 16:27:18,217 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.89 vs. 
limit=12.0 2023-10-09 16:27:34,925 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.621e+02 2.170e+02 2.445e+02 2.887e+02 4.355e+02, threshold=4.890e+02, percent-clipped=0.0 2023-10-09 16:27:40,016 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=65403.333333333336, ans=0.025 2023-10-09 16:28:07,606 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.15 vs. limit=15.0 2023-10-09 16:28:27,905 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=65590.0, ans=0.0 2023-10-09 16:28:34,554 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=65636.66666666667, ans=0.07 2023-10-09 16:28:42,176 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=65683.33333333333, ans=0.0 2023-10-09 16:28:52,715 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.55 vs. limit=15.0 2023-10-09 16:28:53,380 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=65730.0, ans=0.0 2023-10-09 16:28:54,775 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.34 vs. limit=22.5 2023-10-09 16:29:23,474 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.796e+02 2.212e+02 2.551e+02 2.907e+02 5.719e+02, threshold=5.101e+02, percent-clipped=2.0 2023-10-09 16:29:23,993 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=65823.33333333333, ans=0.125 2023-10-09 16:29:27,750 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.99 vs. limit=22.5 2023-10-09 16:29:34,395 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-09 16:29:36,112 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=65870.0, ans=0.0 2023-10-09 16:29:55,667 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=65963.33333333333, ans=0.0 2023-10-09 16:30:13,121 INFO [train.py:1031] (3/4) Epoch 2, batch 500, loss[loss=0.2849, simple_loss=0.3566, pruned_loss=0.1066, over 16896.00 frames. ], tot_loss[loss=0.3231, simple_loss=0.3828, pruned_loss=0.1317, over 7287507.41 frames. 
], batch size: 123, lr: 2.59e-02, grad_scale: 32.0 2023-10-09 16:30:16,810 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=66056.66666666667, ans=0.125 2023-10-09 16:30:26,470 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=66103.33333333333, ans=0.0 2023-10-09 16:30:27,518 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=66103.33333333333, ans=0.0 2023-10-09 16:30:36,583 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=66150.0, ans=0.125 2023-10-09 16:30:46,342 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=66196.66666666667, ans=0.0 2023-10-09 16:30:56,510 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.03 vs. limit=15.0 2023-10-09 16:31:00,010 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=66243.33333333333, ans=0.125 2023-10-09 16:31:04,196 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=66243.33333333333, ans=0.125 2023-10-09 16:31:10,519 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=66290.0, ans=0.07 2023-10-09 16:31:15,553 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.635e+02 2.168e+02 2.544e+02 2.989e+02 5.347e+02, threshold=5.087e+02, percent-clipped=1.0 2023-10-09 16:31:16,666 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=66290.0, ans=0.125 2023-10-09 16:31:19,660 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=66336.66666666667, ans=0.125 2023-10-09 16:31:31,256 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=66336.66666666667, ans=0.07 2023-10-09 16:31:31,270 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=66336.66666666667, ans=0.1 2023-10-09 16:32:13,569 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=66476.66666666667, ans=0.0 2023-10-09 16:32:18,362 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.72 vs. limit=15.0 2023-10-09 16:32:23,020 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=66523.33333333333, ans=0.125 2023-10-09 16:32:37,279 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.83 vs. 
limit=10.0 2023-10-09 16:32:38,087 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=66570.0, ans=0.0 2023-10-09 16:32:45,653 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=66616.66666666667, ans=0.125 2023-10-09 16:33:09,658 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.48 vs. limit=12.0 2023-10-09 16:33:17,998 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.86 vs. limit=15.0 2023-10-09 16:33:23,966 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.787e+02 2.204e+02 2.493e+02 2.814e+02 4.565e+02, threshold=4.986e+02, percent-clipped=0.0 2023-10-09 16:33:25,902 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=66756.66666666667, ans=0.1 2023-10-09 16:33:37,076 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=66803.33333333333, ans=0.125 2023-10-09 16:33:43,807 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=66850.0, ans=0.0 2023-10-09 16:33:43,818 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=66850.0, ans=0.125 2023-10-09 16:33:45,353 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 16:34:19,572 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=66990.0, ans=0.125 2023-10-09 16:34:30,053 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=67036.66666666667, ans=0.2 2023-10-09 16:34:35,350 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=67036.66666666667, ans=0.0 2023-10-09 16:34:39,191 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=67036.66666666667, ans=0.125 2023-10-09 16:34:41,052 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=67083.33333333333, ans=0.125 2023-10-09 16:34:41,997 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 16:34:54,496 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=67130.0, ans=0.025 2023-10-09 16:35:27,452 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.767e+02 2.208e+02 2.513e+02 2.866e+02 4.304e+02, threshold=5.027e+02, percent-clipped=0.0 2023-10-09 16:35:28,798 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.97 vs. 
limit=6.0 2023-10-09 16:36:01,688 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=67410.0, ans=0.0 2023-10-09 16:36:15,613 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=67456.66666666667, ans=0.0 2023-10-09 16:36:42,879 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=67550.0, ans=0.0 2023-10-09 16:36:48,515 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=67550.0, ans=0.125 2023-10-09 16:37:03,130 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=67596.66666666667, ans=0.0 2023-10-09 16:37:07,285 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=67596.66666666667, ans=0.125 2023-10-09 16:37:27,622 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.648e+02 2.224e+02 2.569e+02 2.872e+02 4.220e+02, threshold=5.137e+02, percent-clipped=0.0 2023-10-09 16:37:36,702 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.29 vs. limit=15.0 2023-10-09 16:37:38,818 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=67736.66666666667, ans=0.1 2023-10-09 16:37:41,579 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=67783.33333333333, ans=0.0 2023-10-09 16:37:42,882 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.48 vs. limit=15.0 2023-10-09 16:37:45,358 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=67783.33333333333, ans=0.0 2023-10-09 16:37:57,249 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=67830.0, ans=0.0 2023-10-09 16:38:14,836 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=4.89 vs. limit=15.0 2023-10-09 16:38:16,062 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=67923.33333333333, ans=0.0 2023-10-09 16:38:41,962 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=68016.66666666667, ans=0.0 2023-10-09 16:38:43,820 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=68016.66666666667, ans=0.0 2023-10-09 16:38:59,366 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.92 vs. 
limit=22.5 2023-10-09 16:39:25,523 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=68156.66666666667, ans=0.125 2023-10-09 16:39:27,097 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.662e+02 2.089e+02 2.451e+02 2.809e+02 3.972e+02, threshold=4.902e+02, percent-clipped=0.0 2023-10-09 16:39:36,302 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=68203.33333333333, ans=0.0 2023-10-09 16:39:56,748 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.50 vs. limit=15.0 2023-10-09 16:39:57,515 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=68296.66666666667, ans=0.125 2023-10-09 16:39:57,618 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.77 vs. limit=15.0 2023-10-09 16:40:01,282 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.75 vs. limit=15.0 2023-10-09 16:40:16,153 INFO [train.py:1031] (3/4) Epoch 2, batch 1000, loss[loss=0.3166, simple_loss=0.385, pruned_loss=0.1241, over 16791.00 frames. ], tot_loss[loss=0.3201, simple_loss=0.3811, pruned_loss=0.1296, over 12953400.59 frames. ], batch size: 87, lr: 2.55e-02, grad_scale: 32.0 2023-10-09 16:40:18,927 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=68390.0, ans=0.0 2023-10-09 16:40:26,678 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=68436.66666666667, ans=0.1 2023-10-09 16:40:41,260 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=68483.33333333333, ans=0.0 2023-10-09 16:40:41,361 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=68483.33333333333, ans=0.125 2023-10-09 16:41:04,531 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=68576.66666666667, ans=0.1 2023-10-09 16:41:06,271 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=68576.66666666667, ans=0.125 2023-10-09 16:41:17,412 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=68623.33333333333, ans=0.125 2023-10-09 16:41:18,141 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.621e+02 2.111e+02 2.370e+02 2.649e+02 5.291e+02, threshold=4.741e+02, percent-clipped=2.0 2023-10-09 16:41:23,423 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.86 vs. limit=15.0 2023-10-09 16:41:29,650 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=68670.0, ans=0.2 2023-10-09 16:41:40,013 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=4.02 vs. 
limit=12.0 2023-10-09 16:41:42,999 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.28 vs. limit=15.0 2023-10-09 16:41:49,430 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=68763.33333333333, ans=0.0 2023-10-09 16:41:53,356 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.14 vs. limit=15.0 2023-10-09 16:41:56,879 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=68810.0, ans=0.5 2023-10-09 16:42:11,074 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=68856.66666666667, ans=0.2 2023-10-09 16:42:13,800 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=68903.33333333333, ans=0.125 2023-10-09 16:42:15,489 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=68903.33333333333, ans=0.0 2023-10-09 16:42:36,967 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=68950.0, ans=0.0 2023-10-09 16:42:42,310 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=68996.66666666667, ans=0.125 2023-10-09 16:43:11,375 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.91 vs. limit=15.0 2023-10-09 16:43:12,250 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.95 vs. limit=6.0 2023-10-09 16:43:13,719 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.677e+02 2.259e+02 2.528e+02 3.055e+02 4.519e+02, threshold=5.057e+02, percent-clipped=0.0 2023-10-09 16:43:24,916 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=69136.66666666667, ans=0.125 2023-10-09 16:43:25,009 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 16:43:32,870 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.16 vs. limit=15.0 2023-10-09 16:43:34,500 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 16:43:42,296 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.07 vs. 
limit=15.0 2023-10-09 16:43:47,513 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=69230.0, ans=0.1 2023-10-09 16:44:08,352 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=69276.66666666667, ans=0.125 2023-10-09 16:44:18,396 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=69323.33333333333, ans=0.0 2023-10-09 16:44:27,767 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=69370.0, ans=0.0 2023-10-09 16:45:03,450 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=69510.0, ans=0.125 2023-10-09 16:45:06,339 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.90 vs. limit=15.0 2023-10-09 16:45:06,850 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=69510.0, ans=0.125 2023-10-09 16:45:13,705 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.58 vs. limit=15.0 2023-10-09 16:45:22,261 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.593e+02 1.974e+02 2.158e+02 2.514e+02 3.561e+02, threshold=4.315e+02, percent-clipped=0.0 2023-10-09 16:46:11,513 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=69743.33333333333, ans=0.1 2023-10-09 16:46:17,483 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.09 vs. limit=15.0 2023-10-09 16:46:18,136 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=69790.0, ans=0.1 2023-10-09 16:46:19,790 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=69790.0, ans=0.2 2023-10-09 16:46:34,169 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=69836.66666666667, ans=0.125 2023-10-09 16:46:35,017 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=69836.66666666667, ans=0.1 2023-10-09 16:46:52,265 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=69930.0, ans=0.0 2023-10-09 16:46:54,239 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=69930.0, ans=0.125 2023-10-09 16:47:14,944 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=70023.33333333333, ans=0.125 2023-10-09 16:47:18,552 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=70023.33333333333, ans=0.0 2023-10-09 16:47:22,017 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.581e+02 2.290e+02 2.600e+02 2.934e+02 4.496e+02, threshold=5.200e+02, percent-clipped=2.0 2023-10-09 16:47:29,800 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=6.27 vs. 
limit=12.0 2023-10-09 16:48:02,131 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=70210.0, ans=0.125 2023-10-09 16:48:16,293 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=70256.66666666667, ans=0.0 2023-10-09 16:48:24,876 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=70303.33333333333, ans=0.1 2023-10-09 16:48:35,525 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 16:48:50,412 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.94 vs. limit=10.0 2023-10-09 16:49:19,211 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.717e+02 2.147e+02 2.406e+02 2.659e+02 4.241e+02, threshold=4.813e+02, percent-clipped=0.0 2023-10-09 16:49:29,551 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=70536.66666666667, ans=0.0 2023-10-09 16:49:56,556 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.95 vs. limit=22.5 2023-10-09 16:50:13,621 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=70630.0, ans=0.125 2023-10-09 16:50:23,870 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.22 vs. limit=15.0 2023-10-09 16:50:24,376 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=70676.66666666667, ans=0.2 2023-10-09 16:50:25,727 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=70676.66666666667, ans=0.125 2023-10-09 16:50:25,790 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=70676.66666666667, ans=0.0 2023-10-09 16:50:33,226 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=70723.33333333333, ans=0.0 2023-10-09 16:50:33,838 INFO [train.py:1031] (3/4) Epoch 2, batch 1500, loss[loss=0.2932, simple_loss=0.3618, pruned_loss=0.1123, over 16959.00 frames. ], tot_loss[loss=0.3154, simple_loss=0.3773, pruned_loss=0.1268, over 17356409.03 frames. ], batch size: 123, lr: 2.52e-02, grad_scale: 32.0 2023-10-09 16:50:38,035 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=70723.33333333333, ans=0.125 2023-10-09 16:50:50,566 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=70770.0, ans=0.1 2023-10-09 16:51:00,500 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=70816.66666666667, ans=0.2 2023-10-09 16:51:13,805 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.07 vs. 
limit=22.5 2023-10-09 16:51:15,059 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=70863.33333333333, ans=0.125 2023-10-09 16:51:16,394 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=70863.33333333333, ans=0.1 2023-10-09 16:51:32,381 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=70910.0, ans=0.125 2023-10-09 16:51:40,948 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.15 vs. limit=22.5 2023-10-09 16:51:53,606 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.690e+02 2.134e+02 2.459e+02 2.793e+02 3.782e+02, threshold=4.918e+02, percent-clipped=0.0 2023-10-09 16:51:54,050 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=70956.66666666667, ans=0.5 2023-10-09 16:52:00,294 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=71003.33333333333, ans=0.2 2023-10-09 16:52:11,151 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=71050.0, ans=0.0 2023-10-09 16:52:14,444 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.91 vs. limit=12.0 2023-10-09 16:52:16,624 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=71050.0, ans=0.1 2023-10-09 16:52:18,835 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.62 vs. 
limit=6.0 2023-10-09 16:52:28,343 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=71096.66666666667, ans=0.05 2023-10-09 16:52:32,999 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=71143.33333333333, ans=0.2 2023-10-09 16:52:35,807 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_na.min_abs, batch_count=71143.33333333333, ans=0.02 2023-10-09 16:52:36,738 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=71143.33333333333, ans=0.0 2023-10-09 16:52:38,263 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=71143.33333333333, ans=0.125 2023-10-09 16:52:58,755 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=71236.66666666667, ans=0.125 2023-10-09 16:53:12,112 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=71283.33333333333, ans=10.0 2023-10-09 16:53:12,745 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 16:53:14,559 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=71283.33333333333, ans=0.125 2023-10-09 16:53:19,180 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.75 vs. limit=10.0 2023-10-09 16:53:22,140 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.51 vs. limit=15.0 2023-10-09 16:53:29,590 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=71330.0, ans=0.0 2023-10-09 16:53:58,068 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.706e+02 2.153e+02 2.464e+02 2.808e+02 3.679e+02, threshold=4.927e+02, percent-clipped=0.0 2023-10-09 16:54:05,789 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=71470.0, ans=0.125 2023-10-09 16:54:18,952 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=71516.66666666667, ans=0.125 2023-10-09 16:54:49,618 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=71656.66666666667, ans=0.1 2023-10-09 16:55:14,803 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=71750.0, ans=0.1 2023-10-09 16:55:20,034 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=71750.0, ans=0.0 2023-10-09 16:55:25,909 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=71796.66666666667, ans=0.0 2023-10-09 16:55:34,065 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=4.16 vs. 
limit=12.0 2023-10-09 16:55:44,987 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=71890.0, ans=0.125 2023-10-09 16:55:50,974 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.675e+02 2.360e+02 2.595e+02 3.075e+02 4.930e+02, threshold=5.190e+02, percent-clipped=1.0 2023-10-09 16:55:53,968 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=71936.66666666667, ans=0.0 2023-10-09 16:55:58,998 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.94 vs. limit=15.0 2023-10-09 16:56:40,970 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=72076.66666666667, ans=0.125 2023-10-09 16:56:42,739 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=72076.66666666667, ans=0.125 2023-10-09 16:56:50,120 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=72123.33333333333, ans=0.04949747468305833 2023-10-09 16:57:00,196 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=72170.0, ans=0.125 2023-10-09 16:57:02,274 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=72170.0, ans=0.125 2023-10-09 16:57:06,556 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=72216.66666666667, ans=0.125 2023-10-09 16:57:34,166 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=72310.0, ans=0.125 2023-10-09 16:57:34,256 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=72310.0, ans=0.0 2023-10-09 16:57:50,129 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.750e+02 2.312e+02 2.668e+02 3.373e+02 4.905e+02, threshold=5.336e+02, percent-clipped=0.0 2023-10-09 16:57:51,455 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=72356.66666666667, ans=0.125 2023-10-09 16:57:51,670 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.87 vs. limit=15.0 2023-10-09 16:58:02,167 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.38 vs. 
limit=15.0 2023-10-09 16:58:12,706 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=72450.0, ans=0.2 2023-10-09 16:58:14,783 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=72496.66666666667, ans=0.125 2023-10-09 16:58:19,576 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=72496.66666666667, ans=0.0 2023-10-09 16:58:23,369 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=72496.66666666667, ans=0.0 2023-10-09 16:58:42,277 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=72590.0, ans=0.1 2023-10-09 16:59:03,867 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=72636.66666666667, ans=0.125 2023-10-09 16:59:28,347 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=72730.0, ans=0.125 2023-10-09 16:59:53,489 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=72823.33333333333, ans=0.125 2023-10-09 16:59:54,995 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=72823.33333333333, ans=0.125 2023-10-09 16:59:58,994 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.587e+02 2.198e+02 2.537e+02 2.863e+02 4.831e+02, threshold=5.075e+02, percent-clipped=0.0 2023-10-09 17:00:08,582 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=72870.0, ans=0.0 2023-10-09 17:00:33,459 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.24 vs. limit=15.0 2023-10-09 17:00:56,113 INFO [train.py:1031] (3/4) Epoch 2, batch 2000, loss[loss=0.35, simple_loss=0.4033, pruned_loss=0.1483, over 16653.00 frames. ], tot_loss[loss=0.3145, simple_loss=0.3771, pruned_loss=0.126, over 20781500.52 frames. ], batch size: 241, lr: 2.49e-02, grad_scale: 32.0 2023-10-09 17:01:21,661 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 17:01:27,003 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=73150.0, ans=0.2 2023-10-09 17:01:29,023 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.41 vs. limit=22.5 2023-10-09 17:01:31,932 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=73150.0, ans=0.1 2023-10-09 17:01:49,207 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.50 vs. 
limit=10.0 2023-10-09 17:02:00,588 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer_na.min_abs, batch_count=73243.33333333333, ans=0.02 2023-10-09 17:02:17,888 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.621e+02 2.298e+02 2.585e+02 2.929e+02 3.982e+02, threshold=5.170e+02, percent-clipped=0.0 2023-10-09 17:02:18,342 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=73290.0, ans=0.1 2023-10-09 17:02:30,122 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=73336.66666666667, ans=0.1 2023-10-09 17:02:31,436 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.51 vs. limit=15.0 2023-10-09 17:02:35,951 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=73383.33333333333, ans=0.125 2023-10-09 17:02:40,340 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=73383.33333333333, ans=0.125 2023-10-09 17:02:44,915 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.86 vs. limit=10.0 2023-10-09 17:04:03,337 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=73616.66666666667, ans=0.035 2023-10-09 17:04:03,426 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=73616.66666666667, ans=0.1 2023-10-09 17:04:41,107 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.806e+02 2.164e+02 2.490e+02 2.805e+02 6.104e+02, threshold=4.980e+02, percent-clipped=1.0 2023-10-09 17:05:02,815 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=73850.0, ans=0.1 2023-10-09 17:05:17,006 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.93 vs. 
limit=15.0 2023-10-09 17:05:37,640 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=73990.0, ans=0.1 2023-10-09 17:05:56,089 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=74083.33333333333, ans=0.125 2023-10-09 17:06:21,354 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=74176.66666666667, ans=0.05 2023-10-09 17:06:35,884 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=74223.33333333333, ans=0.125 2023-10-09 17:06:38,891 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.555e+02 2.202e+02 2.618e+02 3.068e+02 5.008e+02, threshold=5.236e+02, percent-clipped=1.0 2023-10-09 17:06:41,556 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=74270.0, ans=0.125 2023-10-09 17:06:54,471 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=74316.66666666667, ans=0.125 2023-10-09 17:07:31,996 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=74456.66666666667, ans=0.2 2023-10-09 17:07:48,693 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.58 vs. limit=12.0 2023-10-09 17:08:00,125 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=74550.0, ans=0.1 2023-10-09 17:08:12,673 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=74596.66666666667, ans=0.1 2023-10-09 17:08:23,612 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=74643.33333333333, ans=0.1 2023-10-09 17:08:35,213 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=74690.0, ans=0.2 2023-10-09 17:08:35,748 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.727e+02 2.366e+02 2.689e+02 3.010e+02 5.051e+02, threshold=5.377e+02, percent-clipped=0.0 2023-10-09 17:08:40,829 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=74736.66666666667, ans=0.0 2023-10-09 17:09:09,999 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=74830.0, ans=0.1 2023-10-09 17:09:23,604 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=74923.33333333333, ans=0.1 2023-10-09 17:09:27,014 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=74923.33333333333, ans=0.05 2023-10-09 17:09:39,570 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=74970.0, ans=0.125 2023-10-09 17:09:44,664 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=75016.66666666667, ans=0.2 2023-10-09 17:09:44,677 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=75016.66666666667, ans=0.125 2023-10-09 17:09:55,215 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=75063.33333333333, ans=0.125 2023-10-09 17:10:21,941 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.619e+02 2.224e+02 2.662e+02 3.133e+02 4.356e+02, threshold=5.325e+02, percent-clipped=0.0 2023-10-09 17:10:36,225 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=75250.0, ans=0.125 2023-10-09 17:10:56,828 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=16.55 vs. limit=15.0 2023-10-09 17:10:57,544 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 17:11:05,549 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=75343.33333333333, ans=0.125 2023-10-09 17:11:06,297 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=75390.0, ans=0.125 2023-10-09 17:11:07,056 INFO [train.py:1031] (3/4) Epoch 2, batch 2500, loss[loss=0.3225, simple_loss=0.3833, pruned_loss=0.1309, over 16816.00 frames. ], tot_loss[loss=0.3145, simple_loss=0.377, pruned_loss=0.126, over 23443281.05 frames. ], batch size: 188, lr: 2.46e-02, grad_scale: 32.0 2023-10-09 17:11:10,103 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.65 vs. limit=10.0 2023-10-09 17:11:41,117 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=75483.33333333333, ans=0.125 2023-10-09 17:11:57,790 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=75576.66666666667, ans=0.125 2023-10-09 17:11:58,109 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.75 vs. limit=15.0 2023-10-09 17:12:00,174 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=75576.66666666667, ans=0.125 2023-10-09 17:12:09,196 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=75623.33333333333, ans=0.0 2023-10-09 17:12:11,881 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.546e+02 2.126e+02 2.552e+02 3.063e+02 4.345e+02, threshold=5.104e+02, percent-clipped=0.0 2023-10-09 17:12:22,806 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.51 vs. 
limit=10.0 2023-10-09 17:12:43,616 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=75763.33333333333, ans=0.09899494936611666 2023-10-09 17:12:53,734 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=75810.0, ans=0.125 2023-10-09 17:12:53,746 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=75810.0, ans=0.125 2023-10-09 17:12:54,580 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=75810.0, ans=0.125 2023-10-09 17:12:59,685 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=75856.66666666667, ans=0.0 2023-10-09 17:13:08,054 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=75903.33333333333, ans=0.0 2023-10-09 17:13:11,329 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=75903.33333333333, ans=0.0 2023-10-09 17:13:39,156 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=75996.66666666667, ans=0.1 2023-10-09 17:13:45,487 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=76043.33333333333, ans=0.125 2023-10-09 17:14:00,635 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.14 vs. limit=22.5 2023-10-09 17:14:00,996 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.778e+02 2.136e+02 2.326e+02 2.724e+02 4.312e+02, threshold=4.651e+02, percent-clipped=0.0 2023-10-09 17:14:19,760 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=76183.33333333333, ans=0.0 2023-10-09 17:14:28,225 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.04 vs. 
limit=10.0 2023-10-09 17:15:58,806 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=76510.0, ans=0.125 2023-10-09 17:16:16,463 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.631e+02 2.021e+02 2.381e+02 2.875e+02 3.730e+02, threshold=4.762e+02, percent-clipped=0.0 2023-10-09 17:16:18,722 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=76603.33333333333, ans=0.1 2023-10-09 17:16:43,970 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=76696.66666666667, ans=0.125 2023-10-09 17:16:55,972 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=76743.33333333333, ans=0.0 2023-10-09 17:16:57,590 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=76743.33333333333, ans=0.125 2023-10-09 17:17:12,555 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=76790.0, ans=0.125 2023-10-09 17:17:19,945 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.38 vs. limit=15.0 2023-10-09 17:17:24,291 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=76836.66666666667, ans=0.125 2023-10-09 17:17:40,613 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=76883.33333333333, ans=0.125 2023-10-09 17:17:43,955 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=76930.0, ans=0.0 2023-10-09 17:17:45,596 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=76930.0, ans=0.125 2023-10-09 17:17:47,597 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=76930.0, ans=0.0 2023-10-09 17:17:52,411 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=76930.0, ans=0.125 2023-10-09 17:17:53,855 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=76930.0, ans=0.1 2023-10-09 17:18:07,221 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=76976.66666666667, ans=0.125 2023-10-09 17:18:13,626 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=77023.33333333333, ans=0.125 2023-10-09 17:18:14,651 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.85 vs. 
limit=15.0 2023-10-09 17:18:19,352 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.711e+02 2.317e+02 2.669e+02 3.224e+02 6.283e+02, threshold=5.337e+02, percent-clipped=2.0 2023-10-09 17:18:41,889 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=77116.66666666667, ans=0.2 2023-10-09 17:19:22,348 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=77256.66666666667, ans=0.0 2023-10-09 17:19:27,343 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.32 vs. limit=22.5 2023-10-09 17:19:29,169 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=77303.33333333333, ans=0.125 2023-10-09 17:19:48,441 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=77396.66666666667, ans=0.1 2023-10-09 17:20:15,196 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.23 vs. limit=22.5 2023-10-09 17:20:19,677 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.629e+02 2.193e+02 2.477e+02 2.782e+02 5.717e+02, threshold=4.954e+02, percent-clipped=1.0 2023-10-09 17:20:20,339 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.33 vs. limit=15.0 2023-10-09 17:20:22,296 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=77536.66666666667, ans=0.5 2023-10-09 17:20:25,430 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=77536.66666666667, ans=0.2 2023-10-09 17:20:26,812 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.61 vs. limit=15.0 2023-10-09 17:20:48,433 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=77630.0, ans=0.125 2023-10-09 17:20:57,972 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.60 vs. limit=15.0 2023-10-09 17:21:00,381 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=77676.66666666667, ans=0.125 2023-10-09 17:21:05,847 INFO [train.py:1031] (3/4) Epoch 2, batch 3000, loss[loss=0.3073, simple_loss=0.374, pruned_loss=0.1204, over 16941.00 frames. ], tot_loss[loss=0.3128, simple_loss=0.3752, pruned_loss=0.1251, over 25519641.04 frames. ], batch size: 123, lr: 2.42e-02, grad_scale: 32.0 2023-10-09 17:21:16,127 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=8.28 vs. 
limit=15.0 2023-10-09 17:21:20,256 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=77770.0, ans=0.2 2023-10-09 17:21:26,777 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=77816.66666666667, ans=0.0 2023-10-09 17:21:31,284 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.86 vs. limit=15.0 2023-10-09 17:21:34,868 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=77816.66666666667, ans=0.07 2023-10-09 17:21:42,730 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.71 vs. limit=15.0 2023-10-09 17:21:44,967 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=77863.33333333333, ans=0.125 2023-10-09 17:21:59,033 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=77956.66666666667, ans=0.0 2023-10-09 17:22:10,311 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=77956.66666666667, ans=0.0 2023-10-09 17:22:10,792 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.514e+02 2.080e+02 2.421e+02 2.879e+02 4.685e+02, threshold=4.842e+02, percent-clipped=0.0 2023-10-09 17:22:21,814 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.92 vs. limit=15.0 2023-10-09 17:22:25,816 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=78050.0, ans=0.5 2023-10-09 17:22:53,216 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.07 vs. limit=22.5 2023-10-09 17:23:20,355 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=78236.66666666667, ans=0.0 2023-10-09 17:23:35,655 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=78330.0, ans=10.0 2023-10-09 17:23:39,537 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=78330.0, ans=0.125 2023-10-09 17:23:47,226 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.23 vs. limit=15.0 2023-10-09 17:23:57,820 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.78 vs. 
limit=15.0 2023-10-09 17:23:58,658 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=78423.33333333333, ans=10.0 2023-10-09 17:24:03,719 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.748e+02 2.169e+02 2.375e+02 2.769e+02 3.890e+02, threshold=4.750e+02, percent-clipped=0.0 2023-10-09 17:24:23,991 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=78516.66666666667, ans=0.125 2023-10-09 17:24:50,341 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys.whitening_limit, batch_count=78656.66666666667, ans=6.0 2023-10-09 17:24:55,697 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=78656.66666666667, ans=0.0 2023-10-09 17:25:07,908 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=78703.33333333333, ans=0.125 2023-10-09 17:25:12,614 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=78703.33333333333, ans=0.125 2023-10-09 17:25:12,825 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.91 vs. limit=6.0 2023-10-09 17:25:15,155 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=78750.0, ans=0.0 2023-10-09 17:25:27,449 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=78796.66666666667, ans=0.125 2023-10-09 17:25:43,640 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.97 vs. 
limit=6.0 2023-10-09 17:25:43,640 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=78843.33333333333, ans=6.0 2023-10-09 17:25:56,433 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=78890.0, ans=0.125 2023-10-09 17:26:05,036 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.454e+02 2.124e+02 2.496e+02 2.948e+02 4.840e+02, threshold=4.992e+02, percent-clipped=1.0 2023-10-09 17:26:23,201 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=78983.33333333333, ans=0.125 2023-10-09 17:26:39,632 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=79030.0, ans=0.0 2023-10-09 17:26:42,220 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=79076.66666666667, ans=0.125 2023-10-09 17:26:45,857 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=79076.66666666667, ans=0.0 2023-10-09 17:26:55,802 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=79123.33333333333, ans=0.125 2023-10-09 17:27:03,201 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.09 vs. limit=10.0 2023-10-09 17:27:13,460 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=79170.0, ans=0.125 2023-10-09 17:27:16,687 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=79216.66666666667, ans=0.125 2023-10-09 17:27:30,229 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=79263.33333333333, ans=0.1 2023-10-09 17:27:33,747 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=79263.33333333333, ans=0.0 2023-10-09 17:27:38,357 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=79310.0, ans=0.0 2023-10-09 17:27:45,919 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.31 vs. 
limit=15.0 2023-10-09 17:27:55,105 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.674e+02 2.132e+02 2.448e+02 2.724e+02 4.044e+02, threshold=4.896e+02, percent-clipped=0.0 2023-10-09 17:28:16,526 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=79450.0, ans=0.125 2023-10-09 17:28:17,796 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=79450.0, ans=0.1 2023-10-09 17:28:33,657 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=79496.66666666667, ans=0.125 2023-10-09 17:28:42,872 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=79543.33333333333, ans=0.125 2023-10-09 17:29:01,207 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=79636.66666666667, ans=0.125 2023-10-09 17:29:03,350 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=79636.66666666667, ans=0.125 2023-10-09 17:29:04,395 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=79636.66666666667, ans=0.0 2023-10-09 17:29:13,569 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=79683.33333333333, ans=0.0 2023-10-09 17:29:26,993 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=79730.0, ans=0.1 2023-10-09 17:29:34,496 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=79776.66666666667, ans=0.125 2023-10-09 17:29:34,517 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=79776.66666666667, ans=0.5 2023-10-09 17:29:50,673 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=79823.33333333333, ans=0.125 2023-10-09 17:29:52,228 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.554e+02 2.122e+02 2.519e+02 2.921e+02 4.332e+02, threshold=5.037e+02, percent-clipped=0.0 2023-10-09 17:29:58,765 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=79870.0, ans=10.0 2023-10-09 17:30:40,253 INFO [train.py:1031] (3/4) Epoch 2, batch 3500, loss[loss=0.2984, simple_loss=0.3607, pruned_loss=0.118, over 16848.00 frames. ], tot_loss[loss=0.3119, simple_loss=0.3744, pruned_loss=0.1247, over 27121118.72 frames. ], batch size: 155, lr: 2.39e-02, grad_scale: 32.0 2023-10-09 17:30:44,503 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 17:30:56,755 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.87 vs. 
limit=15.0 2023-10-09 17:31:02,980 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=80150.0, ans=0.1 2023-10-09 17:31:11,326 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.whiten.whitening_limit, batch_count=80150.0, ans=12.0 2023-10-09 17:31:12,604 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=80150.0, ans=0.125 2023-10-09 17:31:13,527 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.33 vs. limit=15.0 2023-10-09 17:31:37,884 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.92 vs. limit=22.5 2023-10-09 17:31:38,692 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=80290.0, ans=0.125 2023-10-09 17:31:46,041 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.717e+02 2.213e+02 2.449e+02 2.841e+02 4.307e+02, threshold=4.898e+02, percent-clipped=0.0 2023-10-09 17:31:50,467 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=80336.66666666667, ans=10.0 2023-10-09 17:31:51,118 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=80336.66666666667, ans=0.125 2023-10-09 17:31:56,552 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=80336.66666666667, ans=0.125 2023-10-09 17:32:04,132 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=80383.33333333333, ans=0.0 2023-10-09 17:32:35,231 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=80476.66666666667, ans=0.2 2023-10-09 17:32:50,403 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.47 vs. limit=15.0 2023-10-09 17:33:05,549 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=80616.66666666667, ans=0.125 2023-10-09 17:33:10,392 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.00 vs. 
limit=22.5 2023-10-09 17:33:38,047 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-09 17:33:46,645 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.538e+02 2.111e+02 2.346e+02 2.701e+02 4.177e+02, threshold=4.692e+02, percent-clipped=0.0 2023-10-09 17:33:54,806 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff2.min_abs, batch_count=80803.33333333333, ans=0.1 2023-10-09 17:34:07,277 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=80850.0, ans=0.1 2023-10-09 17:34:12,715 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=80896.66666666667, ans=0.125 2023-10-09 17:34:33,031 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=80990.0, ans=0.09899494936611666 2023-10-09 17:34:33,233 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.39 vs. limit=6.0 2023-10-09 17:35:03,268 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=81083.33333333333, ans=0.1 2023-10-09 17:35:06,282 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=81083.33333333333, ans=0.0 2023-10-09 17:35:11,284 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=81130.0, ans=0.0 2023-10-09 17:35:19,174 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=81130.0, ans=0.125 2023-10-09 17:35:37,294 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.00 vs. limit=15.0 2023-10-09 17:35:44,807 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.648e+02 2.023e+02 2.283e+02 2.727e+02 4.343e+02, threshold=4.565e+02, percent-clipped=0.0 2023-10-09 17:35:48,659 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.35 vs. limit=15.0 2023-10-09 17:35:55,116 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=81270.0, ans=0.125 2023-10-09 17:35:56,102 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=81316.66666666667, ans=0.125 2023-10-09 17:36:14,429 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=81363.33333333333, ans=0.125 2023-10-09 17:36:24,706 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=81410.0, ans=0.1 2023-10-09 17:36:30,766 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=81410.0, ans=0.125 2023-10-09 17:36:31,021 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.07 vs. 
limit=12.0 2023-10-09 17:36:51,654 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=81503.33333333333, ans=0.0 2023-10-09 17:36:56,170 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=81550.0, ans=0.0 2023-10-09 17:37:00,399 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=81550.0, ans=0.1 2023-10-09 17:37:38,003 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.661e+02 2.095e+02 2.292e+02 2.573e+02 3.685e+02, threshold=4.584e+02, percent-clipped=0.0 2023-10-09 17:37:48,830 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=81736.66666666667, ans=0.1 2023-10-09 17:37:52,478 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=81783.33333333333, ans=0.125 2023-10-09 17:38:12,653 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=81876.66666666667, ans=0.125 2023-10-09 17:38:19,605 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=81876.66666666667, ans=0.125 2023-10-09 17:38:28,959 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.41 vs. limit=15.0 2023-10-09 17:38:29,892 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=81923.33333333333, ans=0.125 2023-10-09 17:38:46,181 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=81970.0, ans=0.0 2023-10-09 17:39:03,796 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=82063.33333333333, ans=0.1 2023-10-09 17:39:05,646 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.82 vs. limit=10.0 2023-10-09 17:39:10,339 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.76 vs. limit=6.0 2023-10-09 17:39:28,481 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=82156.66666666667, ans=0.0 2023-10-09 17:39:29,499 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.82 vs. limit=12.0 2023-10-09 17:39:30,050 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.584e+02 2.084e+02 2.510e+02 2.904e+02 5.296e+02, threshold=5.020e+02, percent-clipped=4.0 2023-10-09 17:39:34,918 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.86 vs. limit=10.0 2023-10-09 17:39:57,584 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.93 vs. limit=15.0 2023-10-09 17:40:18,355 INFO [train.py:1031] (3/4) Epoch 2, batch 4000, loss[loss=0.3426, simple_loss=0.4053, pruned_loss=0.14, over 16524.00 frames. 
], tot_loss[loss=0.3097, simple_loss=0.3727, pruned_loss=0.1233, over 28401057.96 frames. ], batch size: 266, lr: 2.37e-02, grad_scale: 32.0 2023-10-09 17:40:22,666 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.50 vs. limit=15.0 2023-10-09 17:40:23,661 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=82390.0, ans=0.0 2023-10-09 17:40:27,232 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=82390.0, ans=0.0 2023-10-09 17:40:46,329 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=82483.33333333333, ans=0.1 2023-10-09 17:41:00,471 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.33 vs. limit=22.5 2023-10-09 17:41:17,904 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=82623.33333333333, ans=0.125 2023-10-09 17:41:24,687 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.535e+02 2.183e+02 2.508e+02 2.978e+02 4.155e+02, threshold=5.017e+02, percent-clipped=0.0 2023-10-09 17:41:31,736 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=82670.0, ans=0.0 2023-10-09 17:41:47,881 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=82716.66666666667, ans=0.125 2023-10-09 17:41:58,379 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=82763.33333333333, ans=0.2 2023-10-09 17:41:58,402 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=82763.33333333333, ans=0.0 2023-10-09 17:42:07,798 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.60 vs. limit=6.0 2023-10-09 17:42:15,431 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=82856.66666666667, ans=0.125 2023-10-09 17:42:16,293 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=82856.66666666667, ans=0.0 2023-10-09 17:42:22,955 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=82903.33333333333, ans=0.125 2023-10-09 17:42:24,799 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=82903.33333333333, ans=0.125 2023-10-09 17:42:28,833 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.97 vs. 
limit=6.0 2023-10-09 17:42:34,816 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=82950.0, ans=10.0 2023-10-09 17:42:38,475 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=82950.0, ans=0.1 2023-10-09 17:42:39,959 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=82950.0, ans=0.0 2023-10-09 17:42:42,385 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=82950.0, ans=0.125 2023-10-09 17:42:43,757 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=82950.0, ans=0.125 2023-10-09 17:42:55,555 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.28 vs. limit=15.0 2023-10-09 17:43:04,745 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=83043.33333333333, ans=0.125 2023-10-09 17:43:10,122 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=83090.0, ans=0.2 2023-10-09 17:43:16,987 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.38 vs. limit=6.0 2023-10-09 17:43:17,344 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.731e+02 2.276e+02 2.662e+02 3.204e+02 4.781e+02, threshold=5.324e+02, percent-clipped=0.0 2023-10-09 17:43:18,071 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.09 vs. limit=12.0 2023-10-09 17:43:18,943 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=83136.66666666667, ans=0.1 2023-10-09 17:43:25,849 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=83136.66666666667, ans=0.5 2023-10-09 17:43:31,610 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=83136.66666666667, ans=0.125 2023-10-09 17:43:32,857 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.26 vs. limit=22.5 2023-10-09 17:43:47,032 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=83183.33333333333, ans=0.125 2023-10-09 17:43:50,003 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.87 vs. 
limit=10.0 2023-10-09 17:43:59,376 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=83230.0, ans=0.125 2023-10-09 17:44:27,506 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=83323.33333333333, ans=0.125 2023-10-09 17:44:52,901 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=83416.66666666667, ans=0.1 2023-10-09 17:45:20,710 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=83510.0, ans=0.2 2023-10-09 17:45:28,656 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.19 vs. limit=15.0 2023-10-09 17:45:37,674 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.533e+02 2.118e+02 2.537e+02 2.829e+02 3.904e+02, threshold=5.074e+02, percent-clipped=0.0 2023-10-09 17:45:37,886 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=83556.66666666667, ans=0.125 2023-10-09 17:45:44,018 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 17:46:09,028 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=83696.66666666667, ans=0.2 2023-10-09 17:46:12,513 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=83743.33333333333, ans=0.09899494936611666 2023-10-09 17:46:37,409 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.17 vs. limit=15.0 2023-10-09 17:46:41,993 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.56 vs. limit=15.0 2023-10-09 17:46:43,509 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=83836.66666666667, ans=0.125 2023-10-09 17:46:50,724 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=83883.33333333333, ans=0.2 2023-10-09 17:47:02,618 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.39 vs. limit=10.0 2023-10-09 17:47:07,088 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=83930.0, ans=0.125 2023-10-09 17:47:29,587 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.732e+02 2.163e+02 2.368e+02 2.753e+02 3.959e+02, threshold=4.737e+02, percent-clipped=0.0 2023-10-09 17:47:33,613 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=84070.0, ans=0.1 2023-10-09 17:47:50,114 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=84116.66666666667, ans=0.125 2023-10-09 17:48:18,313 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=11.06 vs. 
limit=12.0
2023-10-09 17:48:20,852 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=84256.66666666667, ans=0.125
2023-10-09 17:48:47,994 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=84350.0, ans=0.125
2023-10-09 17:48:53,365 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=84396.66666666667, ans=0.125
2023-10-09 17:48:54,638 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=84396.66666666667, ans=0.125
2023-10-09 17:48:57,793 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=84396.66666666667, ans=0.1
2023-10-09 17:49:06,440 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=84443.33333333333, ans=0.0
2023-10-09 17:49:33,463 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.735e+02 2.247e+02 2.592e+02 2.795e+02 4.563e+02, threshold=5.184e+02, percent-clipped=0.0
2023-10-09 17:49:35,576 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=84536.66666666667, ans=0.125
2023-10-09 17:49:42,292 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=84536.66666666667, ans=0.0
2023-10-09 17:49:42,456 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=84536.66666666667, ans=0.0
2023-10-09 17:49:46,428 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=84583.33333333333, ans=0.125
2023-10-09 17:49:49,942 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=84583.33333333333, ans=0.0
2023-10-09 17:49:57,224 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=84583.33333333333, ans=0.125
2023-10-09 17:50:07,050 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=84630.0, ans=0.2
2023-10-09 17:50:11,918 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.76 vs. limit=10.0
2023-10-09 17:50:12,784 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.80 vs. limit=6.0
2023-10-09 17:50:18,209 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=84676.66666666667, ans=0.125
2023-10-09 17:50:23,712 INFO [train.py:1031] (3/4) Epoch 2, batch 4500, loss[loss=0.2719, simple_loss=0.3488, pruned_loss=0.09752, over 16835.00 frames. ], tot_loss[loss=0.3088, simple_loss=0.3722, pruned_loss=0.1227, over 29362206.39 frames. ], batch size: 98, lr: 2.34e-02, grad_scale: 32.0
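
Note on the [train.py:1031] summary entries: loss[...] reports the current batch and tot_loss[...] a running average over all frames seen so far. Throughout this log the combined value is consistent with loss = 0.5 * simple_loss + pruned_loss (here 0.5 * 0.3722 + 0.1227 = 0.3088, and the same relation holds at batches 5000, 5500, 6000, 6500, 7000 and 7500 below), matching the usual pruned-transducer recipe of down-weighting the simple joiner loss. A minimal sketch of that combination; the helper name and the 0.5 weight are read off the printed numbers, not taken from train.py itself:

    # Hypothetical reconstruction of how the logged "loss" relates to its parts.
    # The 0.5 weight is inferred from the printed values, not from train.py.
    def combined_loss(simple_loss: float, pruned_loss: float,
                      simple_loss_scale: float = 0.5) -> float:
        """Weighted pruned-transducer loss as it appears in the summaries."""
        return simple_loss_scale * simple_loss + pruned_loss

    assert abs(combined_loss(0.3722, 0.1227) - 0.3088) < 5e-4  # batch 4500 entry

The lr field in the same entries decays slowly with batch count (2.34e-02 at batch 4500 down to 2.19e-02 by batch 7500), and grad_scale: 32.0 is the mixed-precision gradient-scaler value in effect.
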
2023-10-09 17:50:43,302 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=84770.0, ans=0.1
2023-10-09 17:50:45,392 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=84816.66666666667, ans=0.125
2023-10-09 17:51:18,267 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=84910.0, ans=0.5
2023-10-09 17:51:25,066 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=84956.66666666667, ans=0.0
2023-10-09 17:51:29,526 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.521e+02 1.981e+02 2.305e+02 2.900e+02 5.422e+02, threshold=4.610e+02, percent-clipped=3.0
2023-10-09 17:51:50,736 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=85050.0, ans=0.1
2023-10-09 17:52:05,163 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=85143.33333333333, ans=0.1
2023-10-09 17:52:16,655 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=85190.0, ans=0.125
2023-10-09 17:52:49,844 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-09 17:52:52,308 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=85330.0, ans=0.125
2023-10-09 17:53:12,301 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=85423.33333333333, ans=0.125
2023-10-09 17:53:15,967 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.648e+02 2.103e+02 2.328e+02 2.713e+02 4.473e+02, threshold=4.656e+02, percent-clipped=0.0
2023-10-09 17:53:21,600 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.37 vs. limit=15.0
2023-10-09 17:54:10,925 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.62 vs. limit=6.0
2023-10-09 17:54:22,009 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.03 vs.
limit=15.0 2023-10-09 17:54:34,249 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=85750.0, ans=0.0 2023-10-09 17:55:01,038 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=85843.33333333333, ans=0.125 2023-10-09 17:55:10,646 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.650e+02 2.318e+02 2.569e+02 3.235e+02 4.756e+02, threshold=5.137e+02, percent-clipped=1.0 2023-10-09 17:55:12,793 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=85936.66666666667, ans=0.0 2023-10-09 17:55:23,885 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=85983.33333333333, ans=0.2 2023-10-09 17:55:28,947 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=85983.33333333333, ans=0.0 2023-10-09 17:55:29,824 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=85983.33333333333, ans=0.125 2023-10-09 17:55:34,419 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=86030.0, ans=0.125 2023-10-09 17:55:35,496 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=86030.0, ans=0.05 2023-10-09 17:55:38,527 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.63 vs. limit=6.0 2023-10-09 17:55:51,740 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=86076.66666666667, ans=0.0 2023-10-09 17:55:55,413 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=86123.33333333333, ans=0.125 2023-10-09 17:55:55,690 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.16 vs. 
limit=22.5 2023-10-09 17:55:57,131 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=86123.33333333333, ans=0.05 2023-10-09 17:56:06,905 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=86170.0, ans=0.1 2023-10-09 17:56:23,252 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=86216.66666666667, ans=0.125 2023-10-09 17:56:47,551 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=86310.0, ans=0.0 2023-10-09 17:56:47,688 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=86310.0, ans=0.125 2023-10-09 17:56:53,453 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=86356.66666666667, ans=0.125 2023-10-09 17:56:54,495 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=86356.66666666667, ans=0.125 2023-10-09 17:56:54,904 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.89 vs. limit=15.0 2023-10-09 17:57:01,208 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.34 vs. limit=15.0 2023-10-09 17:57:02,454 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=86356.66666666667, ans=0.125 2023-10-09 17:57:02,908 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.745e+02 2.250e+02 2.637e+02 3.037e+02 4.822e+02, threshold=5.274e+02, percent-clipped=0.0 2023-10-09 17:57:25,411 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.01 vs. 
limit=15.0
2023-10-09 17:58:00,520 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=86636.66666666667, ans=0.125
2023-10-09 17:58:04,111 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=86636.66666666667, ans=0.125
2023-10-09 17:58:08,544 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=86636.66666666667, ans=0.0
2023-10-09 17:58:13,658 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=86683.33333333333, ans=0.125
2023-10-09 17:58:22,067 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=86683.33333333333, ans=0.125
2023-10-09 17:58:22,071 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=86683.33333333333, ans=0.125
2023-10-09 17:58:33,170 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=86730.0, ans=0.0
2023-10-09 17:58:57,914 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.655e+02 2.002e+02 2.329e+02 2.614e+02 3.537e+02, threshold=4.658e+02, percent-clipped=0.0
2023-10-09 17:59:02,251 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=86870.0, ans=0.2
2023-10-09 17:59:06,988 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.95 vs. limit=15.0
2023-10-09 17:59:13,221 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=86916.66666666667, ans=0.125
2023-10-09 17:59:21,928 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=86963.33333333333, ans=0.0
2023-10-09 17:59:22,848 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=86963.33333333333, ans=10.0
2023-10-09 17:59:26,473 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=86963.33333333333, ans=0.1
2023-10-09 17:59:28,536 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=86963.33333333333, ans=0.125
2023-10-09 17:59:30,471 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=86963.33333333333, ans=0.125
2023-10-09 17:59:41,973 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=87010.0, ans=0.5
2023-10-09 17:59:44,664 INFO [train.py:1031] (3/4) Epoch 2, batch 5000, loss[loss=0.3307, simple_loss=0.3866, pruned_loss=0.1373, over 15409.00 frames. ], tot_loss[loss=0.3082, simple_loss=0.3715, pruned_loss=0.1224, over 30127285.11 frames. ], batch size: 35, lr: 2.31e-02, grad_scale: 32.0
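
Note on the [optim.py:471] entries: each one prints a five-number summary (min, 25%, median, 75%, max) of recent gradient norms plus the clipping threshold in effect. In every such entry in this log the threshold equals Clipping_scale times the printed median (for example 2.0 * 2.489e+02 = 4.978e+02 a few entries below), and percent-clipped reports how often the batch gradient norm exceeded it. A minimal sketch of that bookkeeping, assuming a simple sliding window; the class name and window size are illustrative, not taken from optim.py:

    import torch

    class GradNormClipper:
        """Illustrative: clip to clipping_scale * median of recent grad norms."""
        def __init__(self, clipping_scale: float = 2.0, window: int = 128):
            self.clipping_scale = clipping_scale
            self.window = window
            self.norms: list[float] = []

        def step(self, parameters: list[torch.nn.Parameter]) -> float:
            # Global L2 norm of all parameter gradients for this batch.
            norm = torch.norm(torch.stack(
                [p.grad.norm() for p in parameters if p.grad is not None])).item()
            self.norms = (self.norms + [norm])[-self.window:]
            q = torch.quantile(torch.tensor(self.norms),
                               torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
            threshold = self.clipping_scale * q[2].item()  # scale * median
            torch.nn.utils.clip_grad_norm_(parameters, max_norm=threshold)
            return threshold
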
2023-10-09 17:59:52,488 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=87056.66666666667, ans=0.09899494936611666
2023-10-09 17:59:57,069 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.72 vs. limit=15.0
2023-10-09 18:00:05,578 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-09 18:00:22,198 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=87196.66666666667, ans=0.125
2023-10-09 18:00:22,221 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=87196.66666666667, ans=0.05
2023-10-09 18:00:40,725 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=87290.0, ans=0.125
2023-10-09 18:00:44,862 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=87290.0, ans=0.0
2023-10-09 18:00:51,275 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.634e+02 2.135e+02 2.489e+02 2.870e+02 4.390e+02, threshold=4.978e+02, percent-clipped=0.0
2023-10-09 18:00:51,769 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.43 vs. limit=15.0
2023-10-09 18:00:57,336 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.72 vs. limit=22.5
2023-10-09 18:00:58,532 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.51 vs. limit=6.0
2023-10-09 18:01:05,783 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.05 vs. limit=15.0
2023-10-09 18:01:07,424 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=87383.33333333333, ans=10.0
2023-10-09 18:01:12,289 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=87383.33333333333, ans=0.0
2023-10-09 18:01:24,184 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.15 vs. limit=15.0
2023-10-09 18:01:47,831 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=87523.33333333333, ans=0.0
2023-10-09 18:02:17,735 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=87663.33333333333, ans=0.0
2023-10-09 18:02:17,973 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=17.51 vs.
limit=15.0 2023-10-09 18:02:18,493 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=87663.33333333333, ans=0.0 2023-10-09 18:02:19,479 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=87663.33333333333, ans=0.125 2023-10-09 18:02:33,284 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.77 vs. limit=22.5 2023-10-09 18:02:40,298 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=87756.66666666667, ans=0.5 2023-10-09 18:02:49,720 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.706e+02 2.067e+02 2.325e+02 2.720e+02 4.062e+02, threshold=4.650e+02, percent-clipped=0.0 2023-10-09 18:03:28,152 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.39 vs. limit=22.5 2023-10-09 18:03:32,793 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=87943.33333333333, ans=0.0 2023-10-09 18:04:01,233 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.12 vs. limit=15.0 2023-10-09 18:04:02,631 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=88083.33333333333, ans=0.0 2023-10-09 18:04:40,732 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.647e+02 2.120e+02 2.354e+02 2.833e+02 4.221e+02, threshold=4.708e+02, percent-clipped=0.0 2023-10-09 18:05:01,386 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=88316.66666666667, ans=0.2 2023-10-09 18:05:15,573 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=88363.33333333333, ans=0.125 2023-10-09 18:05:17,960 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=88363.33333333333, ans=0.035 2023-10-09 18:05:19,066 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=88410.0, ans=0.125 2023-10-09 18:05:23,688 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=88410.0, ans=0.0 2023-10-09 18:05:32,968 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=88410.0, ans=0.125 2023-10-09 18:05:41,852 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=88456.66666666667, ans=0.125 2023-10-09 18:06:19,932 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=12.46 vs. 
limit=15.0
2023-10-09 18:06:21,288 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=88596.66666666667, ans=0.125
2023-10-09 18:06:43,319 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=88690.0, ans=0.025
2023-10-09 18:06:49,981 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.669e+02 2.019e+02 2.227e+02 2.783e+02 4.113e+02, threshold=4.454e+02, percent-clipped=0.0
2023-10-09 18:06:52,121 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-09 18:07:05,913 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=88783.33333333333, ans=0.0
2023-10-09 18:07:15,214 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=88830.0, ans=0.07
2023-10-09 18:07:28,720 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=88876.66666666667, ans=0.05
2023-10-09 18:07:58,929 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=88970.0, ans=0.125
2023-10-09 18:08:11,410 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=89063.33333333333, ans=0.2
2023-10-09 18:08:15,250 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.21 vs. limit=22.5
2023-10-09 18:08:23,805 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.87 vs. limit=22.5
2023-10-09 18:08:25,092 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=89110.0, ans=0.125
2023-10-09 18:08:36,780 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=89156.66666666667, ans=0.0
2023-10-09 18:08:46,768 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.537e+02 2.064e+02 2.291e+02 2.713e+02 5.102e+02, threshold=4.582e+02, percent-clipped=1.0
2023-10-09 18:08:51,226 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-09 18:08:59,944 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=89203.33333333333, ans=0.0
2023-10-09 18:09:21,408 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.33 vs. limit=15.0
2023-10-09 18:09:36,434 INFO [train.py:1031] (3/4) Epoch 2, batch 5500, loss[loss=0.303, simple_loss=0.3588, pruned_loss=0.1236, over 16701.00 frames. ], tot_loss[loss=0.3069, simple_loss=0.3705, pruned_loss=0.1216, over 30721871.58 frames. ], batch size: 61, lr: 2.28e-02, grad_scale: 32.0
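
Note on the [scaling.py:199] ScheduledFloat entries: they sample regularization knobs (balancer probabilities, attention/conv/feed-forward skip rates, dropout_p, bypass scale_min and the like) whose values are scheduled against batch_count; ans is the value in effect at that point, which is why early-training values such as prob=0.125 or skip_rate=0.0 recur across many modules. A toy piecewise-linear scheduler in that spirit; the breakpoints below are made up for illustration and the real schedules live in scaling.py:

    class ScheduledFloatSketch:
        """Toy piecewise-linear schedule over batch_count (illustrative only)."""
        def __init__(self, *points: tuple[float, float]):
            self.points = sorted(points)  # (batch_count, value) pairs

        def value(self, batch_count: float) -> float:
            pts = self.points
            if batch_count <= pts[0][0]:
                return pts[0][1]
            for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
                if batch_count <= x1:
                    t = (batch_count - x0) / (x1 - x0)
                    return y0 + t * (y1 - y0)  # linear interpolation
            return pts[-1][1]

    # e.g. a dropout that anneals from 0.3 to 0.1 over the first 20k batches
    dropout_p = ScheduledFloatSketch((0.0, 0.3), (20000.0, 0.1))
    print(dropout_p.value(88596.0))  # -> 0.1, the flat value past the last breakpoint
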
2023-10-09 18:09:38,685 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=89390.0, ans=0.0
2023-10-09 18:09:40,256 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=89390.0, ans=0.2
2023-10-09 18:09:54,440 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=89436.66666666667, ans=0.0
2023-10-09 18:10:00,178 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=89483.33333333333, ans=0.125
2023-10-09 18:10:00,216 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2023-10-09 18:10:06,469 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=89530.0, ans=0.125
2023-10-09 18:10:31,613 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.56 vs. limit=15.0
2023-10-09 18:10:37,285 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.604e+02 2.043e+02 2.375e+02 2.896e+02 4.278e+02, threshold=4.750e+02, percent-clipped=0.0
2023-10-09 18:10:38,936 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=89670.0, ans=0.0
2023-10-09 18:10:46,748 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=89670.0, ans=0.125
2023-10-09 18:11:29,213 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=89856.66666666667, ans=0.125
2023-10-09 18:11:31,080 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=89856.66666666667, ans=0.125
2023-10-09 18:12:09,313 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=90043.33333333333, ans=0.125
2023-10-09 18:12:21,212 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=90090.0, ans=0.1
2023-10-09 18:12:26,995 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.521e+02 2.222e+02 2.575e+02 3.125e+02 5.556e+02, threshold=5.150e+02, percent-clipped=3.0
2023-10-09 18:12:39,052 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=90183.33333333333, ans=0.1
2023-10-09 18:12:45,225 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=90183.33333333333, ans=0.0
2023-10-09 18:12:49,271 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=90183.33333333333, ans=0.1
2023-10-09 18:12:49,419 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=90183.33333333333, ans=0.125
2023-10-09 18:13:24,115 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=5.05 vs.
limit=5.0 2023-10-09 18:13:24,483 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=90370.0, ans=0.1 2023-10-09 18:13:40,720 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=90416.66666666667, ans=0.07 2023-10-09 18:14:03,258 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=90510.0, ans=0.0 2023-10-09 18:14:12,259 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.06 vs. limit=22.5 2023-10-09 18:14:16,174 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=90556.66666666667, ans=0.5 2023-10-09 18:14:22,371 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.602e+02 2.220e+02 2.533e+02 2.773e+02 4.529e+02, threshold=5.067e+02, percent-clipped=0.0 2023-10-09 18:14:33,933 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=90650.0, ans=0.0 2023-10-09 18:14:38,045 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=90650.0, ans=0.2 2023-10-09 18:14:39,182 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=90650.0, ans=10.0 2023-10-09 18:14:55,749 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=90696.66666666667, ans=0.2 2023-10-09 18:15:01,370 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=90743.33333333333, ans=0.1 2023-10-09 18:15:01,762 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.60 vs. limit=15.0 2023-10-09 18:15:11,023 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=90790.0, ans=0.07 2023-10-09 18:15:11,915 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=90790.0, ans=0.125 2023-10-09 18:15:29,271 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=90836.66666666667, ans=0.0 2023-10-09 18:15:29,309 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=90836.66666666667, ans=0.125 2023-10-09 18:15:41,159 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=90883.33333333333, ans=0.125 2023-10-09 18:15:44,995 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.62 vs. limit=8.0 2023-10-09 18:15:45,543 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=90930.0, ans=0.1 2023-10-09 18:15:46,516 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.91 vs. 
limit=10.0
2023-10-09 18:15:50,564 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.40 vs. limit=15.0
2023-10-09 18:15:56,588 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.74 vs. limit=10.0
2023-10-09 18:16:05,650 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=91023.33333333333, ans=0.125
2023-10-09 18:16:13,139 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.52 vs. limit=15.0
2023-10-09 18:16:15,723 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.528e+02 2.025e+02 2.330e+02 2.746e+02 4.329e+02, threshold=4.659e+02, percent-clipped=0.0
2023-10-09 18:16:24,500 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=91070.0, ans=0.2
2023-10-09 18:16:38,637 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.55 vs. limit=10.0
2023-10-09 18:16:48,255 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=91163.33333333333, ans=0.125
2023-10-09 18:16:53,170 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=91210.0, ans=0.125
2023-10-09 18:17:48,439 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=91396.66666666667, ans=0.1
2023-10-09 18:17:54,390 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=91443.33333333333, ans=0.125
2023-10-09 18:18:02,517 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=91443.33333333333, ans=0.125
2023-10-09 18:18:15,184 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.716e+02 2.119e+02 2.482e+02 2.997e+02 4.498e+02, threshold=4.963e+02, percent-clipped=0.0
2023-10-09 18:18:18,454 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=91536.66666666667, ans=0.0
2023-10-09 18:18:24,806 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=91536.66666666667, ans=0.0
2023-10-09 18:18:49,370 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=91676.66666666667, ans=0.0
2023-10-09 18:19:00,574 INFO [train.py:1031] (3/4) Epoch 2, batch 6000, loss[loss=0.3067, simple_loss=0.3745, pruned_loss=0.1194, over 16849.00 frames. ], tot_loss[loss=0.3063, simple_loss=0.3701, pruned_loss=0.1213, over 31191908.69 frames. ], batch size: 98, lr: 2.26e-02, grad_scale: 32.0
2023-10-09 18:19:06,538 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.79 vs. limit=10.0
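
Note on the [scaling.py:979] Whitening entries: each one compares a per-module whitening diagnostic against a (scheduled) limit. The metric is 1.0 when the channel covariance within each group is a multiple of the identity and grows as variance concentrates in a few directions; entries such as metric=14.40 vs. limit=15.0 above are close to their limit, and ones where the metric exceeds the limit are the cases the whitening mechanism pushes back on during the backward pass. One way to compute such a metric, assuming the common definition via the normalized sum of squared eigenvalues of the covariance (this sketch is not lifted from scaling.py):

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
        """1.0 when channel covariance is a multiple of the identity within
        each group; grows as variance concentrates (illustrative definition)."""
        n, c = x.shape
        d = c // num_groups
        metrics = []
        for g in range(num_groups):
            xg = x[:, g * d:(g + 1) * d]
            cov = (xg.T @ xg) / n
            # d * sum(eig^2) / sum(eig)^2  ==  d * ||cov||_F^2 / trace(cov)^2
            metrics.append((d * cov.pow(2).sum() / cov.trace().pow(2)).item())
        return sum(metrics) / num_groups

    # Perfectly white features give 1.0; a rank-deficient covariance gives ~d.
    print(whitening_metric(torch.randn(10000, 128), num_groups=4))  # close to 1
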
2023-10-09 18:19:38,221 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=91863.33333333333, ans=0.09899494936611666
2023-10-09 18:19:57,121 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=91910.0, ans=0.125
2023-10-09 18:20:09,362 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.752e+02 2.201e+02 2.469e+02 2.948e+02 4.395e+02, threshold=4.938e+02, percent-clipped=0.0
2023-10-09 18:20:11,438 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.75 vs. limit=22.5
2023-10-09 18:20:58,249 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=92190.0, ans=0.125
2023-10-09 18:21:03,099 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=92190.0, ans=0.125
2023-10-09 18:21:14,426 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=92236.66666666667, ans=0.0
2023-10-09 18:21:22,898 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=92283.33333333333, ans=0.0
2023-10-09 18:21:31,291 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.41 vs. limit=15.0
2023-10-09 18:21:53,406 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=92423.33333333333, ans=0.1
2023-10-09 18:21:59,048 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=92423.33333333333, ans=0.125
2023-10-09 18:22:05,250 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.625e+02 2.170e+02 2.435e+02 2.810e+02 4.259e+02, threshold=4.871e+02, percent-clipped=0.0
2023-10-09 18:22:44,882 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=92610.0, ans=0.125
2023-10-09 18:22:54,402 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=2.516e-03
2023-10-09 18:23:03,919 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=92656.66666666667, ans=0.2
2023-10-09 18:23:29,614 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=92750.0, ans=0.125
2023-10-09 18:23:34,483 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=92796.66666666667, ans=0.05
2023-10-09 18:23:42,225 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=92796.66666666667, ans=0.125
2023-10-09 18:23:57,340 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=92890.0, ans=0.125
2023-10-09 18:23:59,719 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=92890.0, ans=0.125
2023-10-09 18:24:03,832 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.684e+02 2.089e+02 2.305e+02 2.729e+02 3.889e+02,
threshold=4.611e+02, percent-clipped=0.0 2023-10-09 18:24:35,682 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.12 vs. limit=15.0 2023-10-09 18:24:52,846 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=93076.66666666667, ans=0.125 2023-10-09 18:24:53,755 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=93076.66666666667, ans=0.0 2023-10-09 18:25:34,276 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.56 vs. limit=22.5 2023-10-09 18:26:12,631 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.840e+02 2.217e+02 2.714e+02 3.194e+02 4.455e+02, threshold=5.429e+02, percent-clipped=0.0 2023-10-09 18:26:13,200 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=93403.33333333333, ans=0.125 2023-10-09 18:27:14,440 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.06 vs. limit=12.0 2023-10-09 18:27:21,270 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.49 vs. limit=15.0 2023-10-09 18:27:30,476 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=8.120e-03 2023-10-09 18:27:31,433 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 18:27:49,145 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.40 vs. limit=15.0 2023-10-09 18:27:59,850 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=93823.33333333333, ans=0.125 2023-10-09 18:28:01,470 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=93823.33333333333, ans=0.0 2023-10-09 18:28:10,292 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.537e+02 2.048e+02 2.318e+02 2.828e+02 3.945e+02, threshold=4.636e+02, percent-clipped=0.0 2023-10-09 18:28:10,612 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=93870.0, ans=0.5 2023-10-09 18:28:15,695 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.90 vs. 
limit=15.0 2023-10-09 18:28:26,046 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=93916.66666666667, ans=0.025 2023-10-09 18:28:42,200 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=93963.33333333333, ans=0.035 2023-10-09 18:28:43,462 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=93963.33333333333, ans=15.0 2023-10-09 18:28:54,493 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=94010.0, ans=0.125 2023-10-09 18:28:58,053 INFO [train.py:1031] (3/4) Epoch 2, batch 6500, loss[loss=0.3342, simple_loss=0.394, pruned_loss=0.1372, over 16595.00 frames. ], tot_loss[loss=0.3056, simple_loss=0.3696, pruned_loss=0.1208, over 31530435.32 frames. ], batch size: 241, lr: 2.23e-02, grad_scale: 32.0 2023-10-09 18:29:11,962 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=94103.33333333333, ans=0.0 2023-10-09 18:29:37,252 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.18 vs. limit=15.0 2023-10-09 18:29:49,308 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=94196.66666666667, ans=0.0 2023-10-09 18:30:13,250 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=94290.0, ans=0.125 2023-10-09 18:30:21,623 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.781e+02 2.124e+02 2.381e+02 2.650e+02 4.551e+02, threshold=4.762e+02, percent-clipped=0.0 2023-10-09 18:30:35,399 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=94383.33333333333, ans=0.0 2023-10-09 18:30:36,990 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=94383.33333333333, ans=0.2 2023-10-09 18:30:48,792 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=94430.0, ans=0.125 2023-10-09 18:31:18,426 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.02 vs. limit=6.0 2023-10-09 18:32:10,835 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.79 vs. limit=6.0 2023-10-09 18:32:15,122 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.35 vs. 
limit=15.0 2023-10-09 18:32:28,026 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.604e+02 2.216e+02 2.410e+02 3.051e+02 5.962e+02, threshold=4.819e+02, percent-clipped=3.0 2023-10-09 18:32:30,954 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=94803.33333333333, ans=0.125 2023-10-09 18:32:39,262 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=94803.33333333333, ans=0.2 2023-10-09 18:32:59,933 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=94896.66666666667, ans=0.0 2023-10-09 18:33:05,415 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.89 vs. limit=15.0 2023-10-09 18:33:14,363 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.02 vs. limit=15.0 2023-10-09 18:33:19,019 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.14 vs. limit=15.0 2023-10-09 18:33:32,654 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=95036.66666666667, ans=0.2 2023-10-09 18:33:34,762 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=95036.66666666667, ans=0.125 2023-10-09 18:33:55,841 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=95130.0, ans=0.125 2023-10-09 18:34:01,321 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=95176.66666666667, ans=0.125 2023-10-09 18:34:10,049 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=95176.66666666667, ans=0.125 2023-10-09 18:34:22,353 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=95223.33333333333, ans=0.125 2023-10-09 18:34:25,955 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.627e+02 2.045e+02 2.320e+02 2.751e+02 4.314e+02, threshold=4.640e+02, percent-clipped=0.0 2023-10-09 18:34:34,078 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=95270.0, ans=0.07 2023-10-09 18:34:36,775 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=95270.0, ans=0.125 2023-10-09 18:34:40,337 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.78 vs. limit=15.0 2023-10-09 18:34:48,909 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=95363.33333333333, ans=0.125 2023-10-09 18:35:37,260 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 18:35:54,017 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.59 vs. 
limit=6.0 2023-10-09 18:35:57,308 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=95550.0, ans=0.0 2023-10-09 18:36:08,319 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=95596.66666666667, ans=0.0 2023-10-09 18:36:11,341 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=95596.66666666667, ans=10.0 2023-10-09 18:36:31,767 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=4.26 vs. limit=12.0 2023-10-09 18:36:38,811 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.417e+02 1.877e+02 2.135e+02 2.470e+02 4.497e+02, threshold=4.270e+02, percent-clipped=0.0 2023-10-09 18:36:46,428 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=95736.66666666667, ans=0.125 2023-10-09 18:36:51,339 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=95736.66666666667, ans=0.125 2023-10-09 18:36:55,008 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=95783.33333333333, ans=0.0 2023-10-09 18:37:21,091 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=95876.66666666667, ans=0.0 2023-10-09 18:37:27,136 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=95876.66666666667, ans=0.1 2023-10-09 18:37:28,177 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=95876.66666666667, ans=0.0 2023-10-09 18:37:39,201 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=95923.33333333333, ans=0.125 2023-10-09 18:38:04,097 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=96016.66666666667, ans=0.125 2023-10-09 18:38:10,296 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=96063.33333333333, ans=0.125 2023-10-09 18:38:17,798 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=96063.33333333333, ans=0.125 2023-10-09 18:38:20,643 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=96110.0, ans=0.025 2023-10-09 18:38:23,188 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=96110.0, ans=0.0 2023-10-09 18:38:30,838 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 18:38:34,862 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.83 vs. 
limit=22.5 2023-10-09 18:38:43,219 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.697e+02 2.184e+02 2.413e+02 2.807e+02 3.783e+02, threshold=4.827e+02, percent-clipped=0.0 2023-10-09 18:38:44,610 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=96203.33333333333, ans=0.04949747468305833 2023-10-09 18:38:47,725 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.12 vs. limit=22.5 2023-10-09 18:38:53,421 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=96250.0, ans=0.015 2023-10-09 18:38:56,446 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=96250.0, ans=0.125 2023-10-09 18:39:01,998 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.60 vs. limit=10.0 2023-10-09 18:39:19,968 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=96343.33333333333, ans=0.125 2023-10-09 18:39:26,117 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=96343.33333333333, ans=0.125 2023-10-09 18:39:28,345 INFO [train.py:1031] (3/4) Epoch 2, batch 7000, loss[loss=0.2765, simple_loss=0.3564, pruned_loss=0.0983, over 16888.00 frames. ], tot_loss[loss=0.3046, simple_loss=0.3692, pruned_loss=0.12, over 31791099.71 frames. ], batch size: 104, lr: 2.21e-02, grad_scale: 32.0 2023-10-09 18:40:15,606 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=96530.0, ans=0.125 2023-10-09 18:40:15,706 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=96530.0, ans=0.0 2023-10-09 18:40:18,295 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.74 vs. limit=10.0 2023-10-09 18:40:27,383 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=96576.66666666667, ans=0.0 2023-10-09 18:40:30,735 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=96576.66666666667, ans=0.0 2023-10-09 18:40:45,901 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.724e+02 2.087e+02 2.382e+02 2.743e+02 4.279e+02, threshold=4.765e+02, percent-clipped=0.0 2023-10-09 18:40:56,554 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=96716.66666666667, ans=0.0 2023-10-09 18:40:58,537 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 18:41:02,122 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.12 vs. 
limit=10.0 2023-10-09 18:41:06,265 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=96716.66666666667, ans=0.125 2023-10-09 18:41:09,023 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=96763.33333333333, ans=0.0 2023-10-09 18:41:19,596 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=96810.0, ans=0.5 2023-10-09 18:41:21,582 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=96810.0, ans=0.0 2023-10-09 18:41:25,450 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=96810.0, ans=0.1 2023-10-09 18:41:28,091 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=96856.66666666667, ans=0.125 2023-10-09 18:42:10,251 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=96996.66666666667, ans=0.0 2023-10-09 18:42:15,187 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=96996.66666666667, ans=0.0 2023-10-09 18:42:26,070 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.12 vs. limit=15.0 2023-10-09 18:42:26,719 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=97043.33333333333, ans=0.125 2023-10-09 18:42:26,727 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=97043.33333333333, ans=0.1 2023-10-09 18:42:34,312 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=97090.0, ans=0.0 2023-10-09 18:42:36,297 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=97090.0, ans=0.125 2023-10-09 18:42:41,755 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.651e+02 2.102e+02 2.329e+02 2.621e+02 4.367e+02, threshold=4.657e+02, percent-clipped=0.0 2023-10-09 18:42:59,081 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=97183.33333333333, ans=0.1 2023-10-09 18:43:03,715 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=97230.0, ans=0.0 2023-10-09 18:43:03,831 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.78 vs. limit=12.0 2023-10-09 18:43:14,639 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=97276.66666666667, ans=0.125 2023-10-09 18:43:19,462 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.51 vs. 
limit=15.0 2023-10-09 18:43:23,537 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=97276.66666666667, ans=0.125 2023-10-09 18:43:48,916 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=97370.0, ans=0.125 2023-10-09 18:43:50,441 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=97370.0, ans=0.0 2023-10-09 18:44:06,992 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=97416.66666666667, ans=0.0 2023-10-09 18:44:23,400 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.78 vs. limit=15.0 2023-10-09 18:44:25,535 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=97510.0, ans=0.125 2023-10-09 18:44:36,835 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.57 vs. limit=15.0 2023-10-09 18:44:49,218 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.22 vs. limit=15.0 2023-10-09 18:44:50,780 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=97603.33333333333, ans=0.0 2023-10-09 18:44:51,096 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.08 vs. limit=15.0 2023-10-09 18:44:51,293 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.539e+02 2.121e+02 2.342e+02 2.687e+02 3.873e+02, threshold=4.685e+02, percent-clipped=0.0 2023-10-09 18:45:04,230 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.56 vs. limit=15.0 2023-10-09 18:45:07,527 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=97650.0, ans=0.125 2023-10-09 18:45:24,376 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=97743.33333333333, ans=0.2 2023-10-09 18:45:28,147 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=97743.33333333333, ans=0.0 2023-10-09 18:45:55,591 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=97836.66666666667, ans=0.125 2023-10-09 18:46:03,499 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=97883.33333333333, ans=0.2 2023-10-09 18:46:03,910 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.22 vs. limit=15.0 2023-10-09 18:46:14,215 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=97930.0, ans=0.125 2023-10-09 18:46:14,577 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.48 vs. 
2023-10-09 18:46:26,158 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=97976.66666666667, ans=0.025
2023-10-09 18:46:43,983 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.95 vs. limit=12.0
2023-10-09 18:46:51,507 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.497e+02 1.968e+02 2.283e+02 2.806e+02 4.340e+02, threshold=4.567e+02, percent-clipped=0.0
2023-10-09 18:47:06,354 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=98116.66666666667, ans=0.125
2023-10-09 18:47:12,674 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=98116.66666666667, ans=0.125
2023-10-09 18:47:34,157 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.87 vs. limit=22.5
2023-10-09 18:47:36,503 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=98256.66666666667, ans=0.125
2023-10-09 18:48:19,543 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=98443.33333333333, ans=0.1
2023-10-09 18:48:40,252 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.588e+02 2.086e+02 2.361e+02 2.719e+02 4.040e+02, threshold=4.721e+02, percent-clipped=0.0
2023-10-09 18:48:41,430 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=98536.66666666667, ans=0.1
2023-10-09 18:48:55,677 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=98583.33333333333, ans=0.04949747468305833
2023-10-09 18:49:05,687 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=98630.0, ans=0.125
2023-10-09 18:49:12,602 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=98676.66666666667, ans=0.125
2023-10-09 18:49:24,297 INFO [train.py:1031] (3/4) Epoch 2, batch 7500, loss[loss=0.2638, simple_loss=0.3373, pruned_loss=0.09515, over 16068.00 frames. ], tot_loss[loss=0.3035, simple_loss=0.3684, pruned_loss=0.1193, over 32019605.92 frames. ], batch size: 43, lr: 2.19e-02, grad_scale: 32.0
2023-10-09 18:49:39,186 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.95 vs. limit=10.0
2023-10-09 18:49:51,999 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.67 vs. limit=22.5
2023-10-09 18:49:54,081 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=98816.66666666667, ans=0.125
2023-10-09 18:50:02,987 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.95 vs. limit=15.0
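[Editor's note: the [train.py:1031] lines print a per-batch loss and a frame-weighted running tot_loss. The printed numbers are consistent with combining the two transducer terms as 0.5 * simple_loss + pruned_loss once warm-up is over: 0.5 * 0.3373 + 0.09515 = 0.2638 and 0.5 * 0.3684 + 0.1193 = 0.3035, exactly as logged above. A small sketch, with the scales treated as assumptions checked only against these lines:]

class FrameWeightedAverage:
    """Maintains tot_loss as a running average weighted by frame count,
    matching the 'over N frames' bookkeeping in the log."""
    def __init__(self):
        self.total = 0.0   # sum of loss * frames
        self.frames = 0.0  # total frames seen

    def update(self, loss: float, num_frames: float) -> float:
        self.total += loss * num_frames
        self.frames += num_frames
        return self.total / self.frames  # current tot_loss

def combined_loss(simple_loss, pruned_loss,
                  simple_scale=0.5, pruned_scale=1.0):
    # Post-warm-up scales inferred from the logged numbers; during
    # warm-up the simple (non-pruned) term would be weighted higher.
    return simple_scale * simple_loss + pruned_scale * pruned_loss

tot = FrameWeightedAverage()
print(tot.update(combined_loss(0.3373, 0.09515), 16068.0))  # -> 0.2638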
2023-10-09 18:50:34,524 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=99003.33333333333, ans=0.125
2023-10-09 18:50:35,135 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.741e+02 2.126e+02 2.486e+02 3.068e+02 5.361e+02, threshold=4.972e+02, percent-clipped=5.0
2023-10-09 18:50:46,737 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.07 vs. limit=15.0
2023-10-09 18:51:08,126 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=99143.33333333333, ans=0.1
2023-10-09 18:51:08,244 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=99143.33333333333, ans=0.125
2023-10-09 18:51:37,031 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=99236.66666666667, ans=0.1
2023-10-09 18:51:58,684 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=99330.0, ans=0.2
2023-10-09 18:52:03,148 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=99330.0, ans=0.125
2023-10-09 18:52:35,144 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=99423.33333333333, ans=0.0
2023-10-09 18:52:37,245 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.540e+02 2.029e+02 2.260e+02 2.569e+02 4.157e+02, threshold=4.520e+02, percent-clipped=0.0
2023-10-09 18:52:43,838 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=6.26 vs. limit=15.0
2023-10-09 18:52:45,610 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.96 vs. limit=12.0
2023-10-09 18:53:17,207 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.28 vs. limit=5.0
2023-10-09 18:54:16,557 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=99890.0, ans=0.1
2023-10-09 18:54:21,676 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=99890.0, ans=0.015
2023-10-09 18:54:21,784 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=99890.0, ans=0.0
2023-10-09 18:54:28,713 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.528e+02 1.932e+02 2.184e+02 2.548e+02 4.032e+02, threshold=4.368e+02, percent-clipped=0.0
2023-10-09 18:54:43,528 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=99983.33333333333, ans=0.2
2023-10-09 18:54:45,429 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.62 vs. limit=15.0
2023-10-09 18:54:51,325 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.35 vs. limit=15.0
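[Editor's note: each [optim.py:471] line summarizes recent gradient norms as five quantiles (min, 25%, median, 75%, max) plus a clipping threshold and the fraction of batches clipped. The logged thresholds are exactly Clipping_scale (2.0) times the logged medians, e.g. 2.0 * 2.486e+02 = 4.972e+02 above. A sketch of that bookkeeping follows; the window size and per-step reporting are assumptions, only the threshold relation is read off the log.]

import numpy as np

class GradNormMonitor:
    """Tracks recent gradient norms and derives a clipping threshold
    as clipping_scale * median of the window."""
    def __init__(self, clipping_scale=2.0, window=1024):
        self.scale = clipping_scale
        self.window = window
        self.norms = []
        self.clipped = 0
        self.seen = 0

    def update(self, grad_norm: float) -> float:
        self.norms = (self.norms + [grad_norm])[-self.window:]
        q = np.quantile(self.norms, [0.0, 0.25, 0.5, 0.75, 1.0])
        threshold = self.scale * q[2]  # 2.0 x median, as in the log
        self.seen += 1
        self.clipped += grad_norm > threshold
        print(f"Clipping_scale={self.scale}, grad-norm quartiles "
              + " ".join(f"{v:.3e}" for v in q)
              + f", threshold={threshold:.3e}"
              + f", percent-clipped={100.0 * self.clipped / self.seen}")
        # Factor by which to scale this step's gradients:
        return min(1.0, threshold / max(grad_norm, 1e-20))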
2023-10-09 18:54:52,183 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=99983.33333333333, ans=0.0
2023-10-09 18:55:00,679 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.18 vs. limit=10.0
2023-10-09 18:55:23,047 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=100123.33333333333, ans=0.05
2023-10-09 18:55:25,939 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=100123.33333333333, ans=0.0
2023-10-09 18:55:28,364 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.33 vs. limit=15.0
2023-10-09 18:55:31,445 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=100170.0, ans=0.2
2023-10-09 18:55:39,467 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.40 vs. limit=22.5
2023-10-09 18:56:00,026 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=100263.33333333333, ans=0.125
2023-10-09 18:56:12,050 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=100310.0, ans=0.0
2023-10-09 18:56:27,718 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=100356.66666666667, ans=0.0
2023-10-09 18:56:28,047 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.12 vs. limit=15.0
2023-10-09 18:56:32,513 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.712e+02 2.242e+02 2.671e+02 3.272e+02 4.658e+02, threshold=5.342e+02, percent-clipped=4.0
2023-10-09 18:56:42,630 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.41 vs. limit=22.5
2023-10-09 18:56:45,358 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=100450.0, ans=0.0
2023-10-09 18:56:53,460 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=100450.0, ans=0.125
2023-10-09 18:56:57,020 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=100496.66666666667, ans=0.125
2023-10-09 18:57:18,032 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=100590.0, ans=0.2
2023-10-09 18:57:46,832 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=100683.33333333333, ans=0.125
2023-10-09 18:57:47,034 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.05 vs. limit=22.5
2023-10-09 18:58:02,715 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.24 vs. limit=15.0
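[Editor's note: grad_scale in the [train.py:1031] lines is the fp16 loss scale. It sits at 32.0 around batches 7500 and 8000, reaches 64.0 by batch 9000, and is back at 32.0 by batch 9500, which is the signature of dynamic loss scaling: the scale grows after a run of overflow-free steps and backs off when an overflow is detected. A GradScaler-style sketch, with the growth interval and factors as assumed defaults rather than values read from this log:]

class DynamicLossScaler:
    """Minimal dynamic fp16 loss scaling in the style of
    torch.cuda.amp.GradScaler."""
    def __init__(self, init_scale=32.0, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, found_inf: bool) -> None:
        if found_inf:
            # Overflow: halve the scale and restart the counter
            # (e.g. 64.0 -> 32.0, as seen between batches 9000 and 9500).
            self.scale *= self.backoff_factor
            self._good_steps = 0
        else:
            self._good_steps += 1
            if self._good_steps % self.growth_interval == 0:
                # A long clean run doubles the scale (32.0 -> 64.0).
                self.scale *= self.growth_factor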
2023-10-09 18:58:07,878 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.67 vs. limit=15.0
2023-10-09 18:58:26,739 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.506e+02 1.879e+02 2.190e+02 2.546e+02 4.210e+02, threshold=4.380e+02, percent-clipped=0.0
2023-10-09 18:58:33,415 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=100870.0, ans=0.1
2023-10-09 18:59:06,643 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=101010.0, ans=0.1
2023-10-09 18:59:17,652 INFO [train.py:1031] (3/4) Epoch 2, batch 8000, loss[loss=0.3413, simple_loss=0.3706, pruned_loss=0.156, over 15596.00 frames. ], tot_loss[loss=0.3013, simple_loss=0.3668, pruned_loss=0.1179, over 32176077.81 frames. ], batch size: 350, lr: 2.16e-02, grad_scale: 32.0
2023-10-09 18:59:34,491 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.58 vs. limit=15.0
2023-10-09 18:59:35,106 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=101103.33333333333, ans=0.125
2023-10-09 18:59:55,196 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=101196.66666666667, ans=0.125
2023-10-09 18:59:58,537 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=101196.66666666667, ans=0.125
2023-10-09 19:00:04,706 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=101243.33333333333, ans=0.0
2023-10-09 19:00:14,214 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=101290.0, ans=0.0
2023-10-09 19:00:22,756 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.81 vs. limit=6.0
2023-10-09 19:00:22,849 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.65 vs. limit=15.0
2023-10-09 19:00:24,576 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.530e+02 1.970e+02 2.222e+02 2.628e+02 4.102e+02, threshold=4.444e+02, percent-clipped=0.0
2023-10-09 19:00:25,058 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.05 vs.
limit=22.5 2023-10-09 19:00:25,970 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=101336.66666666667, ans=0.125 2023-10-09 19:00:29,561 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=101336.66666666667, ans=0.2 2023-10-09 19:00:30,390 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=101336.66666666667, ans=0.125 2023-10-09 19:00:35,958 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=101383.33333333333, ans=0.0 2023-10-09 19:00:36,062 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.min_positive, batch_count=101383.33333333333, ans=0.025 2023-10-09 19:00:42,819 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=101383.33333333333, ans=0.125 2023-10-09 19:00:59,904 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=101476.66666666667, ans=0.07 2023-10-09 19:01:03,281 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=101476.66666666667, ans=0.0 2023-10-09 19:01:22,045 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=101570.0, ans=0.125 2023-10-09 19:02:21,853 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.573e+02 2.056e+02 2.325e+02 2.662e+02 3.495e+02, threshold=4.651e+02, percent-clipped=0.0 2023-10-09 19:02:36,804 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=101850.0, ans=0.0 2023-10-09 19:02:45,122 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=101850.0, ans=0.125 2023-10-09 19:02:58,569 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=101896.66666666667, ans=0.1 2023-10-09 19:03:18,044 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=101990.0, ans=0.0 2023-10-09 19:03:28,017 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=101990.0, ans=0.125 2023-10-09 19:03:28,103 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=101990.0, ans=0.125 2023-10-09 19:03:41,813 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=102083.33333333333, ans=0.07 2023-10-09 19:03:51,744 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=102083.33333333333, ans=0.1 2023-10-09 19:04:04,805 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=102176.66666666667, ans=0.07 2023-10-09 19:04:16,701 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=102223.33333333333, ans=0.0 2023-10-09 19:04:27,433 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.534e+02 2.008e+02 2.236e+02 
2.608e+02 4.426e+02, threshold=4.472e+02, percent-clipped=0.0 2023-10-09 19:04:31,017 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=102270.0, ans=0.07 2023-10-09 19:05:12,514 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=102410.0, ans=0.125 2023-10-09 19:05:13,493 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=102410.0, ans=0.125 2023-10-09 19:05:17,963 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=102456.66666666667, ans=0.125 2023-10-09 19:05:19,026 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=102456.66666666667, ans=0.1 2023-10-09 19:05:31,571 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=102503.33333333333, ans=0.2 2023-10-09 19:05:32,624 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.18 vs. limit=15.0 2023-10-09 19:05:50,860 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.25 vs. limit=15.0 2023-10-09 19:06:00,391 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=102643.33333333333, ans=0.1 2023-10-09 19:06:13,643 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.17 vs. limit=15.0 2023-10-09 19:06:16,938 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=102690.0, ans=0.0 2023-10-09 19:06:22,721 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.663e+02 1.978e+02 2.282e+02 2.706e+02 4.029e+02, threshold=4.564e+02, percent-clipped=0.0 2023-10-09 19:06:23,435 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.39 vs. 
limit=22.5 2023-10-09 19:06:35,708 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=102783.33333333333, ans=0.125 2023-10-09 19:06:55,226 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=102876.66666666667, ans=0.2 2023-10-09 19:06:56,638 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=102876.66666666667, ans=0.5 2023-10-09 19:06:57,615 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=102876.66666666667, ans=0.125 2023-10-09 19:07:00,394 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=102876.66666666667, ans=0.2 2023-10-09 19:07:11,863 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=102923.33333333333, ans=0.5 2023-10-09 19:07:23,809 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=102970.0, ans=0.125 2023-10-09 19:07:47,832 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.01 vs. limit=15.0 2023-10-09 19:07:50,909 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=103063.33333333333, ans=0.0 2023-10-09 19:07:59,973 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=103110.0, ans=0.2 2023-10-09 19:08:00,890 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=103110.0, ans=0.0 2023-10-09 19:08:02,536 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=103110.0, ans=0.2 2023-10-09 19:08:09,707 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=103156.66666666667, ans=0.125 2023-10-09 19:08:19,259 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.635e+02 2.094e+02 2.434e+02 2.810e+02 4.004e+02, threshold=4.869e+02, percent-clipped=0.0 2023-10-09 19:08:22,867 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=103203.33333333333, ans=0.025 2023-10-09 19:08:31,355 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=103250.0, ans=0.2 2023-10-09 19:08:46,681 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=103296.66666666667, ans=0.04949747468305833 2023-10-09 19:09:07,684 INFO [train.py:1031] (3/4) Epoch 2, batch 8500, loss[loss=0.3065, simple_loss=0.3694, pruned_loss=0.1218, over 16889.00 frames. ], tot_loss[loss=0.3002, simple_loss=0.3663, pruned_loss=0.117, over 32328481.08 frames. 
], batch size: 110, lr: 2.14e-02, grad_scale: 32.0 2023-10-09 19:09:08,946 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=103390.0, ans=0.0 2023-10-09 19:09:09,832 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=103390.0, ans=0.1 2023-10-09 19:09:30,954 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=103483.33333333333, ans=0.1 2023-10-09 19:09:42,723 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 19:09:54,598 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=103576.66666666667, ans=0.1 2023-10-09 19:10:10,874 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.672e+02 2.183e+02 2.488e+02 2.913e+02 3.994e+02, threshold=4.977e+02, percent-clipped=0.0 2023-10-09 19:10:13,985 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=103670.0, ans=0.125 2023-10-09 19:10:16,988 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=103670.0, ans=0.0 2023-10-09 19:10:34,779 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=103763.33333333333, ans=0.0 2023-10-09 19:10:41,629 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=103763.33333333333, ans=0.05 2023-10-09 19:11:14,912 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=103903.33333333333, ans=0.125 2023-10-09 19:11:26,670 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=103950.0, ans=0.125 2023-10-09 19:11:39,028 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=103996.66666666667, ans=0.0 2023-10-09 19:11:43,265 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=103996.66666666667, ans=0.125 2023-10-09 19:12:21,492 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.411e+02 1.904e+02 2.220e+02 2.528e+02 3.906e+02, threshold=4.440e+02, percent-clipped=0.0 2023-10-09 19:12:54,107 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=104230.0, ans=0.125 2023-10-09 19:13:09,781 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=104276.66666666667, ans=0.125 2023-10-09 19:13:11,594 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=104276.66666666667, ans=0.5 2023-10-09 19:13:52,006 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=104463.33333333333, ans=0.0 2023-10-09 19:13:56,115 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.35 vs. 
limit=22.5 2023-10-09 19:14:04,032 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=104510.0, ans=0.2 2023-10-09 19:14:04,063 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=104510.0, ans=0.0 2023-10-09 19:14:06,359 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=104510.0, ans=0.0 2023-10-09 19:14:09,575 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.29 vs. limit=15.0 2023-10-09 19:14:14,934 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=104556.66666666667, ans=0.2 2023-10-09 19:14:17,962 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.18 vs. limit=15.0 2023-10-09 19:14:23,222 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.593e+02 2.086e+02 2.326e+02 2.682e+02 3.738e+02, threshold=4.653e+02, percent-clipped=0.0 2023-10-09 19:14:27,855 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.73 vs. limit=12.0 2023-10-09 19:14:44,522 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=104650.0, ans=0.05 2023-10-09 19:15:04,026 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.90 vs. limit=15.0 2023-10-09 19:15:07,383 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 19:16:18,691 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.462e+02 2.027e+02 2.311e+02 2.736e+02 3.819e+02, threshold=4.621e+02, percent-clipped=0.0 2023-10-09 19:16:22,133 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.44 vs. limit=15.0 2023-10-09 19:16:48,592 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=105163.33333333333, ans=0.125 2023-10-09 19:16:53,067 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=105210.0, ans=0.0 2023-10-09 19:17:25,439 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=105350.0, ans=0.1 2023-10-09 19:17:28,324 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=105350.0, ans=0.125 2023-10-09 19:17:35,849 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.27 vs. limit=15.0 2023-10-09 19:17:41,814 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.71 vs. 
limit=12.0 2023-10-09 19:17:47,690 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=105443.33333333333, ans=0.0 2023-10-09 19:17:56,103 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=105443.33333333333, ans=0.125 2023-10-09 19:17:57,611 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=105490.0, ans=10.0 2023-10-09 19:18:08,346 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.02 vs. limit=15.0 2023-10-09 19:18:08,739 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.590e+02 2.011e+02 2.291e+02 2.529e+02 4.186e+02, threshold=4.582e+02, percent-clipped=0.0 2023-10-09 19:18:14,008 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=105536.66666666667, ans=0.025 2023-10-09 19:18:39,814 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=105630.0, ans=0.125 2023-10-09 19:18:40,741 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=105630.0, ans=0.1 2023-10-09 19:18:42,461 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=105676.66666666667, ans=0.1 2023-10-09 19:18:55,808 INFO [train.py:1031] (3/4) Epoch 2, batch 9000, loss[loss=0.2924, simple_loss=0.3652, pruned_loss=0.1098, over 16898.00 frames. ], tot_loss[loss=0.2983, simple_loss=0.3648, pruned_loss=0.1159, over 32469941.05 frames. ], batch size: 93, lr: 2.12e-02, grad_scale: 64.0 2023-10-09 19:19:00,131 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=105723.33333333333, ans=0.125 2023-10-09 19:19:01,204 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=105723.33333333333, ans=0.0 2023-10-09 19:19:07,809 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=105770.0, ans=0.0 2023-10-09 19:19:23,406 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=105816.66666666667, ans=0.1 2023-10-09 19:19:29,215 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=105863.33333333333, ans=0.125 2023-10-09 19:19:37,765 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=105863.33333333333, ans=0.0 2023-10-09 19:19:42,279 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.13 vs. limit=15.0 2023-10-09 19:19:49,418 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=105956.66666666667, ans=0.0 2023-10-09 19:19:52,653 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.37 vs. 
limit=8.0 2023-10-09 19:20:02,053 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.636e+02 2.083e+02 2.370e+02 2.904e+02 4.740e+02, threshold=4.740e+02, percent-clipped=1.0 2023-10-09 19:20:22,813 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=106096.66666666667, ans=0.125 2023-10-09 19:20:27,254 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=106096.66666666667, ans=0.2 2023-10-09 19:20:33,816 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=106143.33333333333, ans=0.95 2023-10-09 19:20:39,266 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=106143.33333333333, ans=0.125 2023-10-09 19:20:54,026 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=106190.0, ans=0.125 2023-10-09 19:21:03,989 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=106236.66666666667, ans=0.0 2023-10-09 19:21:13,222 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=106283.33333333333, ans=0.0 2023-10-09 19:21:23,224 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=106330.0, ans=0.0 2023-10-09 19:21:25,095 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=106330.0, ans=0.125 2023-10-09 19:21:26,006 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=106330.0, ans=0.0 2023-10-09 19:21:47,232 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=106423.33333333333, ans=0.125 2023-10-09 19:21:51,779 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.516e+02 2.116e+02 2.519e+02 3.022e+02 5.639e+02, threshold=5.039e+02, percent-clipped=2.0 2023-10-09 19:22:12,094 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=106563.33333333333, ans=0.0 2023-10-09 19:22:58,711 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=106750.0, ans=0.2 2023-10-09 19:23:01,605 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.01 vs. limit=10.0 2023-10-09 19:23:15,386 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=106796.66666666667, ans=0.0 2023-10-09 19:23:37,510 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=106890.0, ans=0.125 2023-10-09 19:23:40,962 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.508e+02 2.018e+02 2.303e+02 2.777e+02 4.119e+02, threshold=4.606e+02, percent-clipped=0.0 2023-10-09 19:24:04,741 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.24 vs. 
limit=12.0 2023-10-09 19:24:15,799 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=107076.66666666667, ans=0.125 2023-10-09 19:24:30,414 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=107123.33333333333, ans=0.125 2023-10-09 19:24:32,564 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=107170.0, ans=0.125 2023-10-09 19:24:36,031 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=107170.0, ans=0.1 2023-10-09 19:24:39,350 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=107170.0, ans=0.1 2023-10-09 19:24:39,438 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=107170.0, ans=0.1 2023-10-09 19:25:05,552 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=107263.33333333333, ans=0.125 2023-10-09 19:25:07,125 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=11.83 vs. limit=12.0 2023-10-09 19:25:38,748 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.517e+02 2.245e+02 2.529e+02 2.814e+02 4.044e+02, threshold=5.058e+02, percent-clipped=0.0 2023-10-09 19:25:54,094 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=107450.0, ans=0.0 2023-10-09 19:26:24,320 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=107543.33333333333, ans=0.0 2023-10-09 19:26:26,486 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=107543.33333333333, ans=0.2 2023-10-09 19:26:29,948 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=107590.0, ans=0.07 2023-10-09 19:26:36,047 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=107590.0, ans=0.1 2023-10-09 19:26:46,008 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=107636.66666666667, ans=0.0 2023-10-09 19:26:46,114 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=107636.66666666667, ans=0.125 2023-10-09 19:26:47,733 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=107636.66666666667, ans=0.125 2023-10-09 19:26:48,632 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 19:26:57,653 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.52 vs. limit=12.0 2023-10-09 19:27:01,882 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.61 vs. limit=6.0 2023-10-09 19:27:13,510 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=17.84 vs. 
limit=22.5 2023-10-09 19:27:19,600 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=107776.66666666667, ans=0.2 2023-10-09 19:27:34,534 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=107823.33333333333, ans=0.1 2023-10-09 19:27:36,717 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=107823.33333333333, ans=0.125 2023-10-09 19:27:40,093 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.490e+02 1.999e+02 2.255e+02 2.541e+02 3.520e+02, threshold=4.511e+02, percent-clipped=0.0 2023-10-09 19:27:52,816 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=107916.66666666667, ans=0.125 2023-10-09 19:28:00,940 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=107916.66666666667, ans=0.125 2023-10-09 19:28:02,515 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.89 vs. limit=6.0 2023-10-09 19:28:11,620 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.75 vs. limit=15.0 2023-10-09 19:28:13,064 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=107963.33333333333, ans=0.1 2023-10-09 19:28:27,947 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=108056.66666666667, ans=0.1 2023-10-09 19:28:28,638 INFO [train.py:1031] (3/4) Epoch 2, batch 9500, loss[loss=0.3111, simple_loss=0.3804, pruned_loss=0.1209, over 16841.00 frames. ], tot_loss[loss=0.2984, simple_loss=0.365, pruned_loss=0.1158, over 32538502.94 frames. ], batch size: 175, lr: 2.10e-02, grad_scale: 32.0 2023-10-09 19:28:46,210 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=108103.33333333333, ans=0.125 2023-10-09 19:28:54,258 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=108150.0, ans=0.0 2023-10-09 19:28:57,456 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.03 vs. limit=15.0 2023-10-09 19:29:05,828 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=108196.66666666667, ans=0.2 2023-10-09 19:29:08,816 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.92 vs. limit=22.5 2023-10-09 19:29:16,396 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=108243.33333333333, ans=0.125 2023-10-09 19:29:21,276 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=108243.33333333333, ans=0.125 2023-10-09 19:29:33,505 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.06 vs. 
limit=15.0 2023-10-09 19:29:34,793 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=108290.0, ans=0.125 2023-10-09 19:29:38,560 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.493e+02 2.174e+02 2.595e+02 3.130e+02 5.286e+02, threshold=5.190e+02, percent-clipped=1.0 2023-10-09 19:29:40,747 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=108336.66666666667, ans=0.1 2023-10-09 19:29:49,943 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=108383.33333333333, ans=0.0 2023-10-09 19:30:11,324 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=108476.66666666667, ans=0.125 2023-10-09 19:30:29,542 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=108523.33333333333, ans=0.125 2023-10-09 19:30:42,123 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=108570.0, ans=0.125 2023-10-09 19:31:06,642 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=108710.0, ans=0.125 2023-10-09 19:31:14,657 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.89 vs. limit=10.0 2023-10-09 19:31:25,585 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=108756.66666666667, ans=0.1 2023-10-09 19:31:31,810 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.522e+02 2.017e+02 2.247e+02 2.693e+02 3.556e+02, threshold=4.494e+02, percent-clipped=0.0 2023-10-09 19:31:35,020 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=108803.33333333333, ans=0.125 2023-10-09 19:31:46,635 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 19:31:47,649 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=108850.0, ans=0.1 2023-10-09 19:31:54,121 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=108896.66666666667, ans=0.125 2023-10-09 19:32:04,809 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=108943.33333333333, ans=0.125 2023-10-09 19:32:24,011 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.30 vs. 
limit=12.0 2023-10-09 19:32:27,692 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=109036.66666666667, ans=0.125 2023-10-09 19:32:42,245 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 19:32:45,844 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=109083.33333333333, ans=0.125 2023-10-09 19:32:51,561 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.16 vs. limit=15.0 2023-10-09 19:33:00,328 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=109176.66666666667, ans=0.125 2023-10-09 19:33:16,850 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.75 vs. limit=6.0 2023-10-09 19:33:21,320 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=109270.0, ans=0.07 2023-10-09 19:33:22,912 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.578e+02 1.995e+02 2.294e+02 2.600e+02 3.373e+02, threshold=4.588e+02, percent-clipped=0.0 2023-10-09 19:33:29,881 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.71 vs. limit=12.0 2023-10-09 19:33:30,667 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=109316.66666666667, ans=0.1 2023-10-09 19:34:23,270 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=109503.33333333333, ans=0.04949747468305833 2023-10-09 19:34:37,919 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.91 vs. limit=15.0 2023-10-09 19:34:41,268 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=109596.66666666667, ans=0.04949747468305833 2023-10-09 19:34:57,217 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=109643.33333333333, ans=0.1 2023-10-09 19:35:02,407 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=109690.0, ans=0.0 2023-10-09 19:35:15,844 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.626e+02 2.071e+02 2.274e+02 2.604e+02 3.820e+02, threshold=4.549e+02, percent-clipped=0.0 2023-10-09 19:35:39,776 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.41 vs. limit=12.0 2023-10-09 19:35:41,898 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=109830.0, ans=0.2 2023-10-09 19:35:53,576 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.29 vs. 
limit=15.0 2023-10-09 19:36:19,009 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=109970.0, ans=0.125 2023-10-09 19:36:36,389 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=110063.33333333333, ans=0.125 2023-10-09 19:36:51,039 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=110110.0, ans=0.5 2023-10-09 19:36:56,062 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=110156.66666666667, ans=0.0 2023-10-09 19:37:04,286 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=110203.33333333333, ans=0.2 2023-10-09 19:37:04,557 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.65 vs. limit=22.5 2023-10-09 19:37:07,040 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.566e+02 1.892e+02 2.133e+02 2.333e+02 3.209e+02, threshold=4.267e+02, percent-clipped=0.0 2023-10-09 19:37:15,444 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=110203.33333333333, ans=0.0 2023-10-09 19:37:34,425 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=110296.66666666667, ans=0.2 2023-10-09 19:37:34,570 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.12 vs. limit=15.0 2023-10-09 19:37:48,330 INFO [train.py:1031] (3/4) Epoch 2, batch 10000, loss[loss=0.2831, simple_loss=0.3527, pruned_loss=0.1068, over 15574.00 frames. ], tot_loss[loss=0.2965, simple_loss=0.3634, pruned_loss=0.1148, over 32597334.84 frames. ], batch size: 35, lr: 2.08e-02, grad_scale: 32.0 2023-10-09 19:38:11,109 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=110483.33333333333, ans=0.125 2023-10-09 19:38:15,681 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=110483.33333333333, ans=15.0 2023-10-09 19:38:21,845 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.79 vs. 
limit=12.0 2023-10-09 19:38:41,175 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=110576.66666666667, ans=0.125 2023-10-09 19:38:44,107 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=110623.33333333333, ans=0.125 2023-10-09 19:38:56,078 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.705e+02 2.126e+02 2.323e+02 2.635e+02 3.788e+02, threshold=4.646e+02, percent-clipped=0.0 2023-10-09 19:39:10,887 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=110716.66666666667, ans=0.125 2023-10-09 19:39:17,969 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=110763.33333333333, ans=0.125 2023-10-09 19:40:02,251 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=110903.33333333333, ans=0.1 2023-10-09 19:40:03,117 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=110950.0, ans=0.035 2023-10-09 19:40:05,724 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=110950.0, ans=0.125 2023-10-09 19:40:27,124 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=111043.33333333333, ans=0.0 2023-10-09 19:40:41,645 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=111090.0, ans=0.125 2023-10-09 19:40:50,258 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.609e+02 2.043e+02 2.430e+02 2.863e+02 4.874e+02, threshold=4.859e+02, percent-clipped=1.0 2023-10-09 19:40:51,450 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=111136.66666666667, ans=0.125 2023-10-09 19:40:55,765 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=111136.66666666667, ans=0.0 2023-10-09 19:41:03,006 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=111183.33333333333, ans=0.125 2023-10-09 19:41:11,875 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=111230.0, ans=0.125 2023-10-09 19:41:24,190 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=111276.66666666667, ans=0.125 2023-10-09 19:41:55,429 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.48 vs. 
limit=22.5 2023-10-09 19:42:00,212 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=111416.66666666667, ans=0.125 2023-10-09 19:42:02,822 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=111416.66666666667, ans=0.2 2023-10-09 19:42:08,060 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=111416.66666666667, ans=0.2 2023-10-09 19:42:11,788 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=111463.33333333333, ans=0.125 2023-10-09 19:42:27,760 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=111510.0, ans=0.2 2023-10-09 19:42:37,530 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=111556.66666666667, ans=0.2 2023-10-09 19:42:44,213 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.644e+02 2.080e+02 2.350e+02 2.749e+02 4.092e+02, threshold=4.700e+02, percent-clipped=0.0 2023-10-09 19:42:49,270 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.03 vs. limit=15.0 2023-10-09 19:42:55,134 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.23 vs. limit=15.0 2023-10-09 19:43:08,317 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=111696.66666666667, ans=0.125 2023-10-09 19:43:11,125 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=111696.66666666667, ans=0.1 2023-10-09 19:43:24,393 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1.whitening_limit, batch_count=111743.33333333333, ans=10.0 2023-10-09 19:43:38,657 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.34 vs. limit=15.0 2023-10-09 19:44:09,493 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=111930.0, ans=0.125 2023-10-09 19:44:10,512 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=111930.0, ans=0.2 2023-10-09 19:44:11,319 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=111930.0, ans=0.0 2023-10-09 19:44:18,957 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.75 vs. 
limit=15.0 2023-10-09 19:44:29,617 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=111976.66666666667, ans=0.125 2023-10-09 19:44:44,730 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.624e+02 1.935e+02 2.314e+02 2.574e+02 4.222e+02, threshold=4.628e+02, percent-clipped=0.0 2023-10-09 19:44:48,170 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=112070.0, ans=0.1 2023-10-09 19:44:58,939 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=112116.66666666667, ans=0.125 2023-10-09 19:45:00,504 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=112116.66666666667, ans=0.125 2023-10-09 19:45:07,984 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=112163.33333333333, ans=0.04949747468305833 2023-10-09 19:45:25,467 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=112210.0, ans=0.07 2023-10-09 19:45:28,289 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=112256.66666666667, ans=0.0 2023-10-09 19:45:28,579 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.28 vs. limit=15.0 2023-10-09 19:45:33,369 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=112256.66666666667, ans=0.1 2023-10-09 19:45:33,797 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.30 vs. limit=15.0 2023-10-09 19:45:36,071 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=112256.66666666667, ans=0.0 2023-10-09 19:45:36,340 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.85 vs. limit=15.0 2023-10-09 19:45:40,855 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=112303.33333333333, ans=0.125 2023-10-09 19:45:47,407 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.69 vs. limit=22.5 2023-10-09 19:46:06,838 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.14 vs. 
limit=22.5 2023-10-09 19:46:10,359 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=112396.66666666667, ans=0.0 2023-10-09 19:46:18,345 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=112443.33333333333, ans=0.125 2023-10-09 19:46:21,346 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=112443.33333333333, ans=0.0 2023-10-09 19:46:40,129 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=112536.66666666667, ans=0.0 2023-10-09 19:46:41,683 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.637e+02 2.102e+02 2.406e+02 2.971e+02 5.102e+02, threshold=4.812e+02, percent-clipped=2.0 2023-10-09 19:46:43,152 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=112536.66666666667, ans=0.125 2023-10-09 19:46:48,625 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.09 vs. limit=22.5 2023-10-09 19:46:51,890 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=112583.33333333333, ans=0.125 2023-10-09 19:47:22,787 INFO [train.py:1031] (3/4) Epoch 2, batch 10500, loss[loss=0.2657, simple_loss=0.31, pruned_loss=0.1107, over 12539.00 frames. ], tot_loss[loss=0.2961, simple_loss=0.3633, pruned_loss=0.1144, over 32671958.30 frames. ], batch size: 440, lr: 2.06e-02, grad_scale: 32.0 2023-10-09 19:47:24,138 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.69 vs. limit=15.0 2023-10-09 19:47:30,939 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=112723.33333333333, ans=0.125 2023-10-09 19:47:49,726 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=112816.66666666667, ans=0.125 2023-10-09 19:48:01,497 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.92 vs. limit=12.0 2023-10-09 19:48:03,146 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=112863.33333333333, ans=0.0 2023-10-09 19:48:06,258 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.94 vs. 
limit=10.0 2023-10-09 19:48:15,252 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=112956.66666666667, ans=0.125 2023-10-09 19:48:27,389 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=113003.33333333333, ans=0.125 2023-10-09 19:48:30,978 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.559e+02 2.013e+02 2.301e+02 2.879e+02 4.795e+02, threshold=4.602e+02, percent-clipped=0.0 2023-10-09 19:49:04,015 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=113096.66666666667, ans=0.2 2023-10-09 19:49:19,346 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=113190.0, ans=0.0 2023-10-09 19:49:48,371 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=113283.33333333333, ans=0.0 2023-10-09 19:49:54,164 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=113330.0, ans=0.2 2023-10-09 19:50:02,130 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=113330.0, ans=0.125 2023-10-09 19:50:02,135 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=113330.0, ans=0.125 2023-10-09 19:50:25,308 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=113470.0, ans=0.0 2023-10-09 19:50:28,260 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.586e+02 2.095e+02 2.369e+02 2.692e+02 5.169e+02, threshold=4.738e+02, percent-clipped=1.0 2023-10-09 19:50:31,402 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=113470.0, ans=0.2 2023-10-09 19:50:34,278 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=113470.0, ans=0.125 2023-10-09 19:50:36,985 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.83 vs. 
limit=15.0 2023-10-09 19:50:52,122 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=113563.33333333333, ans=0.2 2023-10-09 19:51:01,835 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=113610.0, ans=0.125 2023-10-09 19:51:02,784 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=113610.0, ans=0.0 2023-10-09 19:51:04,496 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=113610.0, ans=0.0 2023-10-09 19:51:06,306 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=113610.0, ans=0.04949747468305833 2023-10-09 19:51:09,704 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 19:51:39,349 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=113750.0, ans=0.09899494936611666 2023-10-09 19:51:54,039 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=113796.66666666667, ans=0.125 2023-10-09 19:51:57,554 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=113843.33333333333, ans=0.125 2023-10-09 19:51:57,647 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=113843.33333333333, ans=0.1 2023-10-09 19:51:58,517 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=113843.33333333333, ans=0.1 2023-10-09 19:51:59,529 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=113843.33333333333, ans=0.125 2023-10-09 19:52:03,478 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.87 vs. 
limit=22.5 2023-10-09 19:52:10,378 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=113890.0, ans=0.0 2023-10-09 19:52:21,751 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.656e+02 2.067e+02 2.328e+02 2.705e+02 6.123e+02, threshold=4.656e+02, percent-clipped=1.0 2023-10-09 19:52:31,964 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=113983.33333333333, ans=0.125 2023-10-09 19:52:35,232 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=113983.33333333333, ans=0.125 2023-10-09 19:52:56,874 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=114076.66666666667, ans=0.125 2023-10-09 19:53:51,960 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=114310.0, ans=0.2 2023-10-09 19:53:58,196 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=114356.66666666667, ans=0.0 2023-10-09 19:54:03,631 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=114356.66666666667, ans=0.0 2023-10-09 19:54:09,557 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.704e+02 2.018e+02 2.376e+02 2.691e+02 4.537e+02, threshold=4.752e+02, percent-clipped=0.0 2023-10-09 19:54:29,235 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.39 vs. limit=6.0 2023-10-09 19:54:33,948 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=114496.66666666667, ans=0.1 2023-10-09 19:55:37,990 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=114776.66666666667, ans=0.04949747468305833 2023-10-09 19:55:47,099 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=114823.33333333333, ans=0.0 2023-10-09 19:55:57,343 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.536e+02 1.956e+02 2.232e+02 2.602e+02 3.731e+02, threshold=4.465e+02, percent-clipped=0.0 2023-10-09 19:56:19,090 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=114963.33333333333, ans=0.1 2023-10-09 19:56:19,315 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.55 vs. limit=15.0 2023-10-09 19:56:23,964 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=114963.33333333333, ans=0.125 2023-10-09 19:56:26,911 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=115010.0, ans=0.125 2023-10-09 19:56:37,492 INFO [train.py:1031] (3/4) Epoch 2, batch 11000, loss[loss=0.2647, simple_loss=0.3511, pruned_loss=0.08911, over 16886.00 frames. ], tot_loss[loss=0.2952, simple_loss=0.3626, pruned_loss=0.1139, over 32693135.21 frames. 
], batch size: 104, lr: 2.04e-02, grad_scale: 32.0 2023-10-09 19:56:40,033 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.22 vs. limit=6.0 2023-10-09 19:56:43,235 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=115056.66666666667, ans=0.125 2023-10-09 19:56:43,987 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=115056.66666666667, ans=0.2 2023-10-09 19:57:06,756 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=115150.0, ans=0.05 2023-10-09 19:57:17,909 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=115196.66666666667, ans=0.0 2023-10-09 19:57:22,024 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=115243.33333333333, ans=0.125 2023-10-09 19:57:28,360 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=115243.33333333333, ans=0.125 2023-10-09 19:57:28,570 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=14.57 vs. limit=22.5 2023-10-09 19:57:49,631 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.803e+02 2.272e+02 2.627e+02 3.107e+02 4.248e+02, threshold=5.254e+02, percent-clipped=0.0 2023-10-09 19:57:53,546 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=115336.66666666667, ans=0.125 2023-10-09 19:58:39,919 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=115523.33333333333, ans=0.1 2023-10-09 19:58:54,851 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=115570.0, ans=0.1 2023-10-09 19:59:14,915 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=115663.33333333333, ans=0.2 2023-10-09 19:59:40,684 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=115756.66666666667, ans=0.0 2023-10-09 19:59:40,808 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=115756.66666666667, ans=0.125 2023-10-09 19:59:49,854 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.515e+02 2.048e+02 2.276e+02 2.586e+02 3.586e+02, threshold=4.553e+02, percent-clipped=0.0 2023-10-09 19:59:59,581 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=115850.0, ans=10.0 2023-10-09 20:00:14,326 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=115896.66666666667, ans=0.125 2023-10-09 20:00:36,183 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.24 vs. 
limit=15.0 2023-10-09 20:01:04,609 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=116130.0, ans=0.0 2023-10-09 20:01:18,473 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=116176.66666666667, ans=0.125 2023-10-09 20:01:25,120 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=116223.33333333333, ans=0.2 2023-10-09 20:01:36,843 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.567e+02 2.054e+02 2.344e+02 2.826e+02 4.170e+02, threshold=4.688e+02, percent-clipped=0.0 2023-10-09 20:01:43,939 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=116270.0, ans=0.1 2023-10-09 20:01:45,300 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=116270.0, ans=0.2 2023-10-09 20:01:54,859 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=116316.66666666667, ans=0.125 2023-10-09 20:02:07,766 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=116363.33333333333, ans=0.0 2023-10-09 20:02:08,204 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.38 vs. limit=15.0 2023-10-09 20:02:13,406 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=116410.0, ans=0.0 2023-10-09 20:02:23,124 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=116410.0, ans=0.125 2023-10-09 20:02:23,963 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=116456.66666666667, ans=0.0 2023-10-09 20:02:27,017 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=116456.66666666667, ans=0.125 2023-10-09 20:02:34,017 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=116456.66666666667, ans=0.125 2023-10-09 20:02:39,116 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.84 vs. limit=12.0 2023-10-09 20:02:41,259 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=116503.33333333333, ans=0.0 2023-10-09 20:02:52,462 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.38 vs. limit=15.0 2023-10-09 20:02:53,119 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=116550.0, ans=0.1 2023-10-09 20:03:09,814 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=116643.33333333333, ans=0.0 2023-10-09 20:03:13,287 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.17 vs. 
limit=15.0 2023-10-09 20:03:30,252 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=116690.0, ans=0.125 2023-10-09 20:03:37,686 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.591e+02 1.968e+02 2.234e+02 2.630e+02 3.849e+02, threshold=4.469e+02, percent-clipped=0.0 2023-10-09 20:03:39,034 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=116736.66666666667, ans=0.0 2023-10-09 20:03:44,037 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=116783.33333333333, ans=0.125 2023-10-09 20:03:49,606 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=116783.33333333333, ans=0.0 2023-10-09 20:03:52,338 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=116783.33333333333, ans=0.125 2023-10-09 20:04:08,905 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=116876.66666666667, ans=0.2 2023-10-09 20:04:08,907 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=116876.66666666667, ans=0.1 2023-10-09 20:04:38,137 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=116970.0, ans=0.0 2023-10-09 20:04:42,762 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=117016.66666666667, ans=0.0 2023-10-09 20:04:56,912 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 20:05:34,528 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.49 vs. limit=6.0 2023-10-09 20:05:34,697 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.743e+02 2.222e+02 2.512e+02 2.981e+02 4.436e+02, threshold=5.024e+02, percent-clipped=0.0 2023-10-09 20:05:59,069 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=117296.66666666667, ans=0.125 2023-10-09 20:05:59,977 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=117296.66666666667, ans=0.0 2023-10-09 20:06:21,499 INFO [train.py:1031] (3/4) Epoch 2, batch 11500, loss[loss=0.3141, simple_loss=0.3795, pruned_loss=0.1243, over 16470.00 frames. ], tot_loss[loss=0.2938, simple_loss=0.3615, pruned_loss=0.1131, over 32732618.20 frames. ], batch size: 266, lr: 2.02e-02, grad_scale: 16.0 2023-10-09 20:07:07,462 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=117576.66666666667, ans=0.09899494936611666 2023-10-09 20:07:13,376 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=117576.66666666667, ans=0.0 2023-10-09 20:07:27,031 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.96 vs. 
limit=15.0 2023-10-09 20:07:34,233 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=117670.0, ans=0.07 2023-10-09 20:07:39,131 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.538e+02 2.136e+02 2.363e+02 2.771e+02 4.950e+02, threshold=4.727e+02, percent-clipped=0.0 2023-10-09 20:08:02,180 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=117763.33333333333, ans=0.125 2023-10-09 20:08:10,180 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.59 vs. limit=15.0 2023-10-09 20:08:45,102 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=117903.33333333333, ans=10.0 2023-10-09 20:09:24,962 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=118090.0, ans=0.0 2023-10-09 20:09:30,977 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=118136.66666666667, ans=0.2 2023-10-09 20:09:33,078 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=118136.66666666667, ans=0.125 2023-10-09 20:09:35,826 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.553e+02 1.974e+02 2.221e+02 2.547e+02 3.826e+02, threshold=4.443e+02, percent-clipped=0.0 2023-10-09 20:09:37,768 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-09 20:10:07,181 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.32 vs. limit=15.0 2023-10-09 20:10:12,918 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=118276.66666666667, ans=0.125 2023-10-09 20:10:19,842 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=118323.33333333333, ans=0.0 2023-10-09 20:10:26,637 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.21 vs. limit=12.0 2023-10-09 20:10:46,994 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.53 vs. limit=6.0 2023-10-09 20:10:52,530 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.88 vs. limit=15.0 2023-10-09 20:10:55,293 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.63 vs. limit=6.0 2023-10-09 20:11:04,177 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=118510.0, ans=0.125 2023-10-09 20:11:06,344 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=118510.0, ans=0.025 2023-10-09 20:11:18,571 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.86 vs. 
limit=6.0 2023-10-09 20:11:22,788 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=118556.66666666667, ans=0.125 2023-10-09 20:11:35,999 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=118603.33333333333, ans=0.125 2023-10-09 20:11:37,886 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.667e+02 2.082e+02 2.308e+02 2.745e+02 4.017e+02, threshold=4.616e+02, percent-clipped=0.0 2023-10-09 20:11:49,704 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=118650.0, ans=0.05 2023-10-09 20:12:24,089 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=25.59 vs. limit=22.5 2023-10-09 20:12:28,526 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=118790.0, ans=0.0 2023-10-09 20:12:54,668 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=118883.33333333333, ans=0.0 2023-10-09 20:13:03,803 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=118930.0, ans=0.0 2023-10-09 20:13:06,182 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=118930.0, ans=0.125 2023-10-09 20:13:08,808 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.71 vs. limit=12.0 2023-10-09 20:13:44,224 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.538e+02 2.042e+02 2.316e+02 2.818e+02 4.963e+02, threshold=4.633e+02, percent-clipped=2.0 2023-10-09 20:13:48,035 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=119070.0, ans=0.125 2023-10-09 20:13:50,464 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=119116.66666666667, ans=0.0 2023-10-09 20:14:14,598 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=119163.33333333333, ans=0.1 2023-10-09 20:14:16,008 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=119163.33333333333, ans=0.125 2023-10-09 20:14:22,271 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=119210.0, ans=0.07 2023-10-09 20:14:25,009 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=119210.0, ans=0.0 2023-10-09 20:14:29,077 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.07 vs. 
limit=15.0 2023-10-09 20:15:25,441 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 20:15:40,864 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=119536.66666666667, ans=0.2 2023-10-09 20:15:44,893 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.491e+02 2.049e+02 2.330e+02 2.616e+02 3.748e+02, threshold=4.659e+02, percent-clipped=0.0 2023-10-09 20:16:13,637 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=119676.66666666667, ans=0.125 2023-10-09 20:16:24,580 INFO [train.py:1031] (3/4) Epoch 2, batch 12000, loss[loss=0.2715, simple_loss=0.3154, pruned_loss=0.1138, over 12889.00 frames. ], tot_loss[loss=0.2921, simple_loss=0.3606, pruned_loss=0.1118, over 32773724.35 frames. ], batch size: 440, lr: 2.00e-02, grad_scale: 32.0 2023-10-09 20:16:43,322 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=119770.0, ans=0.0 2023-10-09 20:16:51,858 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.12 vs. limit=22.5 2023-10-09 20:16:58,224 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=119863.33333333333, ans=0.125 2023-10-09 20:17:21,997 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.29 vs. limit=6.0 2023-10-09 20:17:39,049 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.553e+02 2.021e+02 2.208e+02 2.660e+02 4.127e+02, threshold=4.416e+02, percent-clipped=0.0 2023-10-09 20:17:47,731 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=120050.0, ans=0.125 2023-10-09 20:18:05,071 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=120096.66666666667, ans=0.125 2023-10-09 20:18:08,906 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=120143.33333333333, ans=0.1 2023-10-09 20:18:18,271 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=120143.33333333333, ans=0.125 2023-10-09 20:18:18,315 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=120143.33333333333, ans=0.125 2023-10-09 20:18:35,899 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=120236.66666666667, ans=0.0 2023-10-09 20:19:07,138 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.04 vs. 
limit=12.0 2023-10-09 20:19:07,921 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=120376.66666666667, ans=0.2 2023-10-09 20:19:29,988 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=120423.33333333333, ans=0.125 2023-10-09 20:19:34,056 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=120470.0, ans=0.125 2023-10-09 20:19:35,650 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.634e+02 1.934e+02 2.225e+02 2.573e+02 3.904e+02, threshold=4.450e+02, percent-clipped=0.0 2023-10-09 20:19:44,886 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=120516.66666666667, ans=0.035 2023-10-09 20:19:53,909 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.03 vs. limit=15.0 2023-10-09 20:19:56,177 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.96 vs. limit=12.0 2023-10-09 20:19:57,096 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=120563.33333333333, ans=0.09899494936611666 2023-10-09 20:19:59,081 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=120563.33333333333, ans=0.07 2023-10-09 20:20:16,120 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=120656.66666666667, ans=0.2 2023-10-09 20:20:32,816 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=120703.33333333333, ans=0.0 2023-10-09 20:20:42,871 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=120750.0, ans=0.0 2023-10-09 20:20:46,990 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=120750.0, ans=0.1 2023-10-09 20:20:50,604 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=120796.66666666667, ans=0.0 2023-10-09 20:21:00,604 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=120843.33333333333, ans=0.125 2023-10-09 20:21:27,214 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.671e+02 2.151e+02 2.507e+02 3.017e+02 4.295e+02, threshold=5.014e+02, percent-clipped=0.0 2023-10-09 20:21:29,559 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=120936.66666666667, ans=0.125 2023-10-09 20:21:36,036 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=120983.33333333333, ans=0.09899494936611666 2023-10-09 20:21:49,380 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=121030.0, ans=0.1 2023-10-09 20:21:49,411 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=121030.0, ans=10.0 2023-10-09 20:21:59,727 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=121076.66666666667, ans=0.0 2023-10-09 20:22:08,518 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=121123.33333333333, ans=0.125 2023-10-09 20:22:09,406 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 20:22:12,465 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.02 vs. limit=15.0 2023-10-09 20:22:17,194 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.06 vs. limit=15.0 2023-10-09 20:22:18,798 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=121170.0, ans=0.2 2023-10-09 20:22:33,179 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=121216.66666666667, ans=0.125 2023-10-09 20:22:37,167 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=121216.66666666667, ans=0.025 2023-10-09 20:23:05,173 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=121356.66666666667, ans=0.125 2023-10-09 20:23:11,694 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=121356.66666666667, ans=0.125 2023-10-09 20:23:23,553 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.667e+02 1.964e+02 2.147e+02 2.424e+02 3.507e+02, threshold=4.294e+02, percent-clipped=0.0 2023-10-09 20:23:31,853 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=121450.0, ans=0.1 2023-10-09 20:23:59,946 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=121543.33333333333, ans=0.0 2023-10-09 20:24:18,118 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.46 vs. limit=12.0 2023-10-09 20:24:21,934 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=121636.66666666667, ans=0.5 2023-10-09 20:24:30,182 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.77 vs. limit=10.0 2023-10-09 20:24:39,850 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=121730.0, ans=0.05 2023-10-09 20:24:40,797 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=121730.0, ans=0.125 2023-10-09 20:24:56,080 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.19 vs. 
limit=15.0 2023-10-09 20:25:05,718 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=121823.33333333333, ans=0.125 2023-10-09 20:25:22,844 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.609e+02 2.118e+02 2.431e+02 2.818e+02 3.970e+02, threshold=4.862e+02, percent-clipped=0.0 2023-10-09 20:25:23,179 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=121870.0, ans=0.0 2023-10-09 20:25:23,390 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.45 vs. limit=15.0 2023-10-09 20:25:39,511 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=121963.33333333333, ans=0.0 2023-10-09 20:25:41,911 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.16 vs. limit=15.0 2023-10-09 20:25:59,034 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.46 vs. limit=6.0 2023-10-09 20:26:04,614 INFO [train.py:1031] (3/4) Epoch 2, batch 12500, loss[loss=0.2712, simple_loss=0.3512, pruned_loss=0.09563, over 16904.00 frames. ], tot_loss[loss=0.2914, simple_loss=0.3597, pruned_loss=0.1115, over 32765585.96 frames. ], batch size: 130, lr: 1.99e-02, grad_scale: 32.0 2023-10-09 20:26:05,033 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=122056.66666666667, ans=0.2 2023-10-09 20:26:33,981 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=122150.0, ans=0.125 2023-10-09 20:27:08,222 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.71 vs. 
limit=15.0 2023-10-09 20:27:08,947 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=122336.66666666667, ans=0.2 2023-10-09 20:27:14,097 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.585e+02 1.988e+02 2.265e+02 2.583e+02 3.799e+02, threshold=4.530e+02, percent-clipped=0.0 2023-10-09 20:27:26,163 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=122383.33333333333, ans=0.2 2023-10-09 20:27:39,912 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=122430.0, ans=15.0 2023-10-09 20:27:54,255 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=122476.66666666667, ans=0.125 2023-10-09 20:27:55,165 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=122476.66666666667, ans=0.025 2023-10-09 20:27:55,290 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=122476.66666666667, ans=0.125 2023-10-09 20:28:11,437 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=122570.0, ans=0.0 2023-10-09 20:28:14,824 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=122570.0, ans=0.2 2023-10-09 20:28:19,192 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=122616.66666666667, ans=0.125 2023-10-09 20:28:40,455 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=122663.33333333333, ans=0.0 2023-10-09 20:28:58,117 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=122756.66666666667, ans=0.125 2023-10-09 20:29:13,076 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.586e+02 1.990e+02 2.305e+02 2.591e+02 3.767e+02, threshold=4.610e+02, percent-clipped=0.0 2023-10-09 20:29:20,776 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=122850.0, ans=0.125 2023-10-09 20:29:20,900 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=122850.0, ans=0.0 2023-10-09 20:29:40,665 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=122896.66666666667, ans=0.0 2023-10-09 20:29:43,658 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=122943.33333333333, ans=0.2 2023-10-09 20:29:43,939 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.21 vs. 
limit=22.5 2023-10-09 20:29:59,101 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=122990.0, ans=0.125 2023-10-09 20:30:28,048 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=123130.0, ans=0.0 2023-10-09 20:31:08,108 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.636e+02 2.040e+02 2.323e+02 2.740e+02 4.359e+02, threshold=4.647e+02, percent-clipped=0.0 2023-10-09 20:31:08,757 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.44 vs. limit=22.5 2023-10-09 20:31:10,592 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=123270.0, ans=0.0 2023-10-09 20:31:12,332 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=123270.0, ans=0.0 2023-10-09 20:31:39,612 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=123410.0, ans=0.0 2023-10-09 20:31:55,912 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=123456.66666666667, ans=0.0 2023-10-09 20:32:00,813 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=123503.33333333333, ans=0.125 2023-10-09 20:32:10,326 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=123503.33333333333, ans=0.125 2023-10-09 20:32:26,917 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.68 vs. limit=6.0 2023-10-09 20:32:39,902 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.max_abs, batch_count=123643.33333333333, ans=10.0 2023-10-09 20:33:08,187 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.646e+02 2.059e+02 2.332e+02 2.748e+02 3.985e+02, threshold=4.664e+02, percent-clipped=0.0 2023-10-09 20:33:35,090 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=123830.0, ans=0.2 2023-10-09 20:33:43,191 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=123876.66666666667, ans=0.125 2023-10-09 20:33:47,873 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=123876.66666666667, ans=0.0 2023-10-09 20:34:03,880 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.74 vs. limit=22.5 2023-10-09 20:34:26,715 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=124063.33333333333, ans=0.125 2023-10-09 20:34:30,966 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=124063.33333333333, ans=0.0 2023-10-09 20:34:41,436 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.13 vs. 
limit=15.0 2023-10-09 20:34:46,245 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=124110.0, ans=0.125 2023-10-09 20:34:54,744 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=124156.66666666667, ans=0.125 2023-10-09 20:35:09,310 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.606e+02 2.043e+02 2.308e+02 2.671e+02 3.646e+02, threshold=4.616e+02, percent-clipped=0.0 2023-10-09 20:35:33,397 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=124296.66666666667, ans=0.125 2023-10-09 20:35:33,403 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=124296.66666666667, ans=0.07 2023-10-09 20:35:47,059 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=124343.33333333333, ans=0.125 2023-10-09 20:35:50,382 INFO [train.py:1031] (3/4) Epoch 2, batch 13000, loss[loss=0.2974, simple_loss=0.3654, pruned_loss=0.1147, over 16938.00 frames. ], tot_loss[loss=0.2914, simple_loss=0.3601, pruned_loss=0.1114, over 32776834.15 frames. ], batch size: 138, lr: 1.97e-02, grad_scale: 32.0 2023-10-09 20:36:08,196 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=124436.66666666667, ans=0.125 2023-10-09 20:36:15,300 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=124436.66666666667, ans=0.125 2023-10-09 20:36:37,952 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.53 vs. limit=15.0 2023-10-09 20:36:46,434 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 20:36:56,196 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=124576.66666666667, ans=0.2 2023-10-09 20:37:00,627 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=124623.33333333333, ans=0.0 2023-10-09 20:37:02,172 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.91 vs. limit=15.0 2023-10-09 20:37:18,095 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.516e+02 1.942e+02 2.218e+02 2.568e+02 3.708e+02, threshold=4.436e+02, percent-clipped=0.0 2023-10-09 20:37:21,576 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.62 vs. 
limit=6.0 2023-10-09 20:37:35,382 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=124716.66666666667, ans=0.125 2023-10-09 20:37:42,257 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=124763.33333333333, ans=0.0 2023-10-09 20:37:50,482 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=124810.0, ans=0.125 2023-10-09 20:37:58,385 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=124810.0, ans=0.0 2023-10-09 20:38:36,699 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=124950.0, ans=0.2 2023-10-09 20:39:06,497 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.59 vs. limit=15.0 2023-10-09 20:39:07,464 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.52 vs. limit=15.0 2023-10-09 20:39:18,789 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.592e+02 2.039e+02 2.307e+02 2.809e+02 4.569e+02, threshold=4.613e+02, percent-clipped=1.0 2023-10-09 20:40:06,017 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.56 vs. limit=15.0 2023-10-09 20:40:27,043 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=125416.66666666667, ans=0.2 2023-10-09 20:40:27,128 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 20:40:27,143 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=125416.66666666667, ans=0.0 2023-10-09 20:40:43,122 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.97 vs. limit=6.0 2023-10-09 20:40:49,713 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=125510.0, ans=0.0 2023-10-09 20:40:53,857 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=125510.0, ans=0.125 2023-10-09 20:41:16,188 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.555e+02 2.138e+02 2.441e+02 2.956e+02 4.858e+02, threshold=4.881e+02, percent-clipped=1.0 2023-10-09 20:41:31,602 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.49 vs. limit=15.0 2023-10-09 20:41:45,070 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=125696.66666666667, ans=0.0 2023-10-09 20:41:45,984 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=125743.33333333333, ans=0.0 2023-10-09 20:41:51,987 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.12 vs. 
limit=15.0 2023-10-09 20:41:55,179 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=125743.33333333333, ans=0.0 2023-10-09 20:42:12,553 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=125836.66666666667, ans=0.2 2023-10-09 20:42:16,495 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=125836.66666666667, ans=0.0 2023-10-09 20:42:32,929 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=125930.0, ans=0.1 2023-10-09 20:42:54,501 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.71 vs. limit=15.0 2023-10-09 20:42:57,374 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 20:43:06,137 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.546e+02 2.039e+02 2.341e+02 2.618e+02 4.220e+02, threshold=4.682e+02, percent-clipped=0.0 2023-10-09 20:43:08,298 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=126070.0, ans=0.0 2023-10-09 20:43:30,652 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=126163.33333333333, ans=0.125 2023-10-09 20:43:40,369 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=126210.0, ans=0.1 2023-10-09 20:44:39,638 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.98 vs. limit=15.0 2023-10-09 20:44:39,841 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.69 vs. limit=15.0 2023-10-09 20:44:42,469 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.18 vs. limit=15.0 2023-10-09 20:44:55,161 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.421e+02 1.995e+02 2.342e+02 2.736e+02 3.888e+02, threshold=4.684e+02, percent-clipped=0.0 2023-10-09 20:45:01,731 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=126583.33333333333, ans=0.0 2023-10-09 20:45:06,886 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.87 vs. limit=15.0 2023-10-09 20:45:09,319 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=126583.33333333333, ans=0.07 2023-10-09 20:45:17,540 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.09 vs. limit=6.0 2023-10-09 20:45:23,280 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.80 vs. 
limit=15.0 2023-10-09 20:45:25,704 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=126676.66666666667, ans=0.125 2023-10-09 20:45:30,004 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.72 vs. limit=6.0 2023-10-09 20:45:34,076 INFO [train.py:1031] (3/4) Epoch 2, batch 13500, loss[loss=0.2646, simple_loss=0.3117, pruned_loss=0.1087, over 12702.00 frames. ], tot_loss[loss=0.2896, simple_loss=0.3586, pruned_loss=0.1104, over 32808513.53 frames. ], batch size: 440, lr: 1.95e-02, grad_scale: 64.0 2023-10-09 20:45:38,937 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=126723.33333333333, ans=0.2 2023-10-09 20:45:41,632 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=126723.33333333333, ans=0.2 2023-10-09 20:45:47,194 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=126770.0, ans=0.0 2023-10-09 20:46:06,586 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=126816.66666666667, ans=0.125 2023-10-09 20:46:13,572 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=126863.33333333333, ans=0.125 2023-10-09 20:46:15,009 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.42 vs. limit=10.0 2023-10-09 20:46:33,399 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=126956.66666666667, ans=0.125 2023-10-09 20:46:41,730 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.69 vs. limit=15.0 2023-10-09 20:46:46,114 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.596e+02 2.076e+02 2.424e+02 2.712e+02 4.296e+02, threshold=4.848e+02, percent-clipped=0.0 2023-10-09 20:47:17,052 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=127143.33333333333, ans=0.0 2023-10-09 20:47:30,870 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.81 vs. limit=15.0 2023-10-09 20:47:38,508 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=127236.66666666667, ans=0.125 2023-10-09 20:47:40,606 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.04 vs. 
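On the per-batch loss lines: loss, simple_loss, and pruned_loss come from the pruned-RNN-T objective, in which a cheap linear-joiner ("simple") loss both regularizes training and supplies the alignment bounds for the pruned full loss. The reported loss is a weighted blend; the weights in the sketch below reproduce the logged numbers (0.5 * 0.3117 + 0.1087 = 0.26455 for the batch 13500 record above), while the warm-up ramp is an illustration of what icefall's train.py does early on, not a copy of it.

def combine_losses(simple_loss, pruned_loss, batch_idx_train,
                   warm_step=2000, simple_loss_scale=0.5):
    # Past warm-up the blend is fixed; the early ramp below is illustrative.
    if batch_idx_train >= warm_step:
        s, p = simple_loss_scale, 1.0
    else:
        t = batch_idx_train / warm_step
        s = 1.0 - t * (1.0 - simple_loss_scale)  # lean on the simple loss early
        p = 0.1 + 0.9 * t                        # trust pruned bounds gradually
    return s * simple_loss + p * pruned_loss


# The batch logged above, well past warm-up:
print(combine_losses(0.3117, 0.1087, batch_idx_train=30_000))
# 0.5 * 0.3117 + 0.1087 = 0.26455, matching the logged loss=0.2646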
limit=15.0 2023-10-09 20:47:46,140 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=127283.33333333333, ans=0.0 2023-10-09 20:47:46,915 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=127283.33333333333, ans=0.125 2023-10-09 20:47:55,501 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=127330.0, ans=0.125 2023-10-09 20:48:00,925 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=127330.0, ans=15.0 2023-10-09 20:48:06,715 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=127376.66666666667, ans=0.07 2023-10-09 20:48:11,882 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=127376.66666666667, ans=0.0 2023-10-09 20:48:13,533 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=127423.33333333333, ans=0.125 2023-10-09 20:48:49,827 INFO [train.py:1031] (3/4) Epoch 3, batch 0, loss[loss=0.3162, simple_loss=0.3696, pruned_loss=0.1314, over 16054.00 frames. ], tot_loss[loss=0.3162, simple_loss=0.3696, pruned_loss=0.1314, over 16054.00 frames. ], batch size: 296, lr: 1.55e-02, grad_scale: 32.0 2023-10-09 20:48:49,828 INFO [train.py:1054] (3/4) Computing validation loss 2023-10-09 20:49:03,950 INFO [train.py:1063] (3/4) Epoch 3, validation: loss=0.2699, simple_loss=0.3526, pruned_loss=0.09359, over 1020973.00 frames. 2023-10-09 20:49:03,952 INFO [train.py:1064] (3/4) Maximum memory allocated so far is 16891MB 2023-10-09 20:49:07,239 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=127446.66666666667, ans=0.0 2023-10-09 20:49:09,905 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=127446.66666666667, ans=0.0 2023-10-09 20:49:17,695 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.76 vs. limit=15.0 2023-10-09 20:49:18,693 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.809e+02 2.087e+02 2.261e+02 2.637e+02 4.213e+02, threshold=4.522e+02, percent-clipped=0.0 2023-10-09 20:49:21,370 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.49 vs. limit=12.0 2023-10-09 20:49:25,513 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=127493.33333333333, ans=0.125 2023-10-09 20:49:28,635 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.28 vs. limit=22.5 2023-10-09 20:49:54,173 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.92 vs. 
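The grad_scale field tracks mixed-precision loss scaling for this fp16 run. PyTorch's GradScaler multiplies the loss by the scale before backward, unscales before the optimizer step, doubles the scale after a long run of overflow-free steps, and halves it whenever inf/nan gradients appear, which is why the value drifts between 16.0 and 64.0 across this section. A minimal sketch with the stock torch.cuda.amp API follows; the model and data are placeholders, and a CUDA device is assumed, as in this run.

import torch

model = torch.nn.Linear(80, 500).cuda()
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler(init_scale=1.0)  # the run starts at 1.0

for step in range(100):
    x = torch.randn(8, 80, device="cuda")
    with torch.cuda.amp.autocast():
        loss = model(x).pow(2).mean()
    opt.zero_grad()
    scaler.scale(loss).backward()   # backward on the scaled loss
    scaler.step(opt)                # unscales, skips the step on inf/nan
    scaler.update()                 # grows or backs off the scale
    # scaler.get_scale() corresponds to the logged grad_scale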
limit=6.0 2023-10-09 20:50:11,560 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=127680.0, ans=0.125 2023-10-09 20:50:23,557 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=127726.66666666667, ans=0.125 2023-10-09 20:50:25,559 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=127726.66666666667, ans=0.125 2023-10-09 20:50:27,969 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=127773.33333333333, ans=0.125 2023-10-09 20:50:50,713 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=127866.66666666667, ans=0.2 2023-10-09 20:51:12,944 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.547e+02 1.859e+02 2.073e+02 2.445e+02 3.112e+02, threshold=4.147e+02, percent-clipped=0.0 2023-10-09 20:51:31,066 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=128006.66666666667, ans=0.2 2023-10-09 20:51:52,397 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=128100.0, ans=0.125 2023-10-09 20:51:56,535 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.46 vs. limit=15.0 2023-10-09 20:52:15,624 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.09 vs. limit=15.0 2023-10-09 20:52:21,302 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=128240.0, ans=0.5 2023-10-09 20:52:25,546 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=128240.0, ans=0.2 2023-10-09 20:52:37,019 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=128286.66666666667, ans=0.2 2023-10-09 20:52:42,985 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=128333.33333333333, ans=15.0 2023-10-09 20:52:51,536 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=128333.33333333333, ans=0.125 2023-10-09 20:53:00,364 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=128380.0, ans=0.125 2023-10-09 20:53:04,215 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.538e+02 1.883e+02 2.159e+02 2.558e+02 4.049e+02, threshold=4.317e+02, percent-clipped=0.0 2023-10-09 20:53:27,956 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=128520.0, ans=0.0 2023-10-09 20:53:40,138 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=128566.66666666667, ans=0.125 2023-10-09 20:54:09,445 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.51 vs. 
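On the grad-norm quartiles lines: the optimizer keeps a history of recent gradient norms, logs its min/25%/50%/75%/max summary, and clips against a threshold derived from that history. Throughout this section the logged threshold is Clipping_scale (2.0) times the 50% quartile, e.g. 4.613e+02 vs. a median of 2.307e+02 above. The sketch below captures that relationship; the buffering and per-parameter-group handling in icefall's optim.py are more involved, and the class here is illustrative.

from collections import deque

import torch


class GradNormClipper:
    def __init__(self, clipping_scale=2.0, history=128):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=history)  # recent per-step gradient norms

    def clip_(self, params):  # params: a list of tensors with .grad set
        norm = torch.nn.utils.clip_grad_norm_(params, float("inf"))  # measure only
        self.norms.append(float(norm))
        q = sorted(self.norms)
        quartiles = [q[int(p * (len(q) - 1))] for p in (0.0, 0.25, 0.5, 0.75, 1.0)]
        threshold = self.clipping_scale * quartiles[2]  # 2 x running median
        if float(norm) > threshold:  # such steps are counted in percent-clipped
            for p in params:
                if p.grad is not None:
                    p.grad.mul_(threshold / float(norm))
        return quartiles, threshold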
limit=22.5 2023-10-09 20:54:19,907 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.36 vs. limit=22.5 2023-10-09 20:54:28,916 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.04 vs. limit=15.0 2023-10-09 20:54:43,628 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=128800.0, ans=0.025 2023-10-09 20:54:47,552 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=128846.66666666667, ans=0.125 2023-10-09 20:54:47,969 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.54 vs. limit=22.5 2023-10-09 20:55:00,014 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.487e+02 1.873e+02 2.047e+02 2.304e+02 3.922e+02, threshold=4.094e+02, percent-clipped=0.0 2023-10-09 20:55:04,976 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=128893.33333333333, ans=0.125 2023-10-09 20:55:15,785 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=128940.0, ans=0.1 2023-10-09 20:55:25,863 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=128986.66666666667, ans=0.1 2023-10-09 20:55:26,732 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=128986.66666666667, ans=0.0 2023-10-09 20:55:35,101 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=129033.33333333333, ans=0.1 2023-10-09 20:55:39,417 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=129033.33333333333, ans=0.125 2023-10-09 20:55:39,490 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=129033.33333333333, ans=0.125 2023-10-09 20:56:02,926 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=129126.66666666667, ans=0.2 2023-10-09 20:56:07,755 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=129173.33333333333, ans=0.125 2023-10-09 20:56:36,752 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=129266.66666666667, ans=0.0 2023-10-09 20:56:53,079 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=129313.33333333333, ans=0.0 2023-10-09 20:56:54,621 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.438e+02 1.916e+02 2.154e+02 2.409e+02 3.347e+02, threshold=4.309e+02, percent-clipped=0.0 2023-10-09 20:57:27,297 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.75 vs. 
limit=12.0 2023-10-09 20:57:50,724 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=129546.66666666667, ans=0.2 2023-10-09 20:58:19,740 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=129686.66666666667, ans=0.0 2023-10-09 20:58:22,272 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.75 vs. limit=15.0 2023-10-09 20:58:29,019 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=129733.33333333333, ans=0.0 2023-10-09 20:58:35,902 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=129733.33333333333, ans=0.125 2023-10-09 20:58:38,609 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=129733.33333333333, ans=0.0 2023-10-09 20:58:41,170 INFO [train.py:1031] (3/4) Epoch 3, batch 500, loss[loss=0.2592, simple_loss=0.3353, pruned_loss=0.09153, over 16877.00 frames. ], tot_loss[loss=0.2782, simple_loss=0.3503, pruned_loss=0.1031, over 7288691.71 frames. ], batch size: 116, lr: 1.54e-02, grad_scale: 32.0 2023-10-09 20:58:51,777 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.490e+02 1.875e+02 2.153e+02 2.502e+02 4.004e+02, threshold=4.305e+02, percent-clipped=0.0 2023-10-09 20:59:00,419 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.17 vs. limit=22.5 2023-10-09 20:59:18,712 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=129920.0, ans=0.1 2023-10-09 20:59:22,846 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=129920.0, ans=0.2 2023-10-09 21:00:42,863 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.617e+02 1.868e+02 2.123e+02 2.425e+02 4.212e+02, threshold=4.246e+02, percent-clipped=0.0 2023-10-09 21:00:48,185 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.66 vs. limit=15.0 2023-10-09 21:00:51,458 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=130293.33333333333, ans=6.0 2023-10-09 21:01:01,457 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.96 vs. limit=15.0 2023-10-09 21:01:03,800 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.28 vs. 
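The lr field decays smoothly with both the optimizer step count and the number of completed epochs. My reading of the Eden schedule in icefall's optim.py is the form below; with this run's settings (defaults in the sketch) it reproduces the logged values, about 1.55e-02 at the start of epoch 3 and 1.50e-02 some 2000 batches later.

def eden_lr(step, epoch, base_lr=0.045, lr_batches=7500.0, lr_epochs=1.0):
    # step: optimizer steps so far; epoch: completed epochs.
    batch_factor = ((step / lr_batches) ** 2 + 1.0) ** -0.25
    epoch_factor = ((epoch / lr_epochs) ** 2 + 1.0) ** -0.25
    return base_lr * batch_factor * epoch_factor


# Roughly 27k steps in at the start of epoch 3 (two epochs completed):
print(eden_lr(step=27_200, epoch=2))  # ~1.55e-02, matching the log above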
limit=6.0 2023-10-09 21:01:46,766 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=130526.66666666667, ans=0.125 2023-10-09 21:01:53,185 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=130573.33333333333, ans=0.1 2023-10-09 21:01:54,312 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=130573.33333333333, ans=0.125 2023-10-09 21:02:06,065 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=130620.0, ans=0.0 2023-10-09 21:02:15,970 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=130666.66666666667, ans=0.05 2023-10-09 21:02:16,834 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=130666.66666666667, ans=0.125 2023-10-09 21:02:28,994 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=130713.33333333333, ans=0.2 2023-10-09 21:02:37,908 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.570e+02 1.928e+02 2.179e+02 2.588e+02 4.693e+02, threshold=4.359e+02, percent-clipped=2.0 2023-10-09 21:02:44,112 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=130760.0, ans=0.125 2023-10-09 21:02:48,461 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=130760.0, ans=0.0 2023-10-09 21:02:55,552 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.28 vs. 
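Entries like bypass.scale_min and bypass_mid.scale_min belong to the zipformer bypass (residual-blend) modules: each layer's output is mixed with its input through a learned per-channel scale clamped to [scale_min, 1.0], and scale_min is itself scheduled, starting near 1 (forcing the layer's output through) and relaxing to the 0.2 seen in this section. A hedged sketch of that blend follows; the module and attribute names are illustrative.

import torch


class Bypass(torch.nn.Module):
    def __init__(self, num_channels, scale_min=0.9):
        super().__init__()
        self.scale = torch.nn.Parameter(torch.full((num_channels,), 0.5))
        self.scale_min = scale_min  # raised/lowered by a schedule in training

    def forward(self, x_in, x_out):
        # s = 1 keeps the layer's output; s -> 0 bypasses the layer entirely.
        s = self.scale.clamp(min=self.scale_min, max=1.0)
        return x_in + s * (x_out - x_in)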
limit=12.0 2023-10-09 21:03:08,115 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=130853.33333333333, ans=0.1 2023-10-09 21:03:14,302 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=130853.33333333333, ans=0.125 2023-10-09 21:03:27,314 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=130946.66666666667, ans=0.1 2023-10-09 21:03:32,943 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 21:03:33,928 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=130946.66666666667, ans=0.125 2023-10-09 21:03:44,257 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=130993.33333333333, ans=0.125 2023-10-09 21:03:44,274 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=130993.33333333333, ans=0.125 2023-10-09 21:03:46,722 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=130993.33333333333, ans=0.125 2023-10-09 21:03:49,421 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=131040.0, ans=0.1 2023-10-09 21:03:49,744 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=1.94 vs. limit=15.0 2023-10-09 21:03:53,889 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=131040.0, ans=0.0 2023-10-09 21:03:56,719 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=131040.0, ans=0.1 2023-10-09 21:04:15,863 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.03 vs. limit=12.0 2023-10-09 21:04:20,294 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=131133.33333333334, ans=0.2 2023-10-09 21:04:33,721 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.369e+02 1.813e+02 2.082e+02 2.326e+02 3.136e+02, threshold=4.164e+02, percent-clipped=0.0 2023-10-09 21:04:53,672 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.60 vs. limit=6.0 2023-10-09 21:05:14,194 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=131320.0, ans=0.0 2023-10-09 21:05:32,524 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.80 vs. limit=6.0 2023-10-09 21:06:22,180 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.23 vs. 
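The WithLoss lines report auxiliary penalties attached to specific tensors (here attention weights) that contribute gradient without altering the forward value; loss-sum=0.000e+00 means the penalized statistic is currently within bounds. One way to implement that attach-a-loss pattern is a custom autograd function like the sketch below, which is my illustration of the idea rather than the actual code in scaling.py.

import torch


class AttachLoss(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, aux_loss):
        ctx.save_for_backward(aux_loss)
        return x  # the forward value is untouched

    @staticmethod
    def backward(ctx, grad_out):
        (aux_loss,) = ctx.saved_tensors
        # Returning 1 as the gradient for aux_loss is equivalent to adding
        # aux_loss to the training objective.
        return grad_out, torch.ones_like(aux_loss)


# y = AttachLoss.apply(attn_weights, penalty) leaves y equal to attn_weights
# but back-propagates the penalty; the logged loss-sum is the accumulated
# penalty value, zero here because the weights are within bounds.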
limit=22.5 2023-10-09 21:06:22,863 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=131600.0, ans=0.125 2023-10-09 21:06:26,407 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=131646.66666666666, ans=0.125 2023-10-09 21:06:26,569 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.69 vs. limit=12.0 2023-10-09 21:06:39,218 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.495e+02 1.954e+02 2.283e+02 2.543e+02 4.639e+02, threshold=4.567e+02, percent-clipped=1.0 2023-10-09 21:06:48,982 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=131693.33333333334, ans=0.125 2023-10-09 21:07:14,555 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.62 vs. limit=10.0 2023-10-09 21:07:41,395 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.06 vs. limit=12.0 2023-10-09 21:07:53,091 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=131973.33333333334, ans=0.125 2023-10-09 21:08:06,873 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=132020.0, ans=0.0 2023-10-09 21:08:13,363 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=132020.0, ans=0.125 2023-10-09 21:08:26,293 INFO [train.py:1031] (3/4) Epoch 3, batch 1000, loss[loss=0.2537, simple_loss=0.3339, pruned_loss=0.08681, over 16919.00 frames. ], tot_loss[loss=0.2769, simple_loss=0.3495, pruned_loss=0.1022, over 12952056.70 frames. ], batch size: 104, lr: 1.52e-02, grad_scale: 32.0 2023-10-09 21:08:39,738 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.471e+02 1.798e+02 2.081e+02 2.413e+02 4.199e+02, threshold=4.163e+02, percent-clipped=0.0 2023-10-09 21:08:42,173 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.18 vs. limit=15.0 2023-10-09 21:08:46,533 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=132160.0, ans=0.2 2023-10-09 21:08:51,420 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=132206.66666666666, ans=10.0 2023-10-09 21:09:00,220 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=132206.66666666666, ans=0.125 2023-10-09 21:09:09,883 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.24 vs. 
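Finally, the many balancer entries (prob, min_positive, max_positive, min_abs, max_abs) refer to modules that nudge per-channel activation statistics back into configured ranges, applied stochastically with probability prob on a given batch. The sketch below shows the statistics being constrained; note the positive-fraction term is written naively and is not differentiable as-is, whereas the real module in scaling.py shapes gradients through a smooth surrogate.

import torch


def balancer_penalty(x, min_positive=0.05, max_positive=0.95,
                     min_abs=0.2, max_abs=10.0):
    # x: (num_frames, num_channels)
    frac_pos = (x > 0).float().mean(dim=0)  # naive; real module uses a
    mean_abs = x.abs().mean(dim=0)          # differentiable surrogate
    p = ((min_positive - frac_pos).clamp(min=0)
         + (frac_pos - max_positive).clamp(min=0))
    a = ((min_abs - mean_abs).clamp(min=0)
         + (mean_abs - max_abs).clamp(min=0))
    return (p + a).sum()  # zero when every channel is within bounds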
limit=10.0 2023-10-09 21:09:37,268 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=132393.33333333334, ans=0.1 2023-10-09 21:09:42,122 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=132393.33333333334, ans=0.0 2023-10-09 21:09:43,018 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=132393.33333333334, ans=0.2 2023-10-09 21:09:53,414 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=132440.0, ans=0.125 2023-10-09 21:10:05,501 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.44 vs. limit=22.5 2023-10-09 21:10:14,714 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.24 vs. limit=22.5 2023-10-09 21:10:29,469 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=132580.0, ans=0.125 2023-10-09 21:10:29,936 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.92 vs. limit=6.0 2023-10-09 21:10:36,839 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn1.whiten.whitening_limit, batch_count=132580.0, ans=22.5 2023-10-09 21:10:38,720 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.409e+02 1.820e+02 2.013e+02 2.251e+02 3.213e+02, threshold=4.025e+02, percent-clipped=0.0 2023-10-09 21:10:58,012 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.32 vs. limit=15.0 2023-10-09 21:11:13,724 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=132720.0, ans=0.2 2023-10-09 21:12:01,089 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=132906.66666666666, ans=0.125 2023-10-09 21:12:10,972 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=132906.66666666666, ans=0.125 2023-10-09 21:12:20,685 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=132953.33333333334, ans=0.125 2023-10-09 21:12:34,588 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=133000.0, ans=0.125 2023-10-09 21:12:50,156 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=133093.33333333334, ans=0.125 2023-10-09 21:12:50,779 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.382e+02 1.853e+02 2.065e+02 2.454e+02 3.688e+02, threshold=4.130e+02, percent-clipped=0.0 2023-10-09 21:13:26,886 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.18 vs. 
limit=10.0 2023-10-09 21:14:21,230 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=133420.0, ans=0.125 2023-10-09 21:14:47,113 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.422e+02 1.859e+02 2.175e+02 2.571e+02 4.066e+02, threshold=4.350e+02, percent-clipped=0.0 2023-10-09 21:15:12,816 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=133653.33333333334, ans=0.1 2023-10-09 21:15:37,450 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=133746.66666666666, ans=0.125 2023-10-09 21:16:36,536 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=133980.0, ans=0.0 2023-10-09 21:16:38,099 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=133980.0, ans=0.125 2023-10-09 21:16:38,281 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.86 vs. limit=12.0 2023-10-09 21:16:43,144 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=133980.0, ans=0.125 2023-10-09 21:16:44,634 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=133980.0, ans=0.1 2023-10-09 21:16:45,993 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=133980.0, ans=0.125 2023-10-09 21:16:50,285 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.471e+02 1.835e+02 2.094e+02 2.277e+02 3.295e+02, threshold=4.189e+02, percent-clipped=0.0 2023-10-09 21:16:57,645 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=134026.66666666666, ans=0.125 2023-10-09 21:16:57,716 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=134026.66666666666, ans=0.125 2023-10-09 21:17:08,163 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.95 vs. limit=15.0 2023-10-09 21:17:09,416 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 21:17:24,425 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.63 vs. 
limit=22.5 2023-10-09 21:17:27,562 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=134166.66666666666, ans=0.1 2023-10-09 21:17:35,943 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=134166.66666666666, ans=0.07 2023-10-09 21:18:04,943 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=134306.66666666666, ans=0.0 2023-10-09 21:18:08,340 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=134306.66666666666, ans=0.1 2023-10-09 21:18:38,842 INFO [train.py:1031] (3/4) Epoch 3, batch 1500, loss[loss=0.2449, simple_loss=0.2924, pruned_loss=0.0987, over 12671.00 frames. ], tot_loss[loss=0.2737, simple_loss=0.3466, pruned_loss=0.1004, over 17361998.09 frames. ], batch size: 440, lr: 1.51e-02, grad_scale: 16.0 2023-10-09 21:18:41,092 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=134446.66666666666, ans=0.0 2023-10-09 21:18:51,695 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.391e+02 1.804e+02 2.060e+02 2.368e+02 3.438e+02, threshold=4.120e+02, percent-clipped=0.0 2023-10-09 21:18:55,190 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.80 vs. limit=6.0 2023-10-09 21:19:17,335 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=134586.66666666666, ans=0.125 2023-10-09 21:19:28,167 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=134633.33333333334, ans=0.0 2023-10-09 21:19:31,289 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.54 vs. 
limit=15.0 2023-10-09 21:19:58,651 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=134773.33333333334, ans=0.0 2023-10-09 21:19:58,778 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=134773.33333333334, ans=0.1 2023-10-09 21:20:18,012 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=134820.0, ans=0.125 2023-10-09 21:20:24,020 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=134866.66666666666, ans=0.09899494936611666 2023-10-09 21:20:27,751 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=134866.66666666666, ans=0.2 2023-10-09 21:20:28,810 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=134866.66666666666, ans=0.125 2023-10-09 21:20:44,125 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.401e+02 1.868e+02 2.118e+02 2.410e+02 3.223e+02, threshold=4.236e+02, percent-clipped=0.0 2023-10-09 21:20:45,490 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=134960.0, ans=0.125 2023-10-09 21:20:56,819 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=135006.66666666666, ans=0.2 2023-10-09 21:20:58,281 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=135006.66666666666, ans=0.0 2023-10-09 21:21:10,387 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=135053.33333333334, ans=0.0 2023-10-09 21:21:29,117 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=135100.0, ans=0.1 2023-10-09 21:21:37,775 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=135146.66666666666, ans=0.125 2023-10-09 21:21:52,621 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 21:22:01,844 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=135240.0, ans=0.125 2023-10-09 21:22:40,010 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=135380.0, ans=0.2 2023-10-09 21:22:45,706 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.541e+02 1.954e+02 2.268e+02 2.619e+02 4.378e+02, threshold=4.536e+02, percent-clipped=1.0 2023-10-09 21:22:47,428 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 21:22:54,875 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=135473.33333333334, ans=10.0 2023-10-09 21:22:58,668 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten.whitening_limit, batch_count=135473.33333333334, ans=22.5 2023-10-09 21:23:02,609 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, 
batch_count=135473.33333333334, ans=0.2 2023-10-09 21:23:05,948 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=4.63 vs. limit=12.0 2023-10-09 21:23:13,788 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=135520.0, ans=0.125 2023-10-09 21:23:25,323 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=135566.66666666666, ans=0.0 2023-10-09 21:23:30,576 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=135613.33333333334, ans=0.2 2023-10-09 21:23:40,739 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=135660.0, ans=0.1 2023-10-09 21:23:44,669 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.76 vs. limit=15.0 2023-10-09 21:23:47,586 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=135660.0, ans=0.1 2023-10-09 21:24:13,262 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.18 vs. limit=15.0 2023-10-09 21:24:19,232 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 21:24:22,360 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=135800.0, ans=0.0 2023-10-09 21:24:28,020 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff3.min_abs, batch_count=135800.0, ans=0.2 2023-10-09 21:24:46,389 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.421e+02 1.886e+02 2.087e+02 2.417e+02 3.075e+02, threshold=4.173e+02, percent-clipped=0.0 2023-10-09 21:25:06,349 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=135986.66666666666, ans=0.0 2023-10-09 21:25:22,067 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.02 vs. limit=15.0 2023-10-09 21:25:24,462 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=136033.33333333334, ans=15.0 2023-10-09 21:25:28,552 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=136033.33333333334, ans=0.0 2023-10-09 21:25:28,696 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.71 vs. limit=15.0 2023-10-09 21:25:30,279 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=136033.33333333334, ans=0.125 2023-10-09 21:25:30,401 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=136033.33333333334, ans=0.0 2023-10-09 21:25:47,085 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.98 vs. 
limit=15.0 2023-10-09 21:25:48,988 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=136126.66666666666, ans=0.1 2023-10-09 21:26:03,625 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=136173.33333333334, ans=0.1 2023-10-09 21:26:09,265 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=136220.0, ans=0.0 2023-10-09 21:26:17,767 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=136220.0, ans=0.0 2023-10-09 21:26:32,235 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.40 vs. limit=15.0 2023-10-09 21:26:41,172 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.45 vs. limit=15.0 2023-10-09 21:26:44,816 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.497e+02 1.886e+02 2.147e+02 2.562e+02 4.596e+02, threshold=4.294e+02, percent-clipped=1.0 2023-10-09 21:26:51,701 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=136360.0, ans=0.0 2023-10-09 21:26:59,620 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=136406.66666666666, ans=0.04949747468305833 2023-10-09 21:28:03,349 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=136640.0, ans=0.0 2023-10-09 21:28:27,391 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=136733.33333333334, ans=0.05 2023-10-09 21:28:32,506 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=136733.33333333334, ans=0.125 2023-10-09 21:28:37,410 INFO [train.py:1031] (3/4) Epoch 3, batch 2000, loss[loss=0.2278, simple_loss=0.3166, pruned_loss=0.06953, over 16824.00 frames. ], tot_loss[loss=0.2738, simple_loss=0.3467, pruned_loss=0.1005, over 20744097.53 frames. ], batch size: 67, lr: 1.50e-02, grad_scale: 32.0 2023-10-09 21:28:42,388 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=136780.0, ans=0.0 2023-10-09 21:28:43,478 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.06 vs. 
limit=15.0 2023-10-09 21:28:50,708 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.517e+02 1.938e+02 2.325e+02 2.669e+02 4.307e+02, threshold=4.649e+02, percent-clipped=1.0 2023-10-09 21:28:58,108 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=136826.66666666666, ans=0.125 2023-10-09 21:29:06,752 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=136873.33333333334, ans=0.1 2023-10-09 21:29:18,922 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=136920.0, ans=0.125 2023-10-09 21:29:24,313 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=136920.0, ans=0.1 2023-10-09 21:29:54,796 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.32 vs. limit=15.0 2023-10-09 21:30:18,417 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.48 vs. limit=15.0 2023-10-09 21:30:38,616 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=137200.0, ans=0.125 2023-10-09 21:30:40,628 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=137200.0, ans=0.125 2023-10-09 21:30:53,707 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=137246.66666666666, ans=10.0 2023-10-09 21:30:53,846 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=137246.66666666666, ans=0.04949747468305833 2023-10-09 21:31:03,423 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=137293.33333333334, ans=0.1 2023-10-09 21:31:04,550 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=137293.33333333334, ans=0.125 2023-10-09 21:31:07,773 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.344e+02 1.905e+02 2.125e+02 2.441e+02 3.596e+02, threshold=4.250e+02, percent-clipped=0.0 2023-10-09 21:31:43,246 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=137386.66666666666, ans=0.1 2023-10-09 21:31:49,920 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=137433.33333333334, ans=0.1 2023-10-09 21:32:14,465 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=137526.66666666666, ans=0.125 2023-10-09 21:32:18,652 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=137526.66666666666, ans=10.0 2023-10-09 21:32:35,944 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=137573.33333333334, ans=0.125 2023-10-09 21:32:53,701 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=137666.66666666666, ans=0.125 2023-10-09 21:33:03,519 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=137713.33333333334, ans=0.1 2023-10-09 21:33:07,535 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=137713.33333333334, ans=10.0 2023-10-09 21:33:10,078 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=137713.33333333334, ans=0.2 2023-10-09 21:33:15,722 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.560e+02 1.915e+02 2.105e+02 2.534e+02 3.792e+02, threshold=4.210e+02, percent-clipped=0.0 2023-10-09 21:33:22,866 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.03 vs. limit=15.0 2023-10-09 21:33:28,249 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=137806.66666666666, ans=0.2 2023-10-09 21:33:49,496 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=137900.0, ans=0.1 2023-10-09 21:33:50,321 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=137900.0, ans=0.125 2023-10-09 21:34:12,482 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=137993.33333333334, ans=0.0 2023-10-09 21:34:17,276 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=138040.0, ans=0.125 2023-10-09 21:34:19,717 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=138040.0, ans=0.125 2023-10-09 21:34:56,084 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=138180.0, ans=0.125 2023-10-09 21:35:02,142 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.565e+02 1.965e+02 2.340e+02 2.742e+02 3.516e+02, threshold=4.680e+02, percent-clipped=0.0 2023-10-09 21:35:12,179 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 21:35:19,378 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.64 vs. 
limit=15.0 2023-10-09 21:35:20,052 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=138273.33333333334, ans=0.0 2023-10-09 21:35:37,265 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=138366.66666666666, ans=0.1 2023-10-09 21:35:37,338 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=138366.66666666666, ans=0.125 2023-10-09 21:35:41,474 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=138366.66666666666, ans=0.1 2023-10-09 21:35:42,325 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=138366.66666666666, ans=0.125 2023-10-09 21:35:51,473 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 21:36:02,387 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.14 vs. limit=12.0 2023-10-09 21:36:12,146 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=138506.66666666666, ans=0.125 2023-10-09 21:36:13,046 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=138506.66666666666, ans=0.125 2023-10-09 21:36:13,889 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=138506.66666666666, ans=0.1 2023-10-09 21:36:20,741 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=138553.33333333334, ans=0.04949747468305833 2023-10-09 21:36:31,528 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=138600.0, ans=0.125 2023-10-09 21:36:38,524 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.77 vs. limit=15.0 2023-10-09 21:36:52,966 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.03 vs. limit=6.0 2023-10-09 21:36:54,368 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.425e+02 1.892e+02 2.050e+02 2.380e+02 3.780e+02, threshold=4.101e+02, percent-clipped=0.0 2023-10-09 21:36:56,254 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=138693.33333333334, ans=0.125 2023-10-09 21:37:04,835 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=138740.0, ans=0.035 2023-10-09 21:37:09,962 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.61 vs. 
limit=22.5 2023-10-09 21:37:13,105 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 21:37:14,734 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=138786.66666666666, ans=0.125 2023-10-09 21:37:26,259 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=138833.33333333334, ans=0.125 2023-10-09 21:37:40,728 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=138880.0, ans=0.0 2023-10-09 21:37:52,039 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=138926.66666666666, ans=0.1 2023-10-09 21:37:58,553 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.45 vs. limit=15.0 2023-10-09 21:37:59,996 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=138973.33333333334, ans=0.1 2023-10-09 21:38:12,779 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.21 vs. limit=22.5 2023-10-09 21:38:27,760 INFO [train.py:1031] (3/4) Epoch 3, batch 2500, loss[loss=0.2646, simple_loss=0.3413, pruned_loss=0.09389, over 16643.00 frames. ], tot_loss[loss=0.274, simple_loss=0.3468, pruned_loss=0.1006, over 23426354.46 frames. ], batch size: 61, lr: 1.49e-02, grad_scale: 32.0 2023-10-09 21:38:40,021 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.611e+02 1.944e+02 2.156e+02 2.452e+02 4.104e+02, threshold=4.312e+02, percent-clipped=1.0 2023-10-09 21:38:44,919 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=139160.0, ans=0.125 2023-10-09 21:39:04,576 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=139253.33333333334, ans=0.125 2023-10-09 21:39:11,530 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.25 vs. limit=10.0 2023-10-09 21:39:29,992 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=139346.66666666666, ans=0.125 2023-10-09 21:39:45,086 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 21:39:45,421 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=15.87 vs. limit=15.0 2023-10-09 21:39:51,366 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=139440.0, ans=0.2 2023-10-09 21:40:02,371 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=139486.66666666666, ans=0.125 2023-10-09 21:40:07,656 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.68 vs. 
limit=10.0 2023-10-09 21:40:10,297 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=139533.33333333334, ans=0.0 2023-10-09 21:40:28,490 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=139626.66666666666, ans=0.125 2023-10-09 21:40:29,042 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.573e+02 1.821e+02 2.217e+02 2.583e+02 3.986e+02, threshold=4.434e+02, percent-clipped=0.0 2023-10-09 21:40:35,204 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=139626.66666666666, ans=0.0 2023-10-09 21:40:35,611 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.27 vs. limit=6.0 2023-10-09 21:40:55,619 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=139720.0, ans=0.0 2023-10-09 21:41:07,290 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.59 vs. limit=6.0 2023-10-09 21:41:12,911 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=139813.33333333334, ans=0.125 2023-10-09 21:41:27,332 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=139860.0, ans=0.125 2023-10-09 21:41:28,334 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=139860.0, ans=0.0 2023-10-09 21:41:51,985 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.38 vs. limit=15.0 2023-10-09 21:42:22,690 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.547e+02 1.998e+02 2.253e+02 2.659e+02 5.228e+02, threshold=4.506e+02, percent-clipped=1.0 2023-10-09 21:42:45,658 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.71 vs. limit=12.0 2023-10-09 21:43:13,384 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.68 vs. limit=15.0 2023-10-09 21:43:14,849 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.44 vs. 
limit=10.0 2023-10-09 21:43:19,872 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=140326.66666666666, ans=0.2 2023-10-09 21:43:39,872 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_positive, batch_count=140373.33333333334, ans=0.05 2023-10-09 21:44:16,896 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=140513.33333333334, ans=0.125 2023-10-09 21:44:22,512 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.468e+02 1.829e+02 2.023e+02 2.246e+02 3.565e+02, threshold=4.046e+02, percent-clipped=0.0 2023-10-09 21:44:36,439 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=140606.66666666666, ans=0.125 2023-10-09 21:45:06,501 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=140700.0, ans=0.125 2023-10-09 21:45:09,718 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=140746.66666666666, ans=0.125 2023-10-09 21:45:11,267 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.13 vs. limit=15.0 2023-10-09 21:45:29,709 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.13 vs. limit=15.0 2023-10-09 21:45:45,940 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=140840.0, ans=0.125 2023-10-09 21:45:50,064 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=140886.66666666666, ans=0.125 2023-10-09 21:46:04,326 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=140933.33333333334, ans=0.2 2023-10-09 21:46:04,490 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=140933.33333333334, ans=0.125 2023-10-09 21:46:32,937 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.621e+02 1.916e+02 2.204e+02 2.611e+02 3.602e+02, threshold=4.408e+02, percent-clipped=0.0 2023-10-09 21:47:02,218 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=141120.0, ans=0.125 2023-10-09 21:47:21,292 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=141213.33333333334, ans=0.0 2023-10-09 21:47:32,117 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=141260.0, ans=0.0 2023-10-09 21:47:37,438 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=141306.66666666666, ans=0.125 2023-10-09 21:47:53,464 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=141353.33333333334, ans=0.0 2023-10-09 21:47:53,474 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=141353.33333333334, ans=0.125 2023-10-09 21:48:03,741 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=141400.0, ans=0.0 2023-10-09 21:48:04,797 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=141400.0, ans=0.125 2023-10-09 21:48:05,878 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.63 vs. limit=15.0 2023-10-09 21:48:09,284 INFO [train.py:1031] (3/4) Epoch 3, batch 3000, loss[loss=0.2412, simple_loss=0.3261, pruned_loss=0.07815, over 16936.00 frames. ], tot_loss[loss=0.2725, simple_loss=0.3452, pruned_loss=0.09993, over 25478133.34 frames. ], batch size: 165, lr: 1.47e-02, grad_scale: 32.0 2023-10-09 21:48:11,300 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=141446.66666666666, ans=0.0 2023-10-09 21:48:21,496 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.412e+02 1.772e+02 2.067e+02 2.351e+02 4.015e+02, threshold=4.134e+02, percent-clipped=0.0 2023-10-09 21:48:24,538 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=141493.33333333334, ans=0.125 2023-10-09 21:48:31,653 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.70 vs. limit=22.5 2023-10-09 21:48:46,597 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=141586.66666666666, ans=0.0 2023-10-09 21:48:52,001 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=141633.33333333334, ans=0.125 2023-10-09 21:49:39,113 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=141820.0, ans=0.125 2023-10-09 21:49:48,319 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=141866.66666666666, ans=0.0 2023-10-09 21:49:49,696 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-09 21:49:58,203 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.48 vs. 
limit=15.0 2023-10-09 21:49:59,653 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=141913.33333333334, ans=0.0 2023-10-09 21:50:08,952 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=141913.33333333334, ans=0.125 2023-10-09 21:50:10,048 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=141913.33333333334, ans=0.125 2023-10-09 21:50:12,741 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=141913.33333333334, ans=0.1 2023-10-09 21:50:17,951 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.508e+02 1.859e+02 2.097e+02 2.355e+02 3.302e+02, threshold=4.195e+02, percent-clipped=0.0 2023-10-09 21:50:29,127 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=142006.66666666666, ans=0.0 2023-10-09 21:50:35,447 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=142006.66666666666, ans=0.125 2023-10-09 21:50:37,488 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff3.min_abs, batch_count=142053.33333333334, ans=0.2 2023-10-09 21:50:45,280 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=142053.33333333334, ans=0.125 2023-10-09 21:50:57,014 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.93 vs. 
limit=6.0 2023-10-09 21:51:00,878 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=142146.66666666666, ans=0.1 2023-10-09 21:51:16,789 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=142193.33333333334, ans=0.125 2023-10-09 21:51:43,358 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=142286.66666666666, ans=0.2 2023-10-09 21:51:50,220 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-09 21:52:00,494 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=142380.0, ans=0.0 2023-10-09 21:52:11,815 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.391e+02 1.874e+02 2.180e+02 2.546e+02 4.109e+02, threshold=4.360e+02, percent-clipped=0.0 2023-10-09 21:52:17,544 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=142426.66666666666, ans=0.0 2023-10-09 21:52:39,445 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=142520.0, ans=0.0 2023-10-09 21:52:52,075 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=142566.66666666666, ans=0.125 2023-10-09 21:53:30,358 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=142706.66666666666, ans=0.2 2023-10-09 21:53:31,430 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=142706.66666666666, ans=0.125 2023-10-09 21:53:32,936 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=142706.66666666666, ans=0.0 2023-10-09 21:53:34,054 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=142706.66666666666, ans=0.0 2023-10-09 21:53:41,949 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.97 vs. limit=6.0 2023-10-09 21:53:48,820 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.22 vs. 
limit=15.0 2023-10-09 21:53:52,478 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=142800.0, ans=0.0 2023-10-09 21:53:56,779 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=142800.0, ans=0.125 2023-10-09 21:54:01,814 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=142800.0, ans=0.125 2023-10-09 21:54:05,723 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=142846.66666666666, ans=0.025 2023-10-09 21:54:08,191 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=142846.66666666666, ans=0.1 2023-10-09 21:54:16,861 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.433e+02 1.825e+02 2.054e+02 2.290e+02 3.558e+02, threshold=4.107e+02, percent-clipped=0.0 2023-10-09 21:54:24,904 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.57 vs. limit=22.5 2023-10-09 21:54:27,979 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-09 21:54:35,287 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=142986.66666666666, ans=0.2 2023-10-09 21:55:12,389 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=143126.66666666666, ans=0.1 2023-10-09 21:55:34,477 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=143220.0, ans=0.07 2023-10-09 21:55:44,295 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=143220.0, ans=0.125 2023-10-09 21:55:47,556 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.34 vs. limit=10.0 2023-10-09 21:55:49,884 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=143266.66666666666, ans=0.0 2023-10-09 21:56:04,102 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=143313.33333333334, ans=0.2 2023-10-09 21:56:10,464 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.515e+02 1.980e+02 2.273e+02 2.671e+02 4.418e+02, threshold=4.547e+02, percent-clipped=1.0 2023-10-09 21:56:12,030 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.10 vs. limit=15.0 2023-10-09 21:56:15,200 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.49 vs. 
limit=22.5 2023-10-09 21:56:27,237 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=143406.66666666666, ans=0.2 2023-10-09 21:56:28,131 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_abs, batch_count=143406.66666666666, ans=0.5 2023-10-09 21:56:30,616 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=143453.33333333334, ans=0.125 2023-10-09 21:56:31,450 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=143453.33333333334, ans=0.0 2023-10-09 21:56:31,706 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.15 vs. limit=15.0 2023-10-09 21:56:39,724 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=143500.0, ans=0.125 2023-10-09 21:56:43,536 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=143500.0, ans=0.0 2023-10-09 21:56:46,387 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=143500.0, ans=0.125 2023-10-09 21:56:52,386 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.91 vs. limit=15.0 2023-10-09 21:56:56,397 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=143546.66666666666, ans=0.0 2023-10-09 21:56:59,665 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=143546.66666666666, ans=0.125 2023-10-09 21:57:20,683 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=143640.0, ans=0.1 2023-10-09 21:57:20,983 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.69 vs. limit=15.0 2023-10-09 21:57:35,995 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.55 vs. limit=6.0 2023-10-09 21:57:40,132 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.56 vs. limit=15.0 2023-10-09 21:57:41,126 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.33 vs. limit=12.0 2023-10-09 21:57:47,848 INFO [train.py:1031] (3/4) Epoch 3, batch 3500, loss[loss=0.272, simple_loss=0.3442, pruned_loss=0.09989, over 16905.00 frames. ], tot_loss[loss=0.2714, simple_loss=0.3442, pruned_loss=0.09925, over 27102722.12 frames. ], batch size: 110, lr: 1.46e-02, grad_scale: 32.0 2023-10-09 21:57:56,274 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.46 vs. 
limit=6.0 2023-10-09 21:58:02,232 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.506e+02 1.900e+02 2.131e+02 2.507e+02 3.406e+02, threshold=4.262e+02, percent-clipped=0.0 2023-10-09 21:58:11,685 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.55 vs. limit=10.0 2023-10-09 21:58:21,286 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=143920.0, ans=0.2 2023-10-09 21:58:28,511 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=143920.0, ans=0.125 2023-10-09 21:58:28,850 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.15 vs. limit=15.0 2023-10-09 21:58:42,794 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=144013.33333333334, ans=15.0 2023-10-09 21:58:50,608 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.84 vs. limit=15.0 2023-10-09 21:59:29,854 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=144153.33333333334, ans=0.1 2023-10-09 21:59:40,108 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.79 vs. limit=10.0 2023-10-09 21:59:41,556 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=144200.0, ans=0.0 2023-10-09 22:00:03,712 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.509e+02 1.868e+02 2.114e+02 2.596e+02 3.837e+02, threshold=4.227e+02, percent-clipped=0.0 2023-10-09 22:00:22,338 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=144340.0, ans=0.125 2023-10-09 22:00:31,184 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=144386.66666666666, ans=0.0 2023-10-09 22:00:31,344 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=144386.66666666666, ans=0.0 2023-10-09 22:00:43,654 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=144433.33333333334, ans=0.125 2023-10-09 22:00:54,215 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=144480.0, ans=0.5 2023-10-09 22:01:14,311 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=144573.33333333334, ans=0.0 2023-10-09 22:01:16,987 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.62 vs. 
limit=22.5 2023-10-09 22:01:28,737 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=144620.0, ans=0.125 2023-10-09 22:01:45,055 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=144666.66666666666, ans=0.0 2023-10-09 22:01:58,461 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=144760.0, ans=0.125 2023-10-09 22:02:01,705 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.559e+02 1.929e+02 2.281e+02 2.631e+02 4.263e+02, threshold=4.563e+02, percent-clipped=1.0 2023-10-09 22:02:02,002 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=144760.0, ans=0.125 2023-10-09 22:02:12,616 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=144806.66666666666, ans=0.0 2023-10-09 22:02:20,739 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=144806.66666666666, ans=0.125 2023-10-09 22:02:51,947 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=144946.66666666666, ans=0.0 2023-10-09 22:02:55,407 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.98 vs. limit=15.0 2023-10-09 22:03:03,315 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=144993.33333333334, ans=0.125 2023-10-09 22:03:04,363 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=144993.33333333334, ans=0.04949747468305833 2023-10-09 22:03:18,420 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=145040.0, ans=0.2 2023-10-09 22:03:29,133 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=145086.66666666666, ans=0.1 2023-10-09 22:03:33,153 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=145086.66666666666, ans=0.09899494936611666 2023-10-09 22:03:37,263 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=145086.66666666666, ans=0.125 2023-10-09 22:03:42,297 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=145133.33333333334, ans=0.125 2023-10-09 22:03:42,372 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=145133.33333333334, ans=0.5 2023-10-09 22:04:01,626 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=145180.0, ans=0.125 2023-10-09 22:04:06,323 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.480e+02 1.904e+02 2.136e+02 2.407e+02 3.141e+02, threshold=4.272e+02, percent-clipped=0.0 2023-10-09 22:04:20,072 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=145273.33333333334, ans=10.0 2023-10-09 22:04:21,699 INFO 
[scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.98 vs. limit=15.0 2023-10-09 22:04:35,269 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.97 vs. limit=15.0 2023-10-09 22:05:04,588 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=145460.0, ans=0.1 2023-10-09 22:05:23,742 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=145506.66666666666, ans=0.2 2023-10-09 22:05:27,084 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=145553.33333333334, ans=0.0 2023-10-09 22:05:39,507 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=145600.0, ans=0.0 2023-10-09 22:05:42,210 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.88 vs. limit=22.5 2023-10-09 22:06:03,092 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=145693.33333333334, ans=0.125 2023-10-09 22:06:04,624 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.436e+02 1.872e+02 2.136e+02 2.488e+02 3.570e+02, threshold=4.272e+02, percent-clipped=0.0 2023-10-09 22:06:07,566 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=145693.33333333334, ans=0.125 2023-10-09 22:06:09,796 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.52 vs. limit=12.0 2023-10-09 22:06:24,140 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=145786.66666666666, ans=0.125 2023-10-09 22:06:29,821 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=145786.66666666666, ans=0.0 2023-10-09 22:06:31,844 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=145833.33333333334, ans=0.035 2023-10-09 22:07:04,558 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=145926.66666666666, ans=0.04949747468305833 2023-10-09 22:07:15,591 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=145973.33333333334, ans=0.5 2023-10-09 22:07:28,694 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.95 vs. limit=10.0 2023-10-09 22:07:40,675 INFO [train.py:1031] (3/4) Epoch 3, batch 4000, loss[loss=0.3122, simple_loss=0.3788, pruned_loss=0.1228, over 16098.00 frames. ], tot_loss[loss=0.2703, simple_loss=0.3433, pruned_loss=0.09868, over 28370133.67 frames. 
], batch size: 296, lr: 1.45e-02, grad_scale: 32.0 2023-10-09 22:07:59,571 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.436e+02 1.880e+02 2.112e+02 2.537e+02 3.409e+02, threshold=4.224e+02, percent-clipped=0.0 2023-10-09 22:08:13,122 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=146206.66666666666, ans=0.0 2023-10-09 22:08:13,613 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.89 vs. limit=10.0 2023-10-09 22:08:23,522 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 22:08:29,669 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=146253.33333333334, ans=0.2 2023-10-09 22:08:47,244 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=146346.66666666666, ans=0.125 2023-10-09 22:09:13,747 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.62 vs. limit=15.0 2023-10-09 22:09:24,695 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=146486.66666666666, ans=0.0 2023-10-09 22:09:26,601 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=146486.66666666666, ans=0.0 2023-10-09 22:09:32,432 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=146533.33333333334, ans=0.125 2023-10-09 22:09:35,961 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=146533.33333333334, ans=0.1 2023-10-09 22:09:38,318 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=4.67 vs. 
limit=12.0 2023-10-09 22:09:54,724 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.485e+02 1.935e+02 2.190e+02 2.491e+02 3.231e+02, threshold=4.381e+02, percent-clipped=0.0 2023-10-09 22:10:19,454 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=146720.0, ans=0.125 2023-10-09 22:10:19,618 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=146720.0, ans=0.125 2023-10-09 22:10:20,500 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=146720.0, ans=0.0 2023-10-09 22:10:24,928 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=146720.0, ans=0.2 2023-10-09 22:10:30,229 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=146766.66666666666, ans=0.0 2023-10-09 22:10:31,749 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=146766.66666666666, ans=0.0 2023-10-09 22:10:40,287 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=146813.33333333334, ans=0.125 2023-10-09 22:10:47,021 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=146813.33333333334, ans=0.125 2023-10-09 22:10:47,870 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=146813.33333333334, ans=0.09899494936611666 2023-10-09 22:11:10,083 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=17.19 vs. limit=22.5 2023-10-09 22:11:11,937 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.37 vs. limit=15.0 2023-10-09 22:11:18,891 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=1.87 vs. limit=15.0 2023-10-09 22:11:37,714 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-09 22:12:05,811 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.466e+02 1.882e+02 2.150e+02 2.744e+02 3.738e+02, threshold=4.300e+02, percent-clipped=0.0 2023-10-09 22:12:07,416 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.46 vs. 
limit=15.0 2023-10-09 22:12:08,142 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=147093.33333333334, ans=0.0 2023-10-09 22:12:32,332 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=147186.66666666666, ans=0.0 2023-10-09 22:12:39,905 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=147233.33333333334, ans=0.125 2023-10-09 22:12:51,530 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=147280.0, ans=0.2 2023-10-09 22:13:15,411 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.08 vs. limit=15.0 2023-10-09 22:13:17,572 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=147373.33333333334, ans=0.125 2023-10-09 22:13:23,901 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=147420.0, ans=0.07 2023-10-09 22:13:25,725 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=147420.0, ans=0.0 2023-10-09 22:13:29,593 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=147420.0, ans=0.0 2023-10-09 22:13:39,110 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-09 22:13:47,863 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.46 vs. limit=10.0 2023-10-09 22:13:48,658 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=147513.33333333334, ans=0.1 2023-10-09 22:13:52,929 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=147513.33333333334, ans=0.0 2023-10-09 22:14:01,523 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.371e+02 1.894e+02 2.086e+02 2.449e+02 4.266e+02, threshold=4.172e+02, percent-clipped=0.0 2023-10-09 22:14:33,420 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=147700.0, ans=0.125 2023-10-09 22:14:37,947 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=147700.0, ans=0.125 2023-10-09 22:14:50,793 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=147746.66666666666, ans=0.125 2023-10-09 22:14:54,294 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=147746.66666666666, ans=0.0 2023-10-09 22:14:54,415 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.15 vs. 
limit=15.0 2023-10-09 22:15:05,590 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=147793.33333333334, ans=0.125 2023-10-09 22:15:06,968 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=147793.33333333334, ans=0.125 2023-10-09 22:15:15,643 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=147840.0, ans=0.125 2023-10-09 22:15:16,527 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=147840.0, ans=0.125 2023-10-09 22:15:25,466 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=147886.66666666666, ans=0.125 2023-10-09 22:15:41,003 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=147933.33333333334, ans=0.1 2023-10-09 22:16:02,214 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.513e+02 2.131e+02 2.486e+02 2.958e+02 3.911e+02, threshold=4.973e+02, percent-clipped=0.0 2023-10-09 22:16:07,042 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.61 vs. limit=15.0 2023-10-09 22:16:20,887 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=148073.33333333334, ans=0.125 2023-10-09 22:16:37,597 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.27 vs. limit=6.0 2023-10-09 22:16:39,038 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=148166.66666666666, ans=0.125 2023-10-09 22:16:44,188 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=148166.66666666666, ans=0.0 2023-10-09 22:16:44,397 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.21 vs. limit=15.0 2023-10-09 22:16:53,344 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.90 vs. limit=15.0 2023-10-09 22:17:42,820 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.59 vs. limit=10.0 2023-10-09 22:17:51,595 INFO [train.py:1031] (3/4) Epoch 3, batch 4500, loss[loss=0.2613, simple_loss=0.341, pruned_loss=0.0908, over 16980.00 frames. ], tot_loss[loss=0.2699, simple_loss=0.3432, pruned_loss=0.09826, over 29344806.10 frames. ], batch size: 77, lr: 1.44e-02, grad_scale: 32.0 2023-10-09 22:17:52,149 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.27 vs. 
limit=15.0 2023-10-09 22:18:06,813 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.466e+02 1.781e+02 1.940e+02 2.181e+02 3.401e+02, threshold=3.880e+02, percent-clipped=0.0 2023-10-09 22:18:11,066 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=148493.33333333334, ans=0.0 2023-10-09 22:18:16,596 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=148540.0, ans=0.0 2023-10-09 22:18:23,932 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.30 vs. limit=15.0 2023-10-09 22:18:38,216 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=148633.33333333334, ans=0.0 2023-10-09 22:18:42,522 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=148633.33333333334, ans=0.125 2023-10-09 22:18:50,744 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=148680.0, ans=0.125 2023-10-09 22:19:18,171 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=148820.0, ans=0.1 2023-10-09 22:19:33,610 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=148866.66666666666, ans=0.125 2023-10-09 22:19:55,880 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.482e+02 1.908e+02 2.138e+02 2.461e+02 3.826e+02, threshold=4.275e+02, percent-clipped=0.0 2023-10-09 22:20:03,250 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=149006.66666666666, ans=0.125 2023-10-09 22:20:06,352 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=149006.66666666666, ans=0.0 2023-10-09 22:20:06,507 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.15 vs. limit=22.5 2023-10-09 22:20:10,601 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.69 vs. limit=22.5 2023-10-09 22:20:16,592 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=149053.33333333334, ans=0.5 2023-10-09 22:20:53,319 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.45 vs. 
limit=6.0 2023-10-09 22:21:01,717 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=149240.0, ans=0.125 2023-10-09 22:21:23,575 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=149333.33333333334, ans=0.125 2023-10-09 22:21:34,895 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=149380.0, ans=0.025 2023-10-09 22:21:36,558 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=149380.0, ans=0.0 2023-10-09 22:21:49,897 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.586e+02 1.900e+02 2.146e+02 2.556e+02 4.640e+02, threshold=4.292e+02, percent-clipped=1.0 2023-10-09 22:22:18,677 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=149566.66666666666, ans=0.0 2023-10-09 22:22:38,225 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=149660.0, ans=0.2 2023-10-09 22:22:42,541 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=149660.0, ans=0.2 2023-10-09 22:22:53,316 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=149706.66666666666, ans=0.125 2023-10-09 22:23:02,288 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=149753.33333333334, ans=0.125 2023-10-09 22:23:24,541 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=149846.66666666666, ans=0.2 2023-10-09 22:23:31,718 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=149846.66666666666, ans=0.5 2023-10-09 22:23:41,983 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.585e+02 1.878e+02 2.109e+02 2.447e+02 3.980e+02, threshold=4.218e+02, percent-clipped=0.0 2023-10-09 22:23:49,142 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=149940.0, ans=0.1 2023-10-09 22:23:53,194 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=149940.0, ans=0.125 2023-10-09 22:23:58,534 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=149940.0, ans=0.2 2023-10-09 22:24:10,265 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=149986.66666666666, ans=0.1 2023-10-09 22:24:12,044 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=149986.66666666666, ans=0.1 2023-10-09 22:24:30,835 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.26 vs. 
limit=15.0 2023-10-09 22:24:31,577 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=150080.0, ans=0.0 2023-10-09 22:25:24,426 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.52 vs. limit=15.0 2023-10-09 22:25:38,116 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.423e+02 1.754e+02 1.888e+02 2.100e+02 3.041e+02, threshold=3.775e+02, percent-clipped=0.0 2023-10-09 22:26:00,657 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=150453.33333333334, ans=0.125 2023-10-09 22:26:12,977 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=150500.0, ans=0.1 2023-10-09 22:26:34,223 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=150546.66666666666, ans=0.0 2023-10-09 22:26:42,103 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.40 vs. limit=15.0 2023-10-09 22:26:55,501 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=150640.0, ans=0.125 2023-10-09 22:27:21,860 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=150780.0, ans=0.125 2023-10-09 22:27:22,652 INFO [train.py:1031] (3/4) Epoch 3, batch 5000, loss[loss=0.3769, simple_loss=0.4019, pruned_loss=0.176, over 15614.00 frames. ], tot_loss[loss=0.2697, simple_loss=0.3428, pruned_loss=0.0983, over 30118728.62 frames. ], batch size: 350, lr: 1.43e-02, grad_scale: 32.0 2023-10-09 22:27:27,277 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=150780.0, ans=0.0 2023-10-09 22:27:28,552 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.65 vs. limit=12.0 2023-10-09 22:27:39,468 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.635e+02 1.945e+02 2.133e+02 2.539e+02 3.792e+02, threshold=4.266e+02, percent-clipped=1.0 2023-10-09 22:27:47,890 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.00 vs. limit=15.0 2023-10-09 22:27:58,818 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=150920.0, ans=0.125 2023-10-09 22:28:22,885 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=151013.33333333334, ans=0.1 2023-10-09 22:28:25,570 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.97 vs. 
limit=22.5 2023-10-09 22:28:26,903 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=151013.33333333334, ans=0.125 2023-10-09 22:28:34,382 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=151060.0, ans=0.0 2023-10-09 22:28:40,016 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=151060.0, ans=0.1 2023-10-09 22:29:14,568 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=151200.0, ans=0.125 2023-10-09 22:29:14,605 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=151200.0, ans=0.125 2023-10-09 22:29:24,278 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=151246.66666666666, ans=0.95 2023-10-09 22:29:37,522 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.509e+02 1.851e+02 2.076e+02 2.455e+02 3.818e+02, threshold=4.153e+02, percent-clipped=0.0 2023-10-09 22:29:45,189 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=151340.0, ans=15.0 2023-10-09 22:29:49,672 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.74 vs. limit=15.0 2023-10-09 22:30:06,577 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=151433.33333333334, ans=0.0 2023-10-09 22:30:09,948 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=151433.33333333334, ans=0.0 2023-10-09 22:30:10,625 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=151433.33333333334, ans=0.125 2023-10-09 22:30:29,016 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=151526.66666666666, ans=0.1 2023-10-09 22:30:35,756 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys.whitening_limit, batch_count=151526.66666666666, ans=6.0 2023-10-09 22:30:35,756 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.18 vs. 
limit=6.0 2023-10-09 22:30:52,233 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=151620.0, ans=0.1 2023-10-09 22:30:56,399 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=151620.0, ans=0.125 2023-10-09 22:31:14,447 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=151713.33333333334, ans=0.0 2023-10-09 22:31:23,983 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.510e+02 1.879e+02 2.163e+02 2.658e+02 3.647e+02, threshold=4.327e+02, percent-clipped=0.0 2023-10-09 22:31:33,447 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=151806.66666666666, ans=0.125 2023-10-09 22:31:40,714 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 22:31:45,194 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 22:31:51,458 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=151900.0, ans=0.125 2023-10-09 22:32:06,482 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=151946.66666666666, ans=0.2 2023-10-09 22:32:10,135 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=151946.66666666666, ans=0.0 2023-10-09 22:32:11,679 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.51 vs. limit=22.5 2023-10-09 22:32:49,484 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=152133.33333333334, ans=0.2 2023-10-09 22:33:02,336 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.08 vs. limit=15.0 2023-10-09 22:33:11,516 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=152226.66666666666, ans=0.0 2023-10-09 22:33:17,315 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.462e+02 1.879e+02 2.023e+02 2.336e+02 3.757e+02, threshold=4.045e+02, percent-clipped=0.0 2023-10-09 22:33:26,063 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=152273.33333333334, ans=0.0 2023-10-09 22:33:35,136 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=152320.0, ans=10.0 2023-10-09 22:34:07,024 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.41 vs. 
limit=15.0 2023-10-09 22:34:19,674 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=152460.0, ans=0.125 2023-10-09 22:34:38,644 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 22:34:49,420 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=152600.0, ans=0.07 2023-10-09 22:35:07,718 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.387e+02 1.846e+02 2.041e+02 2.254e+02 3.753e+02, threshold=4.081e+02, percent-clipped=0.0 2023-10-09 22:35:24,608 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=152786.66666666666, ans=0.0 2023-10-09 22:35:37,816 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=152833.33333333334, ans=0.125 2023-10-09 22:35:44,602 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_na.min_abs, batch_count=152880.0, ans=0.02 2023-10-09 22:35:47,138 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=152880.0, ans=0.125 2023-10-09 22:36:13,249 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=152973.33333333334, ans=15.0 2023-10-09 22:36:16,346 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.65 vs. limit=15.0 2023-10-09 22:36:18,181 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=152973.33333333334, ans=0.1 2023-10-09 22:36:19,432 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.29 vs. limit=15.0 2023-10-09 22:36:41,133 INFO [train.py:1031] (3/4) Epoch 3, batch 5500, loss[loss=0.2537, simple_loss=0.3287, pruned_loss=0.08933, over 15923.00 frames. ], tot_loss[loss=0.2683, simple_loss=0.342, pruned_loss=0.09733, over 30730310.24 frames. ], batch size: 43, lr: 1.42e-02, grad_scale: 32.0 2023-10-09 22:36:48,158 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.58 vs. 
limit=15.0 2023-10-09 22:36:53,201 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=153160.0, ans=0.125 2023-10-09 22:36:54,236 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=153160.0, ans=0.0 2023-10-09 22:36:55,347 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.485e+02 1.816e+02 2.162e+02 2.499e+02 3.262e+02, threshold=4.324e+02, percent-clipped=0.0 2023-10-09 22:36:57,588 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=153160.0, ans=0.1 2023-10-09 22:36:59,203 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=153160.0, ans=0.0 2023-10-09 22:37:10,104 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=153206.66666666666, ans=0.0 2023-10-09 22:37:36,379 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=153346.66666666666, ans=0.125 2023-10-09 22:37:51,820 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.72 vs. limit=15.0 2023-10-09 22:38:03,036 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=153440.0, ans=0.0 2023-10-09 22:38:18,972 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=153533.33333333334, ans=0.125 2023-10-09 22:38:23,418 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=153533.33333333334, ans=0.125 2023-10-09 22:38:25,194 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=153533.33333333334, ans=0.125 2023-10-09 22:38:25,205 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=153533.33333333334, ans=0.0 2023-10-09 22:38:39,127 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.73 vs. 
limit=22.5 2023-10-09 22:38:42,551 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.575e+02 1.855e+02 2.043e+02 2.293e+02 3.025e+02, threshold=4.086e+02, percent-clipped=0.0 2023-10-09 22:39:37,095 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=153860.0, ans=0.125 2023-10-09 22:39:44,964 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=153906.66666666666, ans=0.0 2023-10-09 22:39:56,195 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=153906.66666666666, ans=0.1 2023-10-09 22:39:59,088 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=153953.33333333334, ans=0.0 2023-10-09 22:40:01,895 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=153953.33333333334, ans=0.125 2023-10-09 22:40:01,904 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=153953.33333333334, ans=0.125 2023-10-09 22:40:02,771 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=153953.33333333334, ans=0.1 2023-10-09 22:40:16,173 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.27 vs. limit=22.5 2023-10-09 22:40:21,350 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=154046.66666666666, ans=0.0 2023-10-09 22:40:29,080 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.44 vs. limit=22.5 2023-10-09 22:40:37,275 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.430e+02 1.841e+02 2.148e+02 2.341e+02 3.683e+02, threshold=4.296e+02, percent-clipped=0.0 2023-10-09 22:40:37,678 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=154093.33333333334, ans=0.125 2023-10-09 22:40:41,642 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.89 vs. 
limit=15.0 2023-10-09 22:40:47,695 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=154140.0, ans=0.0 2023-10-09 22:40:55,548 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=154186.66666666666, ans=0.1 2023-10-09 22:41:16,091 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=154280.0, ans=0.0 2023-10-09 22:41:34,513 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=154326.66666666666, ans=0.0 2023-10-09 22:42:04,288 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=154466.66666666666, ans=0.04949747468305833 2023-10-09 22:42:06,098 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=154466.66666666666, ans=0.125 2023-10-09 22:42:11,551 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=154466.66666666666, ans=0.125 2023-10-09 22:42:27,131 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=154560.0, ans=0.125 2023-10-09 22:42:30,708 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=154560.0, ans=0.0 2023-10-09 22:42:31,917 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=154560.0, ans=0.125 2023-10-09 22:42:32,449 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.421e+02 1.846e+02 2.006e+02 2.394e+02 3.432e+02, threshold=4.012e+02, percent-clipped=0.0 2023-10-09 22:42:47,460 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=154606.66666666666, ans=0.0 2023-10-09 22:43:08,509 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=154700.0, ans=0.2 2023-10-09 22:43:25,771 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=154793.33333333334, ans=0.1 2023-10-09 22:43:35,124 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.83 vs. limit=6.0 2023-10-09 22:43:45,913 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=154840.0, ans=0.0 2023-10-09 22:43:52,446 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=154886.66666666666, ans=0.2 2023-10-09 22:43:58,043 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=4.73 vs. limit=15.0 2023-10-09 22:44:10,427 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.42 vs. 
limit=6.0 2023-10-09 22:44:16,361 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=154980.0, ans=0.1 2023-10-09 22:44:24,055 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=155026.66666666666, ans=0.015 2023-10-09 22:44:26,366 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.443e+02 1.748e+02 1.963e+02 2.215e+02 3.130e+02, threshold=3.927e+02, percent-clipped=0.0 2023-10-09 22:44:27,107 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.39 vs. limit=10.0 2023-10-09 22:44:31,225 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=155026.66666666666, ans=0.0 2023-10-09 22:44:57,234 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.21 vs. limit=22.5 2023-10-09 22:45:21,933 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.51 vs. limit=15.0 2023-10-09 22:45:22,599 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 22:45:23,461 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=155260.0, ans=0.125 2023-10-09 22:45:36,710 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.93 vs. limit=22.5 2023-10-09 22:45:40,214 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=155306.66666666666, ans=0.1 2023-10-09 22:45:59,167 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=155400.0, ans=0.0 2023-10-09 22:46:04,102 INFO [train.py:1031] (3/4) Epoch 3, batch 6000, loss[loss=0.2444, simple_loss=0.3191, pruned_loss=0.08488, over 15758.00 frames. ], tot_loss[loss=0.2686, simple_loss=0.3421, pruned_loss=0.09751, over 31186774.43 frames. 
], batch size: 35, lr: 1.41e-02, grad_scale: 64.0 2023-10-09 22:46:10,702 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=155446.66666666666, ans=0.2 2023-10-09 22:46:20,091 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.539e+02 1.913e+02 2.142e+02 2.403e+02 3.382e+02, threshold=4.283e+02, percent-clipped=0.0 2023-10-09 22:46:29,746 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=155540.0, ans=0.1 2023-10-09 22:46:40,720 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=155586.66666666666, ans=0.125 2023-10-09 22:46:42,659 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=155586.66666666666, ans=0.125 2023-10-09 22:46:55,286 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=155633.33333333334, ans=0.125 2023-10-09 22:46:57,535 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=155633.33333333334, ans=0.1 2023-10-09 22:47:06,898 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=155680.0, ans=0.125 2023-10-09 22:47:13,792 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=155726.66666666666, ans=0.0 2023-10-09 22:47:17,330 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=155726.66666666666, ans=0.95 2023-10-09 22:47:32,231 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=155820.0, ans=0.125 2023-10-09 22:47:40,518 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.min_positive, batch_count=155820.0, ans=0.025 2023-10-09 22:47:57,679 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=155913.33333333334, ans=0.125 2023-10-09 22:47:59,290 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=155913.33333333334, ans=0.125 2023-10-09 22:48:08,425 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.425e+02 1.821e+02 2.040e+02 2.331e+02 3.202e+02, threshold=4.080e+02, percent-clipped=0.0 2023-10-09 22:48:41,578 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=156100.0, ans=0.125 2023-10-09 22:48:48,286 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=156146.66666666666, ans=0.0 2023-10-09 22:49:02,169 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=156193.33333333334, ans=0.125 2023-10-09 22:49:37,118 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=156333.33333333334, ans=0.5 2023-10-09 22:49:44,281 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=156380.0, 
ans=0.125 2023-10-09 22:49:52,595 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=156380.0, ans=0.125 2023-10-09 22:49:55,477 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=156426.66666666666, ans=0.125 2023-10-09 22:49:58,774 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.594e+02 1.870e+02 2.048e+02 2.337e+02 3.272e+02, threshold=4.097e+02, percent-clipped=0.0 2023-10-09 22:50:00,129 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=156426.66666666666, ans=0.0 2023-10-09 22:50:13,403 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.49 vs. limit=15.0 2023-10-09 22:50:27,598 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=156566.66666666666, ans=0.125 2023-10-09 22:50:30,784 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=4.82 vs. limit=12.0 2023-10-09 22:50:35,823 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=156566.66666666666, ans=0.1 2023-10-09 22:50:45,699 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=156613.33333333334, ans=0.125 2023-10-09 22:51:01,850 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=156706.66666666666, ans=0.125 2023-10-09 22:51:05,181 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=156706.66666666666, ans=0.2 2023-10-09 22:51:30,183 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=156800.0, ans=0.5 2023-10-09 22:51:30,381 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=156800.0, ans=0.025 2023-10-09 22:51:32,346 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=156800.0, ans=0.1 2023-10-09 22:51:37,617 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.10 vs. limit=6.0 2023-10-09 22:51:39,074 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=156846.66666666666, ans=0.125 2023-10-09 22:51:40,388 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.13 vs. limit=22.5 2023-10-09 22:51:44,689 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=156846.66666666666, ans=0.125 2023-10-09 22:51:53,632 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.509e+02 1.832e+02 2.055e+02 2.263e+02 4.156e+02, threshold=4.110e+02, percent-clipped=1.0 2023-10-09 22:51:56,840 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=16.96 vs. 
limit=22.5 2023-10-09 22:52:18,322 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=156986.66666666666, ans=0.0 2023-10-09 22:52:23,811 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=157033.33333333334, ans=0.0 2023-10-09 22:52:33,223 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=157080.0, ans=0.125 2023-10-09 22:52:57,539 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=157126.66666666666, ans=0.0 2023-10-09 22:53:07,948 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.98 vs. limit=10.0 2023-10-09 22:53:22,952 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.70 vs. limit=15.0 2023-10-09 22:53:37,003 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_ff3.min_abs, batch_count=157266.66666666666, ans=0.2 2023-10-09 22:53:40,063 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=157313.33333333334, ans=0.025 2023-10-09 22:53:40,959 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=157313.33333333334, ans=0.1 2023-10-09 22:53:42,455 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.04 vs. limit=15.0 2023-10-09 22:53:54,916 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.241e+02 1.853e+02 2.024e+02 2.504e+02 3.451e+02, threshold=4.047e+02, percent-clipped=0.0 2023-10-09 22:53:57,489 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=157360.0, ans=0.2 2023-10-09 22:54:06,474 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=157406.66666666666, ans=0.125 2023-10-09 22:54:10,654 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=157406.66666666666, ans=0.04949747468305833 2023-10-09 22:54:15,086 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=157453.33333333334, ans=0.0 2023-10-09 22:54:15,103 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=157453.33333333334, ans=0.125 2023-10-09 22:54:35,251 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=157546.66666666666, ans=0.125 2023-10-09 22:54:52,287 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=157593.33333333334, ans=0.125 2023-10-09 22:55:00,600 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.69 vs. 
limit=22.5 2023-10-09 22:55:06,974 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=157640.0, ans=0.125 2023-10-09 22:55:20,121 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.86 vs. limit=12.0 2023-10-09 22:55:20,188 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.09 vs. limit=10.0 2023-10-09 22:55:20,923 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=157733.33333333334, ans=0.125 2023-10-09 22:55:22,075 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=157733.33333333334, ans=0.125 2023-10-09 22:55:28,857 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=157733.33333333334, ans=0.125 2023-10-09 22:55:29,010 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.18 vs. limit=15.0 2023-10-09 22:55:32,342 INFO [train.py:1031] (3/4) Epoch 3, batch 6500, loss[loss=0.2916, simple_loss=0.3557, pruned_loss=0.1137, over 16289.00 frames. ], tot_loss[loss=0.2684, simple_loss=0.3421, pruned_loss=0.09734, over 31556144.34 frames. ], batch size: 50, lr: 1.40e-02, grad_scale: 32.0 2023-10-09 22:55:48,753 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=157826.66666666666, ans=0.125 2023-10-09 22:55:49,034 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.21 vs. limit=15.0 2023-10-09 22:55:53,366 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.486e+02 1.814e+02 2.030e+02 2.227e+02 3.213e+02, threshold=4.061e+02, percent-clipped=0.0 2023-10-09 22:56:23,858 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=157920.0, ans=0.2 2023-10-09 22:56:24,875 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.72 vs. limit=10.0 2023-10-09 22:56:27,706 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=157966.66666666666, ans=0.2 2023-10-09 22:56:36,011 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff2.min_abs, batch_count=157966.66666666666, ans=0.1 2023-10-09 22:56:42,147 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.68 vs. limit=15.0 2023-10-09 22:56:44,839 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=158013.33333333334, ans=0.125 2023-10-09 22:56:45,012 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=158013.33333333334, ans=0.1 2023-10-09 22:57:07,929 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.54 vs. 
limit=15.0 2023-10-09 22:57:08,644 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=158106.66666666666, ans=0.125 2023-10-09 22:57:08,645 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=158106.66666666666, ans=0.1 2023-10-09 22:57:09,940 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=158106.66666666666, ans=0.125 2023-10-09 22:57:20,067 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=158153.33333333334, ans=0.0 2023-10-09 22:57:20,883 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=158153.33333333334, ans=0.2 2023-10-09 22:57:21,827 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=158153.33333333334, ans=0.125 2023-10-09 22:57:41,377 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=158246.66666666666, ans=0.125 2023-10-09 22:57:57,422 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.469e+02 1.791e+02 2.039e+02 2.436e+02 4.084e+02, threshold=4.078e+02, percent-clipped=1.0 2023-10-09 22:58:08,611 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=158340.0, ans=0.0 2023-10-09 22:58:19,437 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=158386.66666666666, ans=0.0 2023-10-09 22:58:19,505 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=158386.66666666666, ans=0.125 2023-10-09 22:58:55,171 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.60 vs. limit=6.0 2023-10-09 22:59:00,557 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.26 vs. 
limit=15.0 2023-10-09 22:59:09,654 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=158620.0, ans=0.125 2023-10-09 22:59:28,969 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=158713.33333333334, ans=0.125 2023-10-09 22:59:32,841 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=158713.33333333334, ans=0.0 2023-10-09 22:59:39,332 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=158760.0, ans=0.1 2023-10-09 22:59:43,060 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.507e+02 1.877e+02 2.135e+02 2.495e+02 3.687e+02, threshold=4.269e+02, percent-clipped=0.0 2023-10-09 22:59:57,373 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=158806.66666666666, ans=0.125 2023-10-09 23:00:02,320 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=158853.33333333334, ans=0.035 2023-10-09 23:00:02,401 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=158853.33333333334, ans=0.0 2023-10-09 23:00:07,658 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=158853.33333333334, ans=0.0 2023-10-09 23:00:35,353 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=158993.33333333334, ans=0.0 2023-10-09 23:00:47,109 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.62 vs. 
limit=15.0 2023-10-09 23:00:52,285 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=159040.0, ans=0.125 2023-10-09 23:01:11,258 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=159133.33333333334, ans=0.1 2023-10-09 23:01:20,069 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=159133.33333333334, ans=0.125 2023-10-09 23:01:27,571 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=159180.0, ans=0.0 2023-10-09 23:01:28,013 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=159180.0, ans=15.0 2023-10-09 23:01:43,418 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=159226.66666666666, ans=0.125 2023-10-09 23:01:45,550 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys.whitening_limit, batch_count=159226.66666666666, ans=6.0 2023-10-09 23:01:50,653 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.387e+02 1.831e+02 2.052e+02 2.535e+02 4.887e+02, threshold=4.104e+02, percent-clipped=4.0 2023-10-09 23:02:01,737 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=159273.33333333334, ans=0.125 2023-10-09 23:02:11,801 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=159320.0, ans=10.0 2023-10-09 23:02:38,045 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=159413.33333333334, ans=0.0 2023-10-09 23:03:05,864 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.84 vs. limit=15.0 2023-10-09 23:03:09,124 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=159553.33333333334, ans=0.0 2023-10-09 23:03:14,016 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=4.40 vs. 
limit=15.0 2023-10-09 23:03:18,338 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 23:03:22,413 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=159600.0, ans=0.125 2023-10-09 23:03:30,163 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=159646.66666666666, ans=0.125 2023-10-09 23:03:31,020 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=159646.66666666666, ans=0.0 2023-10-09 23:03:42,080 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=159693.33333333334, ans=0.125 2023-10-09 23:03:44,310 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.381e+02 1.730e+02 1.923e+02 2.193e+02 3.245e+02, threshold=3.846e+02, percent-clipped=0.0 2023-10-09 23:03:48,243 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=159693.33333333334, ans=0.125 2023-10-09 23:04:04,382 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=159786.66666666666, ans=0.125 2023-10-09 23:04:06,283 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=159786.66666666666, ans=0.125 2023-10-09 23:04:09,242 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.68 vs. limit=22.5 2023-10-09 23:04:23,017 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=159880.0, ans=0.125 2023-10-09 23:04:27,465 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=159880.0, ans=0.0 2023-10-09 23:04:29,657 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=159880.0, ans=0.125 2023-10-09 23:04:38,044 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=159926.66666666666, ans=0.125 2023-10-09 23:04:43,839 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=159926.66666666666, ans=0.1 2023-10-09 23:04:53,747 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=159973.33333333334, ans=0.125 2023-10-09 23:05:03,503 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=160020.0, ans=0.125 2023-10-09 23:05:04,368 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=160020.0, ans=0.125 2023-10-09 23:05:16,635 INFO [train.py:1031] (3/4) Epoch 3, batch 7000, loss[loss=0.2699, simple_loss=0.3432, pruned_loss=0.09835, over 15604.00 frames. ], tot_loss[loss=0.2678, simple_loss=0.342, pruned_loss=0.09684, over 31842271.25 frames. 
], batch size: 35, lr: 1.39e-02, grad_scale: 32.0 2023-10-09 23:05:19,961 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.53 vs. limit=15.0 2023-10-09 23:05:36,566 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.543e+02 1.985e+02 2.202e+02 2.515e+02 3.660e+02, threshold=4.404e+02, percent-clipped=0.0 2023-10-09 23:05:42,927 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=160206.66666666666, ans=0.125 2023-10-09 23:05:49,647 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=160206.66666666666, ans=0.125 2023-10-09 23:06:00,088 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=160253.33333333334, ans=0.125 2023-10-09 23:06:06,302 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.15 vs. limit=15.0 2023-10-09 23:06:41,929 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=160440.0, ans=0.125 2023-10-09 23:06:49,890 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=160486.66666666666, ans=0.125 2023-10-09 23:06:54,756 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=160486.66666666666, ans=0.0 2023-10-09 23:06:56,609 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=160486.66666666666, ans=0.1 2023-10-09 23:07:25,166 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.463e+02 1.893e+02 2.182e+02 2.571e+02 3.595e+02, threshold=4.365e+02, percent-clipped=0.0 2023-10-09 23:07:36,579 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten.whitening_limit, batch_count=160673.33333333334, ans=15.0 2023-10-09 23:07:43,454 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.28 vs. limit=10.0 2023-10-09 23:07:50,509 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=160720.0, ans=0.125 2023-10-09 23:07:53,191 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=160720.0, ans=0.0 2023-10-09 23:08:15,499 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.36 vs. 
limit=15.0 2023-10-09 23:08:16,139 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=160860.0, ans=0.125 2023-10-09 23:08:33,359 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=160906.66666666666, ans=0.1 2023-10-09 23:08:47,840 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-09 23:08:49,587 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=161000.0, ans=0.2 2023-10-09 23:08:53,781 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=161000.0, ans=0.125 2023-10-09 23:08:54,649 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=161000.0, ans=0.125 2023-10-09 23:08:54,990 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.39 vs. limit=15.0 2023-10-09 23:09:17,776 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.01 vs. limit=15.0 2023-10-09 23:09:26,023 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.501e+02 1.832e+02 2.093e+02 2.381e+02 3.465e+02, threshold=4.186e+02, percent-clipped=0.0 2023-10-09 23:09:35,197 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=161140.0, ans=0.1 2023-10-09 23:09:43,856 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=161140.0, ans=0.125 2023-10-09 23:09:47,088 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=161186.66666666666, ans=0.0 2023-10-09 23:09:49,971 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=161186.66666666666, ans=0.09899494936611666 2023-10-09 23:09:55,811 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.84 vs. limit=22.5 2023-10-09 23:10:18,653 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.48 vs. limit=10.0 2023-10-09 23:10:20,891 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=161326.66666666666, ans=0.2 2023-10-09 23:10:37,748 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=161373.33333333334, ans=0.125 2023-10-09 23:11:20,417 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=161560.0, ans=0.035 2023-10-09 23:11:26,497 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.550e+02 1.894e+02 2.137e+02 2.627e+02 4.683e+02, threshold=4.275e+02, percent-clipped=2.0 2023-10-09 23:11:30,692 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.93 vs. 
limit=10.0 2023-10-09 23:11:33,240 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.71 vs. limit=15.0 2023-10-09 23:11:39,231 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=161606.66666666666, ans=0.2 2023-10-09 23:11:49,064 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=161653.33333333334, ans=0.125 2023-10-09 23:11:50,139 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.53 vs. limit=15.0 2023-10-09 23:12:07,647 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=161746.66666666666, ans=0.125 2023-10-09 23:12:10,796 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=5.73 vs. limit=12.0 2023-10-09 23:12:34,474 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=161840.0, ans=15.0 2023-10-09 23:12:39,076 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.67 vs. limit=15.0 2023-10-09 23:12:59,966 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=161980.0, ans=0.0 2023-10-09 23:13:02,525 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=161980.0, ans=0.0 2023-10-09 23:13:07,757 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 23:13:16,489 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.353e+02 1.873e+02 2.144e+02 2.446e+02 4.057e+02, threshold=4.289e+02, percent-clipped=0.0 2023-10-09 23:13:36,730 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=162120.0, ans=0.125 2023-10-09 23:13:39,786 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.20 vs. limit=15.0 2023-10-09 23:13:43,311 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=162166.66666666666, ans=0.0 2023-10-09 23:13:59,805 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=162213.33333333334, ans=0.125 2023-10-09 23:14:01,856 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=1.90 vs. limit=15.0 2023-10-09 23:14:04,634 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=162260.0, ans=0.125 2023-10-09 23:14:19,459 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.72 vs. 
limit=12.0 2023-10-09 23:14:22,414 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=162306.66666666666, ans=0.125 2023-10-09 23:14:42,361 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.64 vs. limit=22.5 2023-10-09 23:14:46,529 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=162446.66666666666, ans=0.125 2023-10-09 23:14:47,194 INFO [train.py:1031] (3/4) Epoch 3, batch 7500, loss[loss=0.3221, simple_loss=0.3806, pruned_loss=0.1318, over 16665.00 frames. ], tot_loss[loss=0.2676, simple_loss=0.3416, pruned_loss=0.0968, over 32030091.25 frames. ], batch size: 202, lr: 1.38e-02, grad_scale: 32.0 2023-10-09 23:14:47,700 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.64 vs. limit=22.5 2023-10-09 23:14:54,398 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.47 vs. limit=15.0 2023-10-09 23:14:59,435 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 23:15:03,901 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.535e+02 1.948e+02 2.180e+02 2.537e+02 3.314e+02, threshold=4.359e+02, percent-clipped=0.0 2023-10-09 23:15:05,444 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=162493.33333333334, ans=0.125 2023-10-09 23:15:09,674 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=162540.0, ans=0.125 2023-10-09 23:15:16,057 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.41 vs. limit=15.0 2023-10-09 23:15:22,126 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=162586.66666666666, ans=0.0 2023-10-09 23:15:29,122 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=162586.66666666666, ans=0.0 2023-10-09 23:15:33,743 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.49 vs. 
limit=22.5 2023-10-09 23:15:34,413 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=162633.33333333334, ans=0.5 2023-10-09 23:15:50,846 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=162680.0, ans=0.07 2023-10-09 23:16:17,727 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=162820.0, ans=0.125 2023-10-09 23:16:25,490 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=162820.0, ans=0.2 2023-10-09 23:16:40,002 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=162913.33333333334, ans=0.95 2023-10-09 23:16:55,154 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.819e+02 2.026e+02 2.244e+02 2.971e+02, threshold=4.053e+02, percent-clipped=0.0 2023-10-09 23:17:18,516 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=163053.33333333334, ans=0.125 2023-10-09 23:17:26,421 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.49 vs. limit=15.0 2023-10-09 23:17:52,980 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=163146.66666666666, ans=0.125 2023-10-09 23:18:01,480 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=163193.33333333334, ans=0.2 2023-10-09 23:18:04,192 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=163193.33333333334, ans=0.025 2023-10-09 23:18:08,294 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=163240.0, ans=0.125 2023-10-09 23:18:11,344 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=163240.0, ans=0.07 2023-10-09 23:18:40,094 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=163333.33333333334, ans=0.07 2023-10-09 23:18:56,023 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=163426.66666666666, ans=0.1 2023-10-09 23:18:59,524 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.511e+02 1.831e+02 2.034e+02 2.380e+02 3.465e+02, threshold=4.068e+02, percent-clipped=0.0 2023-10-09 23:19:18,397 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=163520.0, ans=0.0 2023-10-09 23:19:19,351 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=163520.0, ans=0.1 2023-10-09 23:19:49,346 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=163660.0, ans=0.125 2023-10-09 23:19:54,788 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=163660.0, ans=0.2 2023-10-09 23:20:05,165 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=163706.66666666666, ans=0.125 
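The [optim.py:471] lines above print the quartiles of recent gradient norms (min, 25%, median, 75%, max) together with the clipping threshold, and throughout this section the threshold equals Clipping_scale times the median quartile (e.g. 2.0 x 2.113e+02 = 4.226e+02 in the entry at 23:20:47). The Python sketch below reproduces that printed relationship; it is a minimal illustration, not the actual icefall optim.py implementation (which maintains a running history of norms), and the helper name grad_clip_stats is made up for this example.

import torch

def grad_clip_stats(grad_norms: torch.Tensor, clipping_scale: float = 2.0):
    # Quartiles in the order the log prints them: min, 25%, 50%, 75%, max.
    probs = torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0], dtype=grad_norms.dtype)
    q = torch.quantile(grad_norms, probs)
    # Threshold = clipping_scale times the median, matching the log lines.
    threshold = clipping_scale * q[2]
    percent_clipped = 100.0 * (grad_norms > threshold).float().mean()
    return q, threshold, percent_clipped

# Mirrors the quartile line at 22:35:07 above: with these norms the derived
# threshold is ~4.082e+02 and no gradients exceed it (percent-clipped=0.0).
norms = torch.tensor([138.7, 184.6, 204.1, 225.4, 375.3])
q, thr, pct = grad_clip_stats(norms)

The same check holds for the other quartile lines in this section (e.g. 2.0 x 2.162e+02 = 4.324e+02, 2.0 x 2.043e+02 = 4.086e+02), and percent-clipped is nonzero exactly when the printed max exceeds the derived threshold.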
2023-10-09 23:20:06,294 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=163706.66666666666, ans=0.0 2023-10-09 23:20:21,152 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=163800.0, ans=0.5 2023-10-09 23:20:35,633 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.12 vs. limit=22.5 2023-10-09 23:20:41,307 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.19 vs. limit=22.5 2023-10-09 23:20:46,697 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=163893.33333333334, ans=0.125 2023-10-09 23:20:47,193 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.509e+02 1.883e+02 2.113e+02 2.504e+02 3.350e+02, threshold=4.226e+02, percent-clipped=0.0 2023-10-09 23:21:08,511 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=163986.66666666666, ans=0.125 2023-10-09 23:21:10,972 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=163986.66666666666, ans=0.125 2023-10-09 23:21:36,545 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=164080.0, ans=0.0 2023-10-09 23:21:44,277 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=164080.0, ans=0.0 2023-10-09 23:21:54,750 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=164126.66666666666, ans=0.1 2023-10-09 23:21:59,586 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=164173.33333333334, ans=0.07 2023-10-09 23:22:06,937 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.56 vs. limit=15.0 2023-10-09 23:22:11,319 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=164220.0, ans=0.1 2023-10-09 23:22:23,305 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.77 vs. limit=15.0 2023-10-09 23:22:27,716 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=164313.33333333334, ans=0.125 2023-10-09 23:22:28,970 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.54 vs. limit=22.5 2023-10-09 23:22:33,583 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.22 vs. 
limit=10.0 2023-10-09 23:22:38,923 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=164360.0, ans=0.2 2023-10-09 23:22:46,515 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.453e+02 1.816e+02 2.018e+02 2.338e+02 3.283e+02, threshold=4.036e+02, percent-clipped=0.0 2023-10-09 23:22:47,951 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten.whitening_limit, batch_count=164360.0, ans=15.0 2023-10-09 23:22:53,108 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=164406.66666666666, ans=0.125 2023-10-09 23:23:01,334 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=164406.66666666666, ans=0.125 2023-10-09 23:23:23,911 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=164500.0, ans=0.0 2023-10-09 23:23:29,887 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.21 vs. limit=15.0 2023-10-09 23:23:44,824 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=164593.33333333334, ans=22.5 2023-10-09 23:23:56,527 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=164640.0, ans=0.125 2023-10-09 23:24:10,637 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 23:24:24,500 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=164780.0, ans=0.0 2023-10-09 23:24:25,153 INFO [train.py:1031] (3/4) Epoch 3, batch 8000, loss[loss=0.2499, simple_loss=0.325, pruned_loss=0.08742, over 16577.00 frames. ], tot_loss[loss=0.2657, simple_loss=0.3403, pruned_loss=0.09555, over 32207482.10 frames. 
], batch size: 219, lr: 1.37e-02, grad_scale: 32.0 2023-10-09 23:24:35,382 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=164826.66666666666, ans=0.0 2023-10-09 23:24:37,710 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=164826.66666666666, ans=0.09899494936611666 2023-10-09 23:24:42,320 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.347e+02 1.917e+02 2.164e+02 2.596e+02 4.440e+02, threshold=4.328e+02, percent-clipped=5.0 2023-10-09 23:24:46,713 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=164873.33333333334, ans=0.0 2023-10-09 23:24:48,583 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=164873.33333333334, ans=0.2 2023-10-09 23:25:02,638 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=164920.0, ans=0.125 2023-10-09 23:25:06,862 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=164920.0, ans=0.2 2023-10-09 23:25:09,194 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=164966.66666666666, ans=0.0 2023-10-09 23:25:25,516 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=165013.33333333334, ans=0.0 2023-10-09 23:25:41,484 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=165106.66666666666, ans=0.125 2023-10-09 23:25:45,871 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=165106.66666666666, ans=0.125 2023-10-09 23:26:09,776 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=165200.0, ans=0.035 2023-10-09 23:26:10,061 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.57 vs. 
limit=12.0 2023-10-09 23:26:10,896 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=165200.0, ans=0.2 2023-10-09 23:26:27,947 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.429e+02 1.827e+02 2.038e+02 2.280e+02 3.156e+02, threshold=4.076e+02, percent-clipped=0.0 2023-10-09 23:26:54,412 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=165386.66666666666, ans=0.125 2023-10-09 23:26:56,208 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=165433.33333333334, ans=0.1 2023-10-09 23:27:35,665 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=165526.66666666666, ans=0.0 2023-10-09 23:28:12,548 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=165666.66666666666, ans=0.2 2023-10-09 23:28:35,145 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=165760.0, ans=0.0 2023-10-09 23:28:37,960 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.552e+02 1.853e+02 2.158e+02 2.601e+02 3.634e+02, threshold=4.317e+02, percent-clipped=0.0 2023-10-09 23:28:44,890 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=165806.66666666666, ans=0.125 2023-10-09 23:29:14,160 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=165900.0, ans=0.015 2023-10-09 23:29:16,572 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.08 vs. limit=15.0 2023-10-09 23:29:18,221 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=165946.66666666666, ans=0.125 2023-10-09 23:29:49,336 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=166040.0, ans=0.125 2023-10-09 23:29:49,402 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-09 23:29:50,161 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=166086.66666666666, ans=0.2 2023-10-09 23:30:01,498 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.96 vs. limit=15.0 2023-10-09 23:30:10,012 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=166133.33333333334, ans=0.0 2023-10-09 23:30:14,960 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=166180.0, ans=0.125 2023-10-09 23:30:28,104 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.40 vs. 
2023-10-09 23:30:34,565 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.429e+02 1.811e+02 2.053e+02 2.369e+02 4.268e+02, threshold=4.106e+02, percent-clipped=0.0 2023-10-09 23:30:36,731 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=166226.66666666666, ans=0.1 2023-10-09 23:30:49,738 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=166273.33333333334, ans=0.125 2023-10-09 23:31:03,671 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=166366.66666666666, ans=0.5 2023-10-09 23:31:07,529 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=166366.66666666666, ans=0.05 2023-10-09 23:31:19,316 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.16 vs. limit=6.0 2023-10-09 23:31:21,177 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=166413.33333333334, ans=0.125 2023-10-09 23:31:27,998 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=166460.0, ans=0.125 2023-10-09 23:31:53,647 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=166553.33333333334, ans=0.0 2023-10-09 23:31:59,048 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=166600.0, ans=0.0 2023-10-09 23:32:00,026 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=166600.0, ans=0.0 2023-10-09 23:32:02,881 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.11 vs.
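limit=15.0

The optim.py entries summarize the recent distribution of per-batch gradient norms as five quantiles (min, 25%, median, 75%, max) plus the clipping threshold and the percentage of recent batches that were clipped. The printed numbers are consistent with threshold = Clipping_scale x median: in the record just above, 2.0 x 2.053e+02 = 4.106e+02. A sketch of that bookkeeping; the window size is an assumption, and the real ScaledAdam-style clipping has more machinery than this:

    import collections
    import torch

    class GradNormClipper:
        def __init__(self, clipping_scale=2.0, window=500):
            self.clipping_scale = clipping_scale
            self.norms = collections.deque(maxlen=window)
            self.num_batches = 0
            self.num_clipped = 0

        def clip_(self, params):
            params = [p for p in params if p.grad is not None]
            norm = torch.norm(torch.stack([p.grad.norm() for p in params]))
            self.norms.append(norm.item())
            self.num_batches += 1
            q = torch.quantile(torch.tensor(list(self.norms)),
                               torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
            threshold = self.clipping_scale * q[2].item()  # scale * median
            if norm.item() > threshold:
                self.num_clipped += 1
                for p in params:
                    p.grad.mul_(threshold / norm.item())
            percent = 100.0 * self.num_clipped / self.num_batches
            return q.tolist(), threshold, percent
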
2023-10-09 23:32:08,159 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=166600.0, ans=0.125 2023-10-09 23:32:16,307 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=166646.66666666666, ans=0.125 2023-10-09 23:32:29,114 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.507e+02 1.821e+02 2.066e+02 2.325e+02 3.735e+02, threshold=4.133e+02, percent-clipped=0.0 2023-10-09 23:32:55,985 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=166786.66666666666, ans=0.1 2023-10-09 23:32:56,971 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=166786.66666666666, ans=0.125 2023-10-09 23:33:00,814 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=166833.33333333334, ans=0.0 2023-10-09 23:33:14,500 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=166880.0, ans=0.125 2023-10-09 23:33:17,126 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=166880.0, ans=0.0 2023-10-09 23:33:24,672 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=166926.66666666666, ans=0.0 2023-10-09 23:34:10,949 INFO [train.py:1031] (3/4) Epoch 3, batch 8500, loss[loss=0.2946, simple_loss=0.3686, pruned_loss=0.1103, over 16833.00 frames. ], tot_loss[loss=0.2652, simple_loss=0.3402, pruned_loss=0.09512, over 32355265.30 frames. ], batch size: 188, lr: 1.36e-02, grad_scale: 64.0 2023-10-09 23:34:22,071 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=167160.0, ans=0.125 2023-10-09 23:34:24,526 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=167160.0, ans=0.0 2023-10-09 23:34:26,069 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.451e+02 1.893e+02 2.174e+02 2.445e+02 4.365e+02, threshold=4.348e+02, percent-clipped=1.0 2023-10-09 23:34:40,915 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=167206.66666666666, ans=0.0 2023-10-09 23:34:52,576 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=167253.33333333334, ans=0.0 2023-10-09 23:35:21,919 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=167393.33333333334, ans=0.125 2023-10-09 23:35:27,508 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=167393.33333333334, ans=0.05 2023-10-09 23:35:30,330 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.14 vs. limit=6.0 2023-10-09 23:35:35,062 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.71 vs.
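limit=12.0

The train.py loss records print the combined objective next to its parts, and the numbers satisfy loss = 0.5 * simple_loss + pruned_loss: for batch 8500 above, 0.5 * 0.3686 + 0.1103 = 0.2946, and the running tot_loss obeys the same identity. This matches a pruned-RNN-T objective in which the "simple" (trivial-joiner) loss is down-weighted by 0.5 once past warm-up. A quick consistency check on the records above:

    # Check loss = 0.5 * simple_loss + pruned_loss for the logged values.
    records = [
        (0.2946, 0.3686, 0.1103),   # epoch 3, batch 8500, current batch
        (0.2652, 0.3402, 0.09512),  # epoch 3, batch 8500, tot_loss
    ]
    for loss, simple, pruned in records:
        assert abs(0.5 * simple + pruned - loss) < 5e-4
    print("consistent with loss = 0.5*simple_loss + pruned_loss")
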
2023-10-09 23:36:05,968 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=167533.33333333334, ans=10.0 2023-10-09 23:36:05,991 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 23:36:19,719 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=167580.0, ans=0.125 2023-10-09 23:36:20,666 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=167626.66666666666, ans=0.5 2023-10-09 23:36:26,717 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.431e+02 1.873e+02 2.050e+02 2.338e+02 3.728e+02, threshold=4.100e+02, percent-clipped=0.0 2023-10-09 23:36:30,718 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=167626.66666666666, ans=0.125 2023-10-09 23:36:41,112 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=167673.33333333334, ans=0.125 2023-10-09 23:36:54,257 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=167720.0, ans=0.09899494936611666 2023-10-09 23:37:30,505 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=167860.0, ans=0.1 2023-10-09 23:37:42,120 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=167906.66666666666, ans=0.1 2023-10-09 23:37:54,942 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=167953.33333333334, ans=0.0 2023-10-09 23:37:54,972 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=167953.33333333334, ans=0.125 2023-10-09 23:38:02,180 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=168000.0, ans=0.0 2023-10-09 23:38:04,541 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=168000.0, ans=0.0 2023-10-09 23:38:05,478 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=168000.0, ans=0.1 2023-10-09 23:38:28,455 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.431e+02 1.684e+02 1.926e+02 2.095e+02 3.028e+02, threshold=3.851e+02, percent-clipped=0.0 2023-10-09 23:38:38,525 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=168140.0, ans=0.1 2023-10-09 23:38:40,222 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=168140.0, ans=0.0 2023-10-09 23:39:03,826 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=168233.33333333334, ans=0.0 2023-10-09 23:39:05,735 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=168233.33333333334, ans=0.0 2023-10-09 23:39:12,228 INFO [scaling.py:199] (3/4) ScheduledFloat:
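name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=168280.0, ans=0.125

The balancer entries (min_positive, min_abs, max_abs, prob) describe soft constraints on per-channel activation statistics: the fraction of positive values should stay above min_positive and the mean absolute value should stay inside [min_abs, max_abs], enforced through a gradient correction applied stochastically with probability prob (itself scheduled, annealed to 0.125 here). A sketch of the statistics these constraints refer to; the gradient-side enforcement is omitted:

    import torch

    def balancer_stats(x: torch.Tensor, channel_dim: int = -1):
        # Per-channel statistics that min_positive / min_abs / max_abs
        # style balancer constraints are defined over.
        dims = tuple(d for d in range(x.ndim) if d != channel_dim % x.ndim)
        frac_positive = (x > 0).float().mean(dim=dims)
        mean_abs = x.abs().mean(dim=dims)
        return frac_positive, mean_abs

    x = torch.randn(8, 100, 512)  # (batch, time, channels)
    frac, mabs = balancer_stats(x)
    print(frac.min().item(), mabs.max().item())
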
2023-10-09 23:39:45,811 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=168373.33333333334, ans=10.0 2023-10-09 23:39:48,180 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=168373.33333333334, ans=15.0 2023-10-09 23:39:50,611 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-09 23:39:50,644 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=168420.0, ans=0.125 2023-10-09 23:40:26,865 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=168560.0, ans=0.2 2023-10-09 23:40:28,406 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.460e+02 1.684e+02 1.957e+02 2.216e+02 3.140e+02, threshold=3.914e+02, percent-clipped=0.0 2023-10-09 23:40:32,460 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.02 vs. limit=12.0 2023-10-09 23:40:36,946 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=168606.66666666666, ans=0.125 2023-10-09 23:40:52,427 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=168653.33333333334, ans=0.025 2023-10-09 23:40:53,223 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=168653.33333333334, ans=0.2 2023-10-09 23:40:59,515 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=168700.0, ans=0.0 2023-10-09 23:41:00,531 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=25.99 vs. limit=22.5 2023-10-09 23:41:41,533 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.89 vs. limit=22.5 2023-10-09 23:41:47,196 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=168933.33333333334, ans=0.0 2023-10-09 23:41:56,013 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=168933.33333333334, ans=0.125 2023-10-09 23:42:17,033 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.468e+02 1.841e+02 2.061e+02 2.419e+02 3.298e+02, threshold=4.122e+02, percent-clipped=0.0 2023-10-09 23:42:20,029 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.62 vs. limit=15.0 2023-10-09 23:42:20,078 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=1.81 vs.
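limit=15.0

The bypass.scale_min and bypass.skip_rate entries above suggest a gated residual around each layer, out = x + s * (layer(x) - x), where the learned per-channel gate s is clamped to at least scale_min (scheduled; annealed to 0.2 here) and the whole layer is occasionally skipped with probability skip_rate during training. A sketch under those assumptions, leaving out the stochastic skipping:

    import torch

    class Bypass(torch.nn.Module):
        # Gated residual: out = x + s * (y - x), with the per-channel
        # gate s clamped to [scale_min, 1.0]; scale_min is the scheduled
        # quantity in the bypass.scale_min entries above.
        def __init__(self, num_channels: int, scale_min: float = 0.2):
            super().__init__()
            self.scale = torch.nn.Parameter(torch.full((num_channels,), 0.5))
            self.scale_min = scale_min

        def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
            s = self.scale.clamp(min=self.scale_min, max=1.0)
            return x + s * (y - x)

    m = Bypass(256)
    x, y = torch.randn(4, 256), torch.randn(4, 256)
    print(m(x, y).shape)  # torch.Size([4, 256])
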
2023-10-09 23:42:29,165 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=169073.33333333334, ans=0.1 2023-10-09 23:42:31,474 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=169120.0, ans=0.2 2023-10-09 23:42:32,547 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten.whitening_limit, batch_count=169120.0, ans=15.0 2023-10-09 23:43:01,870 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=169213.33333333334, ans=0.125 2023-10-09 23:43:11,400 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=169260.0, ans=0.1 2023-10-09 23:43:17,230 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.20 vs. limit=15.0 2023-10-09 23:43:18,300 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.91 vs. limit=15.0 2023-10-09 23:43:19,117 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=169306.66666666666, ans=0.0 2023-10-09 23:43:23,246 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=169306.66666666666, ans=0.125 2023-10-09 23:43:48,861 INFO [train.py:1031] (3/4) Epoch 3, batch 9000, loss[loss=0.2607, simple_loss=0.3463, pruned_loss=0.08758, over 16810.00 frames. ], tot_loss[loss=0.2636, simple_loss=0.339, pruned_loss=0.09412, over 32481359.00 frames. ], batch size: 146, lr: 1.35e-02, grad_scale: 32.0 2023-10-09 23:44:05,160 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=169493.33333333334, ans=0.0 2023-10-09 23:44:06,286 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.609e+02 1.971e+02 2.282e+02 2.661e+02 3.968e+02, threshold=4.563e+02, percent-clipped=0.0 2023-10-09 23:44:07,984 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.00 vs. limit=15.0 2023-10-09 23:44:21,577 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=169586.66666666666, ans=0.125 2023-10-09 23:44:48,732 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=169680.0, ans=0.125 2023-10-09 23:44:54,668 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 23:45:01,444 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=169726.66666666666, ans=0.125 2023-10-09 23:45:01,461 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=169726.66666666666, ans=0.125 2023-10-09 23:45:03,612 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.15 vs.
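limit=10.0

The WithLoss entries track auxiliary penalties attached directly to intermediate tensors (attention weights in the records here): the tensor passes through unchanged in forward, the penalty's gradient is injected during backward, and the accumulated penalty value is what is printed as loss-sum (0.000e+00 above, so the penalty is apparently inactive at this point). A sketch of that mechanism with a placeholder penalty; the actual penalty applied to attention weights is not visible in the log:

    import torch

    class WithAuxLoss(torch.autograd.Function):
        loss_sum = 0.0  # accumulated for logging, like loss-sum above

        @staticmethod
        def forward(ctx, x, aux_scale: float):
            ctx.save_for_backward(x)
            ctx.aux_scale = aux_scale
            return x.view_as(x)  # identity, with its own autograd node

        @staticmethod
        def backward(ctx, grad_out):
            (x,) = ctx.saved_tensors
            with torch.enable_grad():
                xd = x.detach().requires_grad_(True)
                aux = ctx.aux_scale * xd.pow(2).mean()  # placeholder penalty
                (g,) = torch.autograd.grad(aux, xd)
            WithAuxLoss.loss_sum += aux.item()
            return grad_out + g, None

    x = torch.randn(4, 8, requires_grad=True)
    WithAuxLoss.apply(x, 0.1).sum().backward()
    print(WithAuxLoss.loss_sum)
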
2023-10-09 23:45:06,340 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=169773.33333333334, ans=0.125 2023-10-09 23:45:33,060 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.06 vs. limit=6.0 2023-10-09 23:45:45,603 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.30 vs. limit=15.0 2023-10-09 23:45:51,496 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.510e+02 1.959e+02 2.175e+02 2.461e+02 2.986e+02, threshold=4.350e+02, percent-clipped=0.0 2023-10-09 23:46:00,128 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=170006.66666666666, ans=0.125 2023-10-09 23:46:00,199 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=170006.66666666666, ans=0.125 2023-10-09 23:46:01,383 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=170006.66666666666, ans=0.125 2023-10-09 23:46:02,381 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=170006.66666666666, ans=0.0 2023-10-09 23:46:15,199 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.81 vs. limit=10.0 2023-10-09 23:46:25,776 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=170100.0, ans=0.125 2023-10-09 23:46:48,955 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=170240.0, ans=0.125 2023-10-09 23:46:52,687 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=170240.0, ans=0.2 2023-10-09 23:47:21,931 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=170380.0, ans=0.125 2023-10-09 23:47:27,039 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=170380.0, ans=0.015 2023-10-09 23:47:38,309 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.521e+02 1.872e+02 2.156e+02 2.542e+02 3.313e+02, threshold=4.312e+02, percent-clipped=0.0 2023-10-09 23:47:40,704 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=170426.66666666666, ans=0.0 2023-10-09 23:48:07,992 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=170566.66666666666, ans=0.035 2023-10-09 23:48:14,465 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=170613.33333333334, ans=0.125 2023-10-09 23:48:24,878 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.70 vs.
limit=22.5 2023-10-09 23:48:32,802 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=170660.0, ans=0.1 2023-10-09 23:48:36,585 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=170706.66666666666, ans=0.0 2023-10-09 23:48:40,146 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=170706.66666666666, ans=0.0 2023-10-09 23:48:45,867 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.76 vs. limit=10.0 2023-10-09 23:48:49,928 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=170753.33333333334, ans=0.0 2023-10-09 23:49:08,348 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.34 vs. limit=15.0 2023-10-09 23:49:26,067 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.518e+02 1.845e+02 2.034e+02 2.324e+02 3.280e+02, threshold=4.068e+02, percent-clipped=0.0 2023-10-09 23:49:29,116 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=170893.33333333334, ans=0.07 2023-10-09 23:49:37,254 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.99 vs. limit=10.0 2023-10-09 23:50:03,136 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.40 vs. limit=15.0 2023-10-09 23:50:27,163 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=171126.66666666666, ans=0.1 2023-10-09 23:50:30,546 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=171126.66666666666, ans=0.125 2023-10-09 23:50:33,652 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=171126.66666666666, ans=0.0 2023-10-09 23:50:35,406 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 23:50:52,216 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=171220.0, ans=0.0 2023-10-09 23:51:04,680 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=171266.66666666666, ans=0.0 2023-10-09 23:51:14,098 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=171313.33333333334, ans=0.05 2023-10-09 23:51:31,300 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.425e+02 1.888e+02 2.123e+02 2.427e+02 4.332e+02, threshold=4.245e+02, percent-clipped=1.0 2023-10-09 23:51:35,434 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.25 vs. 
limit=15.0 2023-10-09 23:52:01,024 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=171500.0, ans=0.1 2023-10-09 23:52:14,756 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=171546.66666666666, ans=0.2 2023-10-09 23:52:15,715 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=171546.66666666666, ans=0.025 2023-10-09 23:52:24,610 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=171593.33333333334, ans=0.125 2023-10-09 23:52:34,964 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=171640.0, ans=0.035 2023-10-09 23:52:36,139 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=171640.0, ans=0.125 2023-10-09 23:52:45,730 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=171686.66666666666, ans=0.1 2023-10-09 23:53:02,889 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=171733.33333333334, ans=0.2 2023-10-09 23:53:07,670 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=171733.33333333334, ans=0.0 2023-10-09 23:53:09,419 INFO [train.py:1031] (3/4) Epoch 3, batch 9500, loss[loss=0.258, simple_loss=0.3368, pruned_loss=0.08963, over 16319.00 frames. ], tot_loss[loss=0.264, simple_loss=0.3395, pruned_loss=0.09429, over 32535730.98 frames. ], batch size: 50, lr: 1.34e-02, grad_scale: 32.0 2023-10-09 23:53:22,487 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-09 23:53:22,522 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=171826.66666666666, ans=0.125 2023-10-09 23:53:28,035 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.463e+02 1.802e+02 2.025e+02 2.317e+02 4.130e+02, threshold=4.049e+02, percent-clipped=0.0 2023-10-09 23:53:59,315 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=171966.66666666666, ans=0.125 2023-10-09 23:54:12,021 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=172013.33333333334, ans=0.0 2023-10-09 23:54:30,511 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=172106.66666666666, ans=0.2 2023-10-09 23:54:51,451 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=172200.0, ans=0.125 2023-10-09 23:55:01,035 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=172246.66666666666, ans=0.0 2023-10-09 23:55:09,398 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.21 vs. 
limit=15.0 2023-10-09 23:55:20,095 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.486e+02 1.828e+02 2.016e+02 2.337e+02 3.369e+02, threshold=4.031e+02, percent-clipped=0.0 2023-10-09 23:55:25,368 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=172340.0, ans=0.125 2023-10-09 23:55:32,229 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=172340.0, ans=0.0 2023-10-09 23:55:47,211 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=172433.33333333334, ans=0.125 2023-10-09 23:55:47,436 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.07 vs. limit=22.5 2023-10-09 23:55:56,362 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=172433.33333333334, ans=0.125 2023-10-09 23:55:56,422 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=172433.33333333334, ans=0.125 2023-10-09 23:56:04,323 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=172480.0, ans=0.1 2023-10-09 23:56:22,729 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=172573.33333333334, ans=0.1 2023-10-09 23:56:24,974 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=172573.33333333334, ans=0.0 2023-10-09 23:56:35,385 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=172620.0, ans=0.1 2023-10-09 23:56:52,090 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=172666.66666666666, ans=0.0 2023-10-09 23:57:09,500 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=172760.0, ans=0.125 2023-10-09 23:57:09,651 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.62 vs. limit=15.0 2023-10-09 23:57:13,374 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.354e+02 1.768e+02 1.962e+02 2.260e+02 3.269e+02, threshold=3.923e+02, percent-clipped=0.0 2023-10-09 23:57:24,242 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=172806.66666666666, ans=0.125 2023-10-09 23:57:28,020 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.52 vs. limit=15.0 2023-10-09 23:57:31,305 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=12.19 vs. 
limit=15.0 2023-10-09 23:57:52,087 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=172946.66666666666, ans=0.1 2023-10-09 23:57:57,004 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=172946.66666666666, ans=0.07 2023-10-09 23:58:06,901 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=172993.33333333334, ans=0.0 2023-10-09 23:58:14,484 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.84 vs. limit=6.0 2023-10-09 23:58:17,233 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=173040.0, ans=0.0 2023-10-09 23:58:21,408 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=173040.0, ans=0.0 2023-10-09 23:58:22,765 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=173040.0, ans=0.0 2023-10-09 23:58:31,196 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=173086.66666666666, ans=0.05 2023-10-09 23:59:06,364 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.456e+02 1.894e+02 2.152e+02 2.500e+02 3.510e+02, threshold=4.303e+02, percent-clipped=0.0 2023-10-09 23:59:22,197 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=173320.0, ans=0.125 2023-10-09 23:59:29,988 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.79 vs. limit=10.0 2023-10-09 23:59:32,635 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=173366.66666666666, ans=0.125 2023-10-09 23:59:36,126 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=173366.66666666666, ans=0.1 2023-10-09 23:59:49,425 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=173413.33333333334, ans=6.0 2023-10-10 00:00:14,953 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=173553.33333333334, ans=0.0 2023-10-10 00:00:42,200 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.97 vs. 
limit=15.0 2023-10-10 00:00:55,168 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.496e+02 1.840e+02 1.963e+02 2.343e+02 3.564e+02, threshold=3.926e+02, percent-clipped=0.0 2023-10-10 00:01:06,398 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=173740.0, ans=0.0 2023-10-10 00:01:12,864 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=173786.66666666666, ans=0.1 2023-10-10 00:01:15,578 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=173786.66666666666, ans=0.1 2023-10-10 00:01:21,398 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=173833.33333333334, ans=0.125 2023-10-10 00:01:25,303 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=173833.33333333334, ans=0.035 2023-10-10 00:01:39,726 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.60 vs. limit=15.0 2023-10-10 00:01:55,691 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=173973.33333333334, ans=0.2 2023-10-10 00:02:00,818 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=173973.33333333334, ans=0.0 2023-10-10 00:02:24,843 INFO [train.py:1031] (3/4) Epoch 3, batch 10000, loss[loss=0.3044, simple_loss=0.3682, pruned_loss=0.1203, over 16573.00 frames. ], tot_loss[loss=0.2621, simple_loss=0.3378, pruned_loss=0.09321, over 32600877.71 frames. ], batch size: 219, lr: 1.34e-02, grad_scale: 32.0 2023-10-10 00:02:41,987 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.469e+02 1.804e+02 1.964e+02 2.245e+02 3.034e+02, threshold=3.928e+02, percent-clipped=0.0 2023-10-10 00:02:56,021 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=174253.33333333334, ans=0.2 2023-10-10 00:03:05,655 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=174253.33333333334, ans=0.125 2023-10-10 00:03:07,288 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=174300.0, ans=0.125 2023-10-10 00:03:08,212 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=174300.0, ans=0.2 2023-10-10 00:03:13,901 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=174300.0, ans=0.2 2023-10-10 00:03:13,919 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=174300.0, ans=0.0 2023-10-10 00:03:15,119 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.77 vs. limit=22.5 2023-10-10 00:03:38,248 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=4.66 vs. 
limit=15.0 2023-10-10 00:03:49,140 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.88 vs. limit=15.0 2023-10-10 00:03:52,529 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=174440.0, ans=0.05 2023-10-10 00:04:01,741 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=174486.66666666666, ans=0.125 2023-10-10 00:04:20,089 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=1.95 vs. limit=15.0 2023-10-10 00:04:31,807 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=174626.66666666666, ans=0.1 2023-10-10 00:04:35,004 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.581e+02 1.897e+02 2.043e+02 2.355e+02 3.801e+02, threshold=4.086e+02, percent-clipped=0.0 2023-10-10 00:04:54,801 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=174720.0, ans=0.125 2023-10-10 00:05:02,194 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=174766.66666666666, ans=0.0 2023-10-10 00:05:07,603 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=174766.66666666666, ans=0.05 2023-10-10 00:05:23,135 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=174860.0, ans=0.1 2023-10-10 00:05:28,648 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=12.93 vs. limit=22.5 2023-10-10 00:05:33,184 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.53 vs. limit=22.5 2023-10-10 00:05:48,412 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.43 vs. limit=15.0 2023-10-10 00:05:57,533 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=175000.0, ans=0.07 2023-10-10 00:06:27,963 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.444e+02 1.821e+02 2.000e+02 2.240e+02 2.944e+02, threshold=4.000e+02, percent-clipped=0.0 2023-10-10 00:06:29,502 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=175093.33333333334, ans=0.125 2023-10-10 00:06:40,576 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.44 vs. limit=15.0 2023-10-10 00:07:06,199 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.72 vs. 
limit=15.0 2023-10-10 00:07:06,873 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=175280.0, ans=0.035 2023-10-10 00:07:14,896 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.54 vs. limit=15.0 2023-10-10 00:08:10,638 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=175513.33333333334, ans=0.2 2023-10-10 00:08:14,604 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=175560.0, ans=0.0 2023-10-10 00:08:21,755 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.444e+02 1.857e+02 2.268e+02 2.697e+02 4.178e+02, threshold=4.535e+02, percent-clipped=1.0 2023-10-10 00:08:32,270 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.40 vs. limit=22.5 2023-10-10 00:08:33,101 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=175606.66666666666, ans=10.0 2023-10-10 00:08:51,080 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=175700.0, ans=0.2 2023-10-10 00:08:59,002 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=175700.0, ans=0.0 2023-10-10 00:09:21,523 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.52 vs. limit=22.5 2023-10-10 00:09:23,444 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=175840.0, ans=0.09899494936611666 2023-10-10 00:09:27,987 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=175840.0, ans=0.1 2023-10-10 00:09:46,526 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=175886.66666666666, ans=0.1 2023-10-10 00:09:49,366 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.49 vs. limit=22.5 2023-10-10 00:10:07,377 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=175980.0, ans=0.2 2023-10-10 00:10:18,113 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.464e+02 1.761e+02 1.942e+02 2.207e+02 3.961e+02, threshold=3.883e+02, percent-clipped=0.0 2023-10-10 00:10:41,537 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=176120.0, ans=0.125 2023-10-10 00:11:41,722 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=176353.33333333334, ans=0.0 2023-10-10 00:11:43,263 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=176353.33333333334, ans=0.0 2023-10-10 00:11:54,425 INFO [train.py:1031] (3/4) Epoch 3, batch 10500, loss[loss=0.2491, simple_loss=0.3349, pruned_loss=0.08166, over 16917.00 frames. ], tot_loss[loss=0.2622, simple_loss=0.3381, pruned_loss=0.09319, over 32670444.83 frames. 
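], batch size: 130, lr: 1.33e-02, grad_scale: 32.0

Note that tot_loss is not an epoch total: its frame count stays near 32-33M across thousands of batches (32,355,265 at batch 8500, 32,670,444 by batch 10500) even though every step adds roughly 16-17k frames. That is the fixed point of a decayed running sum with decay of about 1 - 16k/32.5M, i.e. roughly 0.9995, so tot_loss behaves like an exponentially weighted average over the last couple of thousand batches. A sketch, with the decay constant inferred from the log rather than read from any config:

    # Decayed running aggregate whose steady-state frame count matches
    # the ~32.5M frames printed for tot_loss (assumes ~16k frames/batch
    # and decay=0.9995; both inferred from the log).
    class RunningLoss:
        def __init__(self, decay=0.9995):
            self.decay = decay
            self.weighted_loss = 0.0
            self.frames = 0.0

        def update(self, batch_loss, batch_frames):
            self.weighted_loss = (self.decay * self.weighted_loss
                                  + batch_loss * batch_frames)
            self.frames = self.decay * self.frames + batch_frames

        @property
        def avg(self):
            return self.weighted_loss / max(self.frames, 1.0)

    r = RunningLoss()
    for _ in range(20000):
        r.update(0.26, 16000.0)
    print(round(r.frames))  # ~32,000,000, the scale seen in the log
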
2023-10-10 00:12:00,863 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 00:12:10,825 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.383e+02 1.809e+02 1.998e+02 2.329e+02 3.126e+02, threshold=3.996e+02, percent-clipped=0.0 2023-10-10 00:12:11,208 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=176493.33333333334, ans=0.0 2023-10-10 00:12:11,514 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.31 vs. limit=10.0 2023-10-10 00:12:14,725 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.55 vs. limit=15.0 2023-10-10 00:12:16,998 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=176540.0, ans=0.0 2023-10-10 00:12:18,860 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=176540.0, ans=0.125 2023-10-10 00:12:25,713 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.83 vs. limit=10.0 2023-10-10 00:12:31,285 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=176586.66666666666, ans=0.125 2023-10-10 00:12:31,995 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=176586.66666666666, ans=0.0 2023-10-10 00:12:57,785 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=176726.66666666666, ans=0.2 2023-10-10 00:13:10,719 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=176773.33333333334, ans=0.125 2023-10-10 00:13:22,160 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.82 vs. limit=15.0 2023-10-10 00:13:29,897 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.20 vs. limit=15.0 2023-10-10 00:13:32,363 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.96 vs. limit=15.0 2023-10-10 00:13:43,990 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=176866.66666666666, ans=0.125 2023-10-10 00:14:09,374 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.522e+02 1.904e+02 2.144e+02 2.407e+02 3.455e+02, threshold=4.288e+02, percent-clipped=0.0 2023-10-10 00:14:17,672 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.25 vs.
limit=15.0 2023-10-10 00:14:21,907 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=177006.66666666666, ans=0.125 2023-10-10 00:14:32,581 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=177053.33333333334, ans=0.125 2023-10-10 00:14:33,992 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.11 vs. limit=15.0 2023-10-10 00:15:03,766 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=177193.33333333334, ans=0.125 2023-10-10 00:15:21,083 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.21 vs. limit=15.0 2023-10-10 00:15:32,859 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=177286.66666666666, ans=0.2 2023-10-10 00:16:04,484 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.392e+02 1.757e+02 1.987e+02 2.332e+02 3.150e+02, threshold=3.973e+02, percent-clipped=0.0 2023-10-10 00:16:05,716 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=177426.66666666666, ans=0.1 2023-10-10 00:16:09,008 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=177473.33333333334, ans=0.125 2023-10-10 00:16:23,803 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.87 vs. limit=15.0 2023-10-10 00:16:42,889 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=177613.33333333334, ans=0.09899494936611666 2023-10-10 00:16:45,196 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.85 vs. limit=15.0 2023-10-10 00:16:45,238 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.06 vs. 
limit=22.5 2023-10-10 00:17:06,681 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=177706.66666666666, ans=0.0 2023-10-10 00:17:16,213 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=177753.33333333334, ans=0.0 2023-10-10 00:17:16,948 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=177753.33333333334, ans=0.125 2023-10-10 00:17:42,508 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=177846.66666666666, ans=0.2 2023-10-10 00:17:47,043 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=177893.33333333334, ans=0.0 2023-10-10 00:17:47,954 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=177893.33333333334, ans=0.125 2023-10-10 00:17:49,559 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=177893.33333333334, ans=0.125 2023-10-10 00:17:54,210 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.559e+02 1.948e+02 2.104e+02 2.512e+02 3.303e+02, threshold=4.209e+02, percent-clipped=0.0 2023-10-10 00:17:55,315 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.95 vs. limit=22.5 2023-10-10 00:18:32,955 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.11 vs. limit=15.0 2023-10-10 00:18:46,268 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 00:18:54,413 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=178173.33333333334, ans=0.1 2023-10-10 00:19:21,743 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.07 vs. limit=22.5 2023-10-10 00:19:32,744 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=178313.33333333334, ans=0.0 2023-10-10 00:19:43,288 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.557e+02 1.870e+02 2.079e+02 2.308e+02 3.254e+02, threshold=4.157e+02, percent-clipped=0.0 2023-10-10 00:19:44,645 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=178360.0, ans=0.125 2023-10-10 00:19:45,703 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=17.07 vs. 
limit=15.0 2023-10-10 00:20:16,802 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=178500.0, ans=0.0 2023-10-10 00:20:27,571 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=178546.66666666666, ans=0.0 2023-10-10 00:20:47,903 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=178640.0, ans=0.125 2023-10-10 00:21:14,360 INFO [train.py:1031] (3/4) Epoch 3, batch 11000, loss[loss=0.2765, simple_loss=0.362, pruned_loss=0.09549, over 16804.00 frames. ], tot_loss[loss=0.2618, simple_loss=0.3377, pruned_loss=0.09297, over 32697104.39 frames. ], batch size: 175, lr: 1.32e-02, grad_scale: 64.0 2023-10-10 00:21:31,639 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.543e+02 1.973e+02 2.340e+02 2.706e+02 3.666e+02, threshold=4.680e+02, percent-clipped=0.0 2023-10-10 00:21:32,955 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=178826.66666666666, ans=0.2 2023-10-10 00:21:57,788 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=178966.66666666666, ans=0.125 2023-10-10 00:22:14,396 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=179013.33333333334, ans=0.0 2023-10-10 00:22:50,626 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=179153.33333333334, ans=0.2 2023-10-10 00:23:00,197 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=179200.0, ans=0.125 2023-10-10 00:23:01,148 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.27 vs. limit=15.0 2023-10-10 00:23:02,006 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=179200.0, ans=0.125 2023-10-10 00:23:06,784 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=179246.66666666666, ans=0.1 2023-10-10 00:23:31,824 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.385e+02 1.742e+02 1.981e+02 2.277e+02 4.044e+02, threshold=3.962e+02, percent-clipped=0.0 2023-10-10 00:23:35,851 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 00:23:37,112 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=179340.0, ans=0.1 2023-10-10 00:23:43,654 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 00:23:53,745 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=179386.66666666666, ans=0.0 2023-10-10 00:23:54,743 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=179386.66666666666, ans=0.0 2023-10-10 00:24:13,409 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.02 vs. 
limit=15.0 2023-10-10 00:24:19,325 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=179480.0, ans=0.125 2023-10-10 00:24:21,998 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=179526.66666666666, ans=0.0 2023-10-10 00:24:26,953 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=179526.66666666666, ans=0.015 2023-10-10 00:24:28,451 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=13.36 vs. limit=15.0 2023-10-10 00:25:28,989 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.353e+02 1.768e+02 1.925e+02 2.322e+02 4.129e+02, threshold=3.850e+02, percent-clipped=1.0 2023-10-10 00:25:47,167 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.20 vs. limit=22.5 2023-10-10 00:25:51,181 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=179853.33333333334, ans=0.125 2023-10-10 00:26:12,664 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.81 vs. limit=15.0 2023-10-10 00:26:19,289 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=179993.33333333334, ans=0.1 2023-10-10 00:26:33,391 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=180040.0, ans=0.125 2023-10-10 00:26:33,498 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=180040.0, ans=0.1 2023-10-10 00:26:35,041 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.86 vs. limit=15.0 2023-10-10 00:26:36,342 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.82 vs. limit=22.5 2023-10-10 00:27:24,976 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.497e+02 1.872e+02 2.178e+02 2.398e+02 3.467e+02, threshold=4.356e+02, percent-clipped=0.0 2023-10-10 00:27:25,215 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=180226.66666666666, ans=0.0 2023-10-10 00:27:30,260 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 00:27:49,069 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=180366.66666666666, ans=0.125 2023-10-10 00:27:58,743 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.78 vs. 
limit=12.0 2023-10-10 00:28:00,914 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=180413.33333333334, ans=0.125 2023-10-10 00:28:05,419 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=180413.33333333334, ans=0.2 2023-10-10 00:28:16,733 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=180460.0, ans=0.125 2023-10-10 00:28:20,668 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=180460.0, ans=0.125 2023-10-10 00:28:21,590 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=180460.0, ans=0.2 2023-10-10 00:28:33,487 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=180553.33333333334, ans=0.04949747468305833 2023-10-10 00:28:46,815 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 00:29:14,156 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=180693.33333333334, ans=0.125 2023-10-10 00:29:15,866 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.473e+02 1.887e+02 2.152e+02 2.473e+02 4.069e+02, threshold=4.303e+02, percent-clipped=0.0 2023-10-10 00:29:17,720 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=180740.0, ans=0.125 2023-10-10 00:29:18,980 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.19 vs. limit=15.0 2023-10-10 00:29:28,019 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=180786.66666666666, ans=0.125 2023-10-10 00:29:35,554 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=180786.66666666666, ans=0.0 2023-10-10 00:29:57,529 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=180880.0, ans=0.125 2023-10-10 00:29:57,757 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=4.48 vs. limit=12.0 2023-10-10 00:30:15,483 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.15 vs. limit=22.5 2023-10-10 00:30:22,403 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=180973.33333333334, ans=0.125 2023-10-10 00:30:38,697 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.22 vs. limit=22.5 2023-10-10 00:30:44,693 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.64 vs. limit=15.0 2023-10-10 00:30:49,442 INFO [train.py:1031] (3/4) Epoch 3, batch 11500, loss[loss=0.2779, simple_loss=0.3551, pruned_loss=0.1003, over 16605.00 frames. ], tot_loss[loss=0.2611, simple_loss=0.3372, pruned_loss=0.09255, over 32742179.29 frames. 
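
The milestone record just above reports three losses per batch. Throughout this log the totals are consistent with a pruned-transducer objective of the form tot_loss = simple_loss_scale * simple_loss + pruned_loss with a scale of 0.5: for batch 11500, 0.5 * 0.3372 + 0.09255 ≈ 0.2611, matching the logged value within rounding. A minimal sketch of that combination rule; the 0.5 weight is inferred from the logged numbers and the function name is illustrative, not taken from train.py:

    # Sketch of the loss combination implied by the records above. The
    # "simple" loss comes from the cheap linear lattice, the "pruned" loss
    # from the pruned full lattice; the total weights the simple term down.
    def combine_transducer_losses(simple_loss: float,
                                  pruned_loss: float,
                                  simple_loss_scale: float = 0.5) -> float:
        # tot_loss = simple_loss_scale * simple_loss + pruned_loss
        return simple_loss_scale * simple_loss + pruned_loss

    # Reproduces the "Epoch 3, batch 11500" total within rounding:
    assert abs(combine_transducer_losses(0.3372, 0.09255) - 0.2611) < 1e-3

The same check holds for the other milestone records in this section, so the relation appears stable well past warm-up.
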
], batch size: 241, lr: 1.31e-02, grad_scale: 16.0 2023-10-10 00:30:52,856 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.11 vs. limit=6.0 2023-10-10 00:31:08,979 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.589e+02 1.859e+02 2.091e+02 2.411e+02 3.307e+02, threshold=4.181e+02, percent-clipped=0.0 2023-10-10 00:31:30,225 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=181300.0, ans=0.125 2023-10-10 00:31:45,623 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=181346.66666666666, ans=0.125 2023-10-10 00:31:48,013 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=181346.66666666666, ans=0.2 2023-10-10 00:32:36,458 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.50 vs. limit=15.0 2023-10-10 00:32:38,908 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.30 vs. limit=15.0 2023-10-10 00:32:40,019 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.44 vs. limit=15.0 2023-10-10 00:32:41,233 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=181533.33333333334, ans=0.0 2023-10-10 00:32:58,287 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=181626.66666666666, ans=0.125 2023-10-10 00:33:05,306 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.501e+02 1.861e+02 2.064e+02 2.387e+02 2.869e+02, threshold=4.128e+02, percent-clipped=0.0 2023-10-10 00:33:11,219 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.03 vs. limit=15.0 2023-10-10 00:33:13,644 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=181673.33333333334, ans=0.125 2023-10-10 00:33:21,154 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=181720.0, ans=0.0 2023-10-10 00:33:52,825 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=181860.0, ans=0.125 2023-10-10 00:34:05,891 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_abs, batch_count=181906.66666666666, ans=0.5 2023-10-10 00:34:09,377 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=181906.66666666666, ans=0.125 2023-10-10 00:34:11,336 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=181953.33333333334, ans=0.2 2023-10-10 00:34:21,001 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.52 vs. 
limit=22.5 2023-10-10 00:34:24,370 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=182000.0, ans=0.0 2023-10-10 00:34:30,021 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=182000.0, ans=15.0 2023-10-10 00:34:51,115 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=182093.33333333334, ans=0.125 2023-10-10 00:34:52,506 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.400e+02 1.865e+02 2.145e+02 2.419e+02 3.775e+02, threshold=4.290e+02, percent-clipped=0.0 2023-10-10 00:34:59,161 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.24 vs. limit=22.5 2023-10-10 00:35:09,669 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 00:35:12,796 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=182186.66666666666, ans=0.2 2023-10-10 00:35:36,464 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=182280.0, ans=0.125 2023-10-10 00:35:44,516 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=182326.66666666666, ans=0.0 2023-10-10 00:35:46,552 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.66 vs. limit=15.0 2023-10-10 00:36:07,768 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=182373.33333333334, ans=0.125 2023-10-10 00:36:34,179 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.18 vs. 
limit=15.0 2023-10-10 00:36:42,909 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=182513.33333333334, ans=0.125 2023-10-10 00:36:43,748 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=182560.0, ans=0.125 2023-10-10 00:36:53,631 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.393e+02 1.762e+02 1.933e+02 2.168e+02 3.098e+02, threshold=3.866e+02, percent-clipped=0.0 2023-10-10 00:36:54,192 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=182606.66666666666, ans=0.0 2023-10-10 00:36:55,878 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=182606.66666666666, ans=0.0 2023-10-10 00:36:56,982 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=182606.66666666666, ans=0.125 2023-10-10 00:37:06,823 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=182653.33333333334, ans=0.04949747468305833 2023-10-10 00:37:12,606 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=182653.33333333334, ans=0.1 2023-10-10 00:37:22,555 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=182700.0, ans=0.0 2023-10-10 00:37:23,326 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=182700.0, ans=0.125 2023-10-10 00:37:34,218 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=182746.66666666666, ans=0.1 2023-10-10 00:37:40,030 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=182793.33333333334, ans=0.125 2023-10-10 00:37:40,933 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=182793.33333333334, ans=0.125 2023-10-10 00:37:40,949 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=182793.33333333334, ans=0.1 2023-10-10 00:37:50,745 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 00:37:54,512 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=182840.0, ans=0.125 2023-10-10 00:38:04,888 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=182886.66666666666, ans=0.1 2023-10-10 00:38:14,278 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=182886.66666666666, ans=0.0 2023-10-10 00:38:25,624 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=182933.33333333334, ans=0.125 2023-10-10 00:38:29,017 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=182980.0, ans=0.1 2023-10-10 00:38:48,936 INFO [optim.py:471] 
(3/4) Clipping_scale=2.0, grad-norm quartiles 1.398e+02 1.893e+02 2.048e+02 2.378e+02 3.433e+02, threshold=4.095e+02, percent-clipped=0.0 2023-10-10 00:39:19,645 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=183166.66666666666, ans=0.125 2023-10-10 00:39:32,505 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=183213.33333333334, ans=0.125 2023-10-10 00:39:32,607 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.68 vs. limit=22.5 2023-10-10 00:39:35,291 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=183213.33333333334, ans=0.1 2023-10-10 00:39:46,860 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=183306.66666666666, ans=0.0 2023-10-10 00:39:49,688 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=183306.66666666666, ans=0.1 2023-10-10 00:40:19,590 INFO [train.py:1031] (3/4) Epoch 3, batch 12000, loss[loss=0.23, simple_loss=0.3107, pruned_loss=0.07464, over 15586.00 frames. ], tot_loss[loss=0.2602, simple_loss=0.3366, pruned_loss=0.09187, over 32739377.50 frames. ], batch size: 35, lr: 1.30e-02, grad_scale: 32.0 2023-10-10 00:40:28,438 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=183446.66666666666, ans=0.0 2023-10-10 00:40:38,700 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=183493.33333333334, ans=0.1 2023-10-10 00:40:41,814 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.544e+02 1.944e+02 2.254e+02 2.569e+02 3.650e+02, threshold=4.508e+02, percent-clipped=0.0 2023-10-10 00:41:01,342 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.75 vs. limit=22.5 2023-10-10 00:41:09,580 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=183633.33333333334, ans=0.0 2023-10-10 00:41:25,117 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=183680.0, ans=0.015 2023-10-10 00:41:26,216 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=183680.0, ans=0.0 2023-10-10 00:41:26,277 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=183680.0, ans=0.0 2023-10-10 00:41:27,371 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=183680.0, ans=0.2 2023-10-10 00:41:50,569 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.10 vs. 
limit=15.0 2023-10-10 00:41:59,616 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=183820.0, ans=0.0 2023-10-10 00:42:16,578 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=183913.33333333334, ans=0.125 2023-10-10 00:42:29,037 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.43 vs. limit=22.5 2023-10-10 00:42:32,311 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=183960.0, ans=0.0 2023-10-10 00:42:33,744 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.419e+02 1.775e+02 1.941e+02 2.163e+02 3.202e+02, threshold=3.881e+02, percent-clipped=0.0 2023-10-10 00:42:34,090 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=184006.66666666666, ans=0.0 2023-10-10 00:42:40,744 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.92 vs. limit=15.0 2023-10-10 00:42:44,537 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=184053.33333333334, ans=0.1 2023-10-10 00:42:56,909 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=184100.0, ans=0.2 2023-10-10 00:43:31,267 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=184240.0, ans=0.125 2023-10-10 00:43:40,776 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=184286.66666666666, ans=0.125 2023-10-10 00:43:54,533 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.90 vs. limit=10.0 2023-10-10 00:44:01,074 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=184380.0, ans=0.125 2023-10-10 00:44:01,117 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=184380.0, ans=0.1 2023-10-10 00:44:18,594 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.476e+02 1.994e+02 2.215e+02 2.724e+02 3.801e+02, threshold=4.431e+02, percent-clipped=0.0 2023-10-10 00:44:19,348 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=184473.33333333334, ans=0.1 2023-10-10 00:44:34,801 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=184520.0, ans=0.0 2023-10-10 00:44:35,701 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=184520.0, ans=0.1 2023-10-10 00:44:55,561 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.42 vs. 
limit=15.0 2023-10-10 00:45:14,329 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=184706.66666666666, ans=0.125 2023-10-10 00:45:24,506 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=184753.33333333334, ans=0.2 2023-10-10 00:45:24,883 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.60 vs. limit=15.0 2023-10-10 00:45:32,679 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=184753.33333333334, ans=0.125 2023-10-10 00:45:39,931 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=184800.0, ans=0.0 2023-10-10 00:45:49,631 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=184846.66666666666, ans=0.125 2023-10-10 00:45:50,644 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=184846.66666666666, ans=0.125 2023-10-10 00:45:55,997 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=184846.66666666666, ans=0.125 2023-10-10 00:45:59,579 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=184893.33333333334, ans=0.125 2023-10-10 00:46:02,665 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=184893.33333333334, ans=0.125 2023-10-10 00:46:03,480 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=184893.33333333334, ans=0.125 2023-10-10 00:46:07,814 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.49 vs. 
limit=15.0 2023-10-10 00:46:08,937 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.414e+02 1.796e+02 2.074e+02 2.351e+02 3.276e+02, threshold=4.149e+02, percent-clipped=0.0 2023-10-10 00:46:10,245 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=184940.0, ans=0.1 2023-10-10 00:46:30,087 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=184986.66666666666, ans=0.125 2023-10-10 00:46:59,785 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=185126.66666666666, ans=0.125 2023-10-10 00:47:10,340 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 00:47:33,233 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=185266.66666666666, ans=0.2 2023-10-10 00:47:45,606 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=185313.33333333334, ans=0.0 2023-10-10 00:48:01,043 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=185360.0, ans=0.125 2023-10-10 00:48:01,159 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=185360.0, ans=0.0 2023-10-10 00:48:04,296 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.466e+02 1.849e+02 2.130e+02 2.520e+02 4.323e+02, threshold=4.259e+02, percent-clipped=1.0 2023-10-10 00:48:06,536 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=185406.66666666666, ans=0.07 2023-10-10 00:48:36,372 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=185500.0, ans=0.125 2023-10-10 00:48:54,291 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.03 vs. limit=10.0 2023-10-10 00:48:58,883 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.15 vs. limit=15.0 2023-10-10 00:49:00,698 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.31 vs. limit=12.0 2023-10-10 00:49:06,467 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=185640.0, ans=0.125 2023-10-10 00:49:10,943 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=185640.0, ans=0.125 2023-10-10 00:49:18,726 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=185686.66666666666, ans=0.0 2023-10-10 00:49:24,203 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=185686.66666666666, ans=0.0 2023-10-10 00:49:38,519 INFO [train.py:1031] (3/4) Epoch 3, batch 12500, loss[loss=0.2666, simple_loss=0.3404, pruned_loss=0.09637, over 16605.00 frames. ], tot_loss[loss=0.2599, simple_loss=0.3361, pruned_loss=0.09185, over 32742606.10 frames. 
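
A pattern worth noting in the optim.py:471 records: with Clipping_scale=2.0, each logged threshold is roughly twice the logged median grad-norm (above, 2.0 * 2.074e+02 ≈ 4.149e+02 and 2.0 * 2.130e+02 ≈ 4.259e+02). This suggests the clipping threshold is derived from a lightly smoothed median of recent gradient norms rather than being a fixed constant, which also explains why percent-clipped stays near zero: the threshold adapts upward with the gradients. A standalone sketch of that idea, assuming a plain quantile over an explicit history window; the real optimizer keeps its own running statistics, so this is an illustration, not its actual code:

    # Sketch of median-based gradient clipping consistent with the quartile
    # records above: recompute quartiles from a history of gradient norms,
    # set the threshold to clipping_scale times the median, then clip.
    import torch

    def clip_like_the_log(params, norm_history, clipping_scale=2.0):
        norms = torch.tensor(norm_history)
        quartiles = torch.quantile(
            norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = clipping_scale * quartiles[2]   # ~2x the median, as logged
        total_norm = torch.nn.utils.clip_grad_norm_(params, threshold.item())
        was_clipped = bool(total_norm > threshold)  # feeds "percent-clipped"
        return quartiles, threshold, was_clipped

A median-based threshold is robust to the occasional outlier batch: one huge gradient norm moves the max quartile but barely moves the median, so a single bad batch gets clipped instead of inflating the threshold for everyone after it.
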
], batch size: 241, lr: 1.29e-02, grad_scale: 32.0 2023-10-10 00:49:53,121 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=185826.66666666666, ans=0.0 2023-10-10 00:49:57,399 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=185826.66666666666, ans=0.125 2023-10-10 00:49:59,007 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.463e+02 1.901e+02 2.157e+02 2.436e+02 4.050e+02, threshold=4.315e+02, percent-clipped=0.0 2023-10-10 00:50:11,136 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 00:50:16,165 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=185920.0, ans=0.2 2023-10-10 00:50:35,357 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=186013.33333333334, ans=0.09899494936611666 2023-10-10 00:50:43,371 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=186060.0, ans=0.2 2023-10-10 00:50:48,518 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=186060.0, ans=0.0 2023-10-10 00:51:12,015 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=186153.33333333334, ans=0.1 2023-10-10 00:51:45,272 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.330e+02 1.873e+02 2.084e+02 2.346e+02 3.459e+02, threshold=4.168e+02, percent-clipped=0.0 2023-10-10 00:52:13,462 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=186433.33333333334, ans=0.125 2023-10-10 00:52:14,395 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=186433.33333333334, ans=0.05 2023-10-10 00:52:26,679 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=186480.0, ans=0.125 2023-10-10 00:52:33,710 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 00:52:45,954 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=186573.33333333334, ans=0.0 2023-10-10 00:52:56,456 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=186620.0, ans=0.1 2023-10-10 00:53:08,132 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 00:53:28,980 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.27 vs. 
limit=15.0 2023-10-10 00:53:35,966 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.485e+02 1.774e+02 2.060e+02 2.264e+02 3.097e+02, threshold=4.119e+02, percent-clipped=0.0 2023-10-10 00:53:37,299 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=186806.66666666666, ans=0.04949747468305833 2023-10-10 00:53:46,383 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=186806.66666666666, ans=0.0 2023-10-10 00:53:51,705 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=186853.33333333334, ans=0.125 2023-10-10 00:54:23,151 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=186993.33333333334, ans=0.1 2023-10-10 00:54:33,911 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.32 vs. limit=12.0 2023-10-10 00:54:51,487 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=187133.33333333334, ans=0.125 2023-10-10 00:54:59,719 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.90 vs. limit=15.0 2023-10-10 00:55:23,168 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.485e+02 1.950e+02 2.197e+02 2.480e+02 3.980e+02, threshold=4.394e+02, percent-clipped=0.0 2023-10-10 00:55:25,364 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=187273.33333333334, ans=0.2 2023-10-10 00:55:29,185 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=187273.33333333334, ans=0.125 2023-10-10 00:55:32,791 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=187273.33333333334, ans=10.0 2023-10-10 00:55:38,366 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=187320.0, ans=0.125 2023-10-10 00:55:54,273 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=187366.66666666666, ans=0.5 2023-10-10 00:55:58,581 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=187413.33333333334, ans=0.1 2023-10-10 00:56:14,241 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=187460.0, ans=0.2 2023-10-10 00:56:19,507 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=187506.66666666666, ans=0.125 2023-10-10 00:56:41,415 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=187600.0, ans=0.0 2023-10-10 00:56:43,786 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=187600.0, ans=0.07 2023-10-10 00:56:50,445 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=187600.0, ans=0.0 2023-10-10 00:56:51,274 
INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=187600.0, ans=0.04949747468305833 2023-10-10 00:57:09,439 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=187693.33333333334, ans=0.1 2023-10-10 00:57:12,489 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.450e+02 1.906e+02 2.199e+02 2.559e+02 3.758e+02, threshold=4.397e+02, percent-clipped=0.0 2023-10-10 00:57:17,848 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=187740.0, ans=0.125 2023-10-10 00:57:18,967 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=187740.0, ans=0.2 2023-10-10 00:57:19,956 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=187740.0, ans=0.1 2023-10-10 00:57:24,152 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.83 vs. limit=22.5 2023-10-10 00:57:30,668 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=187786.66666666666, ans=0.125 2023-10-10 00:57:45,759 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=187880.0, ans=0.125 2023-10-10 00:57:48,589 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.61 vs. limit=15.0 2023-10-10 00:57:55,226 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=187926.66666666666, ans=0.125 2023-10-10 00:58:23,141 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=188020.0, ans=0.125 2023-10-10 00:58:24,237 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.77 vs. limit=15.0 2023-10-10 00:58:26,161 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.81 vs. limit=15.0 2023-10-10 00:58:30,449 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.82 vs. limit=12.0 2023-10-10 00:58:38,042 INFO [train.py:1031] (3/4) Epoch 3, batch 13000, loss[loss=0.26, simple_loss=0.3392, pruned_loss=0.09045, over 16897.00 frames. ], tot_loss[loss=0.2604, simple_loss=0.3367, pruned_loss=0.09205, over 32743459.44 frames. 
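
Most of the scaling.py:199 traffic in this log consists of ScheduledFloat dumps: hyperparameters such as the skip rates, bypass scale_min values, and feed-forward dropout_p are functions of batch_count rather than constants. That is consistent with what the records above show this deep into training: the attention and conv skip rates read 0.0 (or a small floor) while the out_proj dropout values hold at 0.1. A minimal sketch of such a schedule, assuming piecewise-linear interpolation over (batch_count, value) breakpoints; the class name matches the log, but the breakpoints below are invented for illustration:

    # Illustrative piecewise-linear schedule keyed on batch count. Before the
    # first breakpoint and after the last one the schedule is flat, which is
    # why late-training dumps show long runs of a single "ans=" value.
    import bisect

    class ScheduledFloat:
        def __init__(self, *points):
            # points: (batch_count, value) pairs in increasing order
            self.xs = [float(x) for x, _ in points]
            self.ys = [float(y) for _, y in points]

        def value(self, batch_count: float) -> float:
            i = bisect.bisect_right(self.xs, batch_count)
            if i == 0:
                return self.ys[0]                  # before first breakpoint
            if i == len(self.xs):
                return self.ys[-1]                 # after last breakpoint
            x0, x1 = self.xs[i - 1], self.xs[i]
            y0, y1 = self.ys[i - 1], self.ys[i]
            return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

    conv_skip_rate = ScheduledFloat((0, 0.2), (4000, 0.05), (16000, 0.0))
    print(conv_skip_rate.value(187600.0))   # -> 0.0, as in the records above
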
], batch size: 130, lr: 1.29e-02, grad_scale: 32.0 2023-10-10 00:58:48,554 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=188160.0, ans=0.0 2023-10-10 00:58:55,520 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=188160.0, ans=0.125 2023-10-10 00:58:57,853 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.431e+02 1.813e+02 2.076e+02 2.415e+02 3.629e+02, threshold=4.152e+02, percent-clipped=0.0 2023-10-10 00:59:15,155 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=188253.33333333334, ans=0.125 2023-10-10 00:59:24,657 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=188253.33333333334, ans=0.125 2023-10-10 00:59:27,670 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.32 vs. limit=15.0 2023-10-10 00:59:53,356 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=188393.33333333334, ans=0.0 2023-10-10 01:00:03,752 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=188440.0, ans=0.0 2023-10-10 01:00:25,288 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=188486.66666666666, ans=0.1 2023-10-10 01:00:29,424 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=188533.33333333334, ans=0.015 2023-10-10 01:00:36,113 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=188533.33333333334, ans=0.125 2023-10-10 01:00:43,258 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=188580.0, ans=0.125 2023-10-10 01:00:44,272 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=188580.0, ans=0.125 2023-10-10 01:00:58,899 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.474e+02 1.808e+02 2.001e+02 2.367e+02 2.873e+02, threshold=4.002e+02, percent-clipped=0.0 2023-10-10 01:01:06,532 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=188673.33333333334, ans=0.125 2023-10-10 01:01:37,656 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.75 vs. 
limit=12.0 2023-10-10 01:01:47,476 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=188860.0, ans=0.0 2023-10-10 01:01:52,551 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=188860.0, ans=0.07 2023-10-10 01:02:15,376 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=188953.33333333334, ans=0.0 2023-10-10 01:02:38,053 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=189046.66666666666, ans=0.0 2023-10-10 01:02:52,946 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.564e+02 1.857e+02 2.173e+02 2.492e+02 3.491e+02, threshold=4.346e+02, percent-clipped=0.0 2023-10-10 01:02:54,090 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=189140.0, ans=0.125 2023-10-10 01:03:17,288 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=189233.33333333334, ans=0.125 2023-10-10 01:03:24,213 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=189233.33333333334, ans=0.04949747468305833 2023-10-10 01:03:46,482 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=189326.66666666666, ans=0.125 2023-10-10 01:03:46,717 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.86 vs. limit=22.5 2023-10-10 01:03:59,594 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=189373.33333333334, ans=0.0 2023-10-10 01:03:59,609 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 01:04:07,132 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=189420.0, ans=0.125 2023-10-10 01:04:14,088 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=189466.66666666666, ans=0.1 2023-10-10 01:04:26,607 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=189513.33333333334, ans=0.1 2023-10-10 01:04:42,501 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.551e+02 1.925e+02 2.140e+02 2.499e+02 3.876e+02, threshold=4.281e+02, percent-clipped=0.0 2023-10-10 01:04:49,766 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=189606.66666666666, ans=0.2 2023-10-10 01:05:04,702 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=189700.0, ans=0.1 2023-10-10 01:05:14,028 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=189700.0, ans=0.0 2023-10-10 01:05:18,183 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=189746.66666666666, ans=0.125 2023-10-10 01:05:26,933 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, 
batch_count=189793.33333333334, ans=0.0 2023-10-10 01:05:40,041 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=189840.0, ans=0.2 2023-10-10 01:05:49,351 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=189886.66666666666, ans=0.125 2023-10-10 01:05:58,605 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=189886.66666666666, ans=0.5 2023-10-10 01:06:13,445 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=189980.0, ans=0.125 2023-10-10 01:06:30,524 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=190026.66666666666, ans=0.125 2023-10-10 01:06:35,860 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.543e+02 1.900e+02 2.146e+02 2.486e+02 3.706e+02, threshold=4.292e+02, percent-clipped=0.0 2023-10-10 01:07:09,331 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=190213.33333333334, ans=0.125 2023-10-10 01:07:19,783 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=190260.0, ans=0.1 2023-10-10 01:07:27,054 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.33 vs. limit=22.5 2023-10-10 01:07:32,583 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=190306.66666666666, ans=0.0 2023-10-10 01:07:33,487 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=190306.66666666666, ans=0.125 2023-10-10 01:08:03,933 INFO [train.py:1031] (3/4) Epoch 3, batch 13500, loss[loss=0.2626, simple_loss=0.3417, pruned_loss=0.09174, over 16890.00 frames. ], tot_loss[loss=0.2594, simple_loss=0.3357, pruned_loss=0.09152, over 32751285.24 frames. ], batch size: 146, lr: 1.28e-02, grad_scale: 32.0 2023-10-10 01:08:06,178 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.57 vs. limit=15.0 2023-10-10 01:08:14,700 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=190493.33333333334, ans=0.0 2023-10-10 01:08:18,877 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=5.09 vs. limit=12.0 2023-10-10 01:08:24,141 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.409e+02 1.721e+02 1.948e+02 2.320e+02 4.142e+02, threshold=3.895e+02, percent-clipped=0.0 2023-10-10 01:08:31,423 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.84 vs. 
limit=12.0 2023-10-10 01:08:37,830 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=190586.66666666666, ans=0.1 2023-10-10 01:08:40,547 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=190586.66666666666, ans=10.0 2023-10-10 01:09:13,887 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=190726.66666666666, ans=0.05 2023-10-10 01:09:13,929 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=190726.66666666666, ans=0.1 2023-10-10 01:09:14,743 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=190726.66666666666, ans=0.125 2023-10-10 01:09:17,336 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.57 vs. limit=15.0 2023-10-10 01:09:24,811 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.32 vs. limit=15.0 2023-10-10 01:09:30,822 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.36 vs. limit=6.0 2023-10-10 01:09:35,938 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=190820.0, ans=0.125 2023-10-10 01:09:46,820 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=190866.66666666666, ans=0.1 2023-10-10 01:09:46,824 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=190866.66666666666, ans=0.125 2023-10-10 01:09:52,661 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=190913.33333333334, ans=0.125 2023-10-10 01:10:00,585 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=190960.0, ans=0.1 2023-10-10 01:10:06,242 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=190960.0, ans=0.0 2023-10-10 01:10:10,247 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.421e+02 1.842e+02 2.047e+02 2.375e+02 3.850e+02, threshold=4.094e+02, percent-clipped=0.0 2023-10-10 01:10:12,163 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=191006.66666666666, ans=0.0 2023-10-10 01:10:26,513 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=191053.33333333334, ans=0.125 2023-10-10 01:10:29,545 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=191100.0, ans=0.125 2023-10-10 01:10:31,213 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff2.min_abs, batch_count=191100.0, ans=0.1 2023-10-10 01:10:37,368 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=191146.66666666666, 
ans=0.1 2023-10-10 01:11:11,965 INFO [train.py:1031] (3/4) Epoch 4, batch 0, loss[loss=0.25, simple_loss=0.3217, pruned_loss=0.08915, over 16671.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3217, pruned_loss=0.08915, over 16671.00 frames. ], batch size: 241, lr: 1.07e-02, grad_scale: 32.0 2023-10-10 01:11:11,967 INFO [train.py:1054] (3/4) Computing validation loss 2023-10-10 01:11:20,236 INFO [train.py:1063] (3/4) Epoch 4, validation: loss=0.2505, simple_loss=0.3358, pruned_loss=0.08267, over 1020973.00 frames. 2023-10-10 01:11:20,237 INFO [train.py:1064] (3/4) Maximum memory allocated so far is 16891MB 2023-10-10 01:11:47,098 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=191263.33333333334, ans=0.125 2023-10-10 01:12:02,500 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=191310.0, ans=0.0 2023-10-10 01:12:04,941 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.14 vs. limit=15.0 2023-10-10 01:12:32,432 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=191450.0, ans=0.0 2023-10-10 01:12:34,517 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.441e+02 1.686e+02 1.901e+02 2.036e+02 2.676e+02, threshold=3.802e+02, percent-clipped=0.0 2023-10-10 01:12:36,696 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.03 vs. limit=15.0 2023-10-10 01:12:42,130 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=191496.66666666666, ans=0.1 2023-10-10 01:13:13,513 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=191590.0, ans=0.0 2023-10-10 01:13:20,037 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=191636.66666666666, ans=0.0 2023-10-10 01:13:21,837 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=191636.66666666666, ans=0.95 2023-10-10 01:13:37,930 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=191730.0, ans=0.125 2023-10-10 01:13:44,372 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=191730.0, ans=0.0 2023-10-10 01:14:08,568 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=191823.33333333334, ans=0.1 2023-10-10 01:14:17,124 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.45 vs. 
limit=6.0 2023-10-10 01:14:23,515 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=191916.66666666666, ans=0.1 2023-10-10 01:14:24,938 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=191916.66666666666, ans=0.125 2023-10-10 01:14:25,496 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.367e+02 1.766e+02 1.944e+02 2.177e+02 3.808e+02, threshold=3.888e+02, percent-clipped=1.0 2023-10-10 01:14:31,699 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=191963.33333333334, ans=0.1 2023-10-10 01:14:35,924 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=191963.33333333334, ans=0.09899494936611666 2023-10-10 01:14:42,273 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=192010.0, ans=0.0 2023-10-10 01:14:54,339 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=192056.66666666666, ans=0.0 2023-10-10 01:15:06,611 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=192103.33333333334, ans=0.2 2023-10-10 01:15:10,817 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=192103.33333333334, ans=0.0 2023-10-10 01:15:18,262 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 01:15:36,699 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=192243.33333333334, ans=0.1 2023-10-10 01:15:45,444 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=192243.33333333334, ans=0.1 2023-10-10 01:15:50,251 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=192290.0, ans=0.05 2023-10-10 01:15:57,307 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=192290.0, ans=0.0 2023-10-10 01:16:18,861 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.409e+02 1.763e+02 1.957e+02 2.307e+02 3.327e+02, threshold=3.913e+02, percent-clipped=0.0 2023-10-10 01:16:22,780 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=192430.0, ans=0.015 2023-10-10 01:16:38,326 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.whiten.whitening_limit, batch_count=192476.66666666666, ans=12.0 2023-10-10 01:16:54,134 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=192523.33333333334, ans=0.125 2023-10-10 01:17:03,681 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=192570.0, ans=0.2 2023-10-10 01:17:34,707 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=192710.0, ans=0.125 2023-10-10 01:17:45,210 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, 
batch_count=192756.66666666666, ans=0.0 2023-10-10 01:17:57,807 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=192803.33333333334, ans=0.0 2023-10-10 01:18:04,526 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.39 vs. limit=10.0 2023-10-10 01:18:08,412 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.321e+02 1.714e+02 2.037e+02 2.314e+02 3.392e+02, threshold=4.074e+02, percent-clipped=0.0 2023-10-10 01:18:17,646 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=192896.66666666666, ans=0.09899494936611666 2023-10-10 01:18:24,251 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=8.14 vs. limit=12.0 2023-10-10 01:19:14,308 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=193130.0, ans=0.0 2023-10-10 01:19:14,316 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_abs, batch_count=193130.0, ans=0.5 2023-10-10 01:19:19,973 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.19 vs. limit=15.0 2023-10-10 01:19:21,233 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=193176.66666666666, ans=0.0 2023-10-10 01:19:45,310 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.33 vs. limit=12.0 2023-10-10 01:19:59,340 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.391e+02 1.949e+02 2.312e+02 2.674e+02 4.571e+02, threshold=4.624e+02, percent-clipped=1.0 2023-10-10 01:20:06,394 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=193363.33333333334, ans=0.0 2023-10-10 01:20:09,017 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=193363.33333333334, ans=10.0 2023-10-10 01:20:28,261 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=193456.66666666666, ans=0.0 2023-10-10 01:20:34,695 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=193456.66666666666, ans=0.125 2023-10-10 01:20:38,616 INFO [train.py:1031] (3/4) Epoch 4, batch 500, loss[loss=0.2417, simple_loss=0.3204, pruned_loss=0.0815, over 16801.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.3308, pruned_loss=0.08769, over 7257592.92 frames. ], batch size: 146, lr: 1.07e-02, grad_scale: 32.0 2023-10-10 01:20:44,670 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=193503.33333333334, ans=0.025 2023-10-10 01:21:04,007 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.20 vs. 
limit=15.0 2023-10-10 01:21:07,120 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=193596.66666666666, ans=0.125 2023-10-10 01:21:09,819 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=193596.66666666666, ans=0.0 2023-10-10 01:21:10,158 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.36 vs. limit=15.0 2023-10-10 01:21:18,222 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=193643.33333333334, ans=0.125 2023-10-10 01:21:31,392 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=193690.0, ans=0.125 2023-10-10 01:21:31,938 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.44 vs. limit=15.0 2023-10-10 01:21:38,425 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.98 vs. limit=6.0 2023-10-10 01:21:49,812 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=193783.33333333334, ans=0.125 2023-10-10 01:21:50,332 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.530e+02 1.814e+02 1.997e+02 2.239e+02 3.318e+02, threshold=3.995e+02, percent-clipped=0.0 2023-10-10 01:22:02,093 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=193830.0, ans=0.025 2023-10-10 01:22:03,082 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=193830.0, ans=0.2 2023-10-10 01:22:04,930 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=193830.0, ans=0.0 2023-10-10 01:22:09,147 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=27.98 vs. limit=22.5 2023-10-10 01:22:20,283 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.28 vs. limit=15.0 2023-10-10 01:22:30,421 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=193970.0, ans=0.1 2023-10-10 01:22:34,150 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=193970.0, ans=0.125 2023-10-10 01:22:42,558 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.09 vs. 
limit=15.0 2023-10-10 01:22:45,008 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=194016.66666666666, ans=0.0 2023-10-10 01:22:54,274 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=194063.33333333334, ans=0.125 2023-10-10 01:22:58,144 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=194063.33333333334, ans=0.0 2023-10-10 01:23:08,224 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.28 vs. limit=22.5 2023-10-10 01:23:25,842 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.95 vs. limit=22.5 2023-10-10 01:23:31,356 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=194203.33333333334, ans=0.125 2023-10-10 01:23:32,271 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=194203.33333333334, ans=0.125 2023-10-10 01:23:33,283 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=194250.0, ans=0.0 2023-10-10 01:23:39,594 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.478e+02 1.712e+02 1.980e+02 2.236e+02 4.011e+02, threshold=3.960e+02, percent-clipped=1.0 2023-10-10 01:23:50,874 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.19 vs. limit=10.0 2023-10-10 01:23:59,174 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.03 vs. limit=12.0 2023-10-10 01:23:59,907 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=194343.33333333334, ans=0.1 2023-10-10 01:24:13,987 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.42 vs. limit=15.0 2023-10-10 01:24:31,111 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=194483.33333333334, ans=0.125 2023-10-10 01:24:33,929 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=194483.33333333334, ans=0.1 2023-10-10 01:24:50,089 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.45 vs. 
limit=22.5 2023-10-10 01:24:57,266 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=194576.66666666666, ans=0.125 2023-10-10 01:25:19,884 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=194670.0, ans=0.125 2023-10-10 01:25:19,949 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=194670.0, ans=0.0 2023-10-10 01:25:32,799 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.458e+02 1.786e+02 1.920e+02 2.132e+02 2.991e+02, threshold=3.839e+02, percent-clipped=0.0 2023-10-10 01:25:37,898 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=194763.33333333334, ans=0.0 2023-10-10 01:25:37,915 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=194763.33333333334, ans=0.2 2023-10-10 01:25:38,819 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=194763.33333333334, ans=0.0 2023-10-10 01:25:55,983 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=194810.0, ans=0.2 2023-10-10 01:26:17,864 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=194903.33333333334, ans=0.0 2023-10-10 01:26:32,565 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.37 vs. limit=22.5 2023-10-10 01:26:34,786 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.25 vs. limit=10.0 2023-10-10 01:27:06,654 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=195090.0, ans=0.125 2023-10-10 01:27:27,918 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.428e+02 1.797e+02 1.968e+02 2.163e+02 3.994e+02, threshold=3.937e+02, percent-clipped=1.0 2023-10-10 01:27:36,146 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=195230.0, ans=0.2 2023-10-10 01:27:39,252 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=195230.0, ans=0.1 2023-10-10 01:27:59,219 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=195323.33333333334, ans=0.125 2023-10-10 01:28:14,634 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=195370.0, ans=0.125 2023-10-10 01:28:18,519 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=195416.66666666666, ans=0.0 2023-10-10 01:28:22,498 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.92 vs. 
limit=15.0 2023-10-10 01:28:30,950 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=195463.33333333334, ans=0.0 2023-10-10 01:28:55,661 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff2.min_abs, batch_count=195556.66666666666, ans=0.1 2023-10-10 01:28:55,754 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=195556.66666666666, ans=0.125 2023-10-10 01:28:56,868 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=195556.66666666666, ans=0.0 2023-10-10 01:29:06,313 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=195603.33333333334, ans=0.0 2023-10-10 01:29:08,095 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=195603.33333333334, ans=0.125 2023-10-10 01:29:21,476 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.341e+02 1.657e+02 1.922e+02 2.206e+02 3.120e+02, threshold=3.843e+02, percent-clipped=0.0 2023-10-10 01:29:58,434 INFO [train.py:1031] (3/4) Epoch 4, batch 1000, loss[loss=0.2196, simple_loss=0.3117, pruned_loss=0.06373, over 16921.00 frames. ], tot_loss[loss=0.2538, simple_loss=0.3314, pruned_loss=0.0881, over 12920724.55 frames. ], batch size: 104, lr: 1.06e-02, grad_scale: 32.0 2023-10-10 01:29:58,991 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.15 vs. limit=6.0 2023-10-10 01:30:00,658 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.74 vs. limit=22.5 2023-10-10 01:30:07,352 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=195883.33333333334, ans=0.0 2023-10-10 01:30:34,049 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=195976.66666666666, ans=0.125 2023-10-10 01:30:35,377 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.26 vs. limit=15.0 2023-10-10 01:30:38,297 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.65 vs. limit=15.0 2023-10-10 01:31:07,710 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.455e+02 1.783e+02 2.000e+02 2.205e+02 3.095e+02, threshold=4.001e+02, percent-clipped=0.0 2023-10-10 01:31:39,246 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=196256.66666666666, ans=0.125 2023-10-10 01:31:46,673 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.65 vs. 
limit=15.0 2023-10-10 01:31:52,423 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=196303.33333333334, ans=0.125 2023-10-10 01:32:16,991 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=196396.66666666666, ans=0.125 2023-10-10 01:32:36,734 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=196490.0, ans=0.125 2023-10-10 01:32:38,559 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.33 vs. limit=15.0 2023-10-10 01:32:38,693 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.33 vs. limit=22.5 2023-10-10 01:32:41,117 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=196490.0, ans=0.1 2023-10-10 01:32:44,740 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=196536.66666666666, ans=0.125 2023-10-10 01:32:48,335 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=196536.66666666666, ans=0.0 2023-10-10 01:33:02,257 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.361e+02 1.783e+02 1.989e+02 2.444e+02 3.708e+02, threshold=3.978e+02, percent-clipped=0.0 2023-10-10 01:33:29,421 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=196676.66666666666, ans=0.09899494936611666 2023-10-10 01:33:40,105 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 01:33:41,215 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=196723.33333333334, ans=0.07 2023-10-10 01:33:42,016 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=196723.33333333334, ans=0.125 2023-10-10 01:34:00,276 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.79 vs. 
limit=15.0 2023-10-10 01:34:07,493 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=196863.33333333334, ans=0.0 2023-10-10 01:34:42,191 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=197003.33333333334, ans=0.0 2023-10-10 01:35:00,488 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.440e+02 1.679e+02 1.904e+02 2.231e+02 3.118e+02, threshold=3.808e+02, percent-clipped=0.0 2023-10-10 01:35:02,400 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=197050.0, ans=0.1 2023-10-10 01:35:05,690 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=197096.66666666666, ans=0.125 2023-10-10 01:35:14,889 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=197096.66666666666, ans=0.125 2023-10-10 01:35:16,537 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=197143.33333333334, ans=0.125 2023-10-10 01:35:36,289 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.38 vs. limit=6.0 2023-10-10 01:35:46,014 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=197236.66666666666, ans=0.125 2023-10-10 01:35:49,562 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=197236.66666666666, ans=0.1 2023-10-10 01:35:58,488 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.70 vs. limit=15.0 2023-10-10 01:36:20,375 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=197376.66666666666, ans=0.2 2023-10-10 01:36:23,722 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=197423.33333333334, ans=0.2 2023-10-10 01:36:23,881 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=197423.33333333334, ans=0.0 2023-10-10 01:36:47,279 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=197516.66666666666, ans=0.0 2023-10-10 01:36:49,705 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.433e+02 1.701e+02 1.863e+02 2.113e+02 3.574e+02, threshold=3.725e+02, percent-clipped=0.0 2023-10-10 01:36:52,777 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=197516.66666666666, ans=0.125 2023-10-10 01:37:13,152 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.16 vs. limit=15.0 2023-10-10 01:37:18,174 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.52 vs. 
limit=15.0 2023-10-10 01:37:20,623 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=197656.66666666666, ans=0.125 2023-10-10 01:37:40,562 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=197750.0, ans=0.2 2023-10-10 01:37:45,055 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=197750.0, ans=0.1 2023-10-10 01:37:47,179 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=197750.0, ans=0.125 2023-10-10 01:38:18,052 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=197890.0, ans=0.125 2023-10-10 01:38:21,764 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=197890.0, ans=0.125 2023-10-10 01:38:34,681 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=197936.66666666666, ans=0.0 2023-10-10 01:38:44,941 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.357e+02 1.709e+02 1.896e+02 2.165e+02 3.150e+02, threshold=3.793e+02, percent-clipped=0.0 2023-10-10 01:38:53,070 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.78 vs. limit=15.0 2023-10-10 01:39:26,388 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=198123.33333333334, ans=0.125 2023-10-10 01:39:27,804 INFO [train.py:1031] (3/4) Epoch 4, batch 1500, loss[loss=0.2256, simple_loss=0.3171, pruned_loss=0.06702, over 16939.00 frames. ], tot_loss[loss=0.2505, simple_loss=0.3287, pruned_loss=0.08616, over 17334122.14 frames. ], batch size: 82, lr: 1.06e-02, grad_scale: 32.0 2023-10-10 01:39:40,302 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=198216.66666666666, ans=0.1 2023-10-10 01:40:06,674 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=198310.0, ans=0.1 2023-10-10 01:40:26,959 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=198403.33333333334, ans=0.125 2023-10-10 01:40:30,670 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=198403.33333333334, ans=0.125 2023-10-10 01:40:40,333 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.382e+02 1.717e+02 1.867e+02 2.074e+02 2.969e+02, threshold=3.735e+02, percent-clipped=0.0 2023-10-10 01:40:51,350 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.53 vs. 
limit=6.0 2023-10-10 01:40:56,132 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=198496.66666666666, ans=0.2 2023-10-10 01:40:58,467 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=198543.33333333334, ans=0.125 2023-10-10 01:41:23,719 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=198636.66666666666, ans=0.95 2023-10-10 01:42:09,553 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=198823.33333333334, ans=0.2 2023-10-10 01:42:13,569 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_abs, batch_count=198823.33333333334, ans=0.5 2023-10-10 01:42:18,211 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=198823.33333333334, ans=0.125 2023-10-10 01:42:20,499 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=198870.0, ans=15.0 2023-10-10 01:42:21,015 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=198870.0, ans=0.125 2023-10-10 01:42:32,689 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=198916.66666666666, ans=0.1 2023-10-10 01:42:36,518 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.380e+02 1.771e+02 1.942e+02 2.224e+02 3.344e+02, threshold=3.883e+02, percent-clipped=0.0 2023-10-10 01:42:39,543 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=198916.66666666666, ans=0.1 2023-10-10 01:43:18,133 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 01:44:22,176 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=199383.33333333334, ans=0.125 2023-10-10 01:44:25,233 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.451e+02 1.785e+02 1.983e+02 2.316e+02 3.126e+02, threshold=3.966e+02, percent-clipped=0.0 2023-10-10 01:44:32,667 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.84 vs. limit=15.0 2023-10-10 01:44:47,815 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=199476.66666666666, ans=0.05 2023-10-10 01:45:03,527 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.18 vs. 
limit=15.0 2023-10-10 01:45:03,950 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=199523.33333333334, ans=0.0 2023-10-10 01:45:04,030 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=199523.33333333334, ans=0.125 2023-10-10 01:45:08,933 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=199570.0, ans=0.0 2023-10-10 01:45:33,268 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=199663.33333333334, ans=0.125 2023-10-10 01:45:41,324 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=199710.0, ans=0.0 2023-10-10 01:45:42,576 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.05 vs. limit=12.0 2023-10-10 01:46:22,910 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.329e+02 1.687e+02 1.856e+02 2.154e+02 3.052e+02, threshold=3.711e+02, percent-clipped=0.0 2023-10-10 01:46:43,452 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=199943.33333333334, ans=0.125 2023-10-10 01:47:09,392 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=200036.66666666666, ans=0.0 2023-10-10 01:47:20,487 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.15 vs. limit=15.0 2023-10-10 01:47:22,762 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.67 vs. 
limit=15.0 2023-10-10 01:47:23,224 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=200130.0, ans=0.125 2023-10-10 01:47:25,174 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=200130.0, ans=0.1 2023-10-10 01:47:32,187 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=200130.0, ans=0.2 2023-10-10 01:48:00,147 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=200223.33333333334, ans=0.0 2023-10-10 01:48:14,984 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=200270.0, ans=0.0 2023-10-10 01:48:15,179 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=200270.0, ans=0.2 2023-10-10 01:48:16,353 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=200270.0, ans=0.025 2023-10-10 01:48:25,081 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.379e+02 1.783e+02 1.928e+02 2.163e+02 3.381e+02, threshold=3.856e+02, percent-clipped=0.0 2023-10-10 01:48:44,673 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=200410.0, ans=0.2 2023-10-10 01:49:06,901 INFO [train.py:1031] (3/4) Epoch 4, batch 2000, loss[loss=0.2532, simple_loss=0.3417, pruned_loss=0.08232, over 16866.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3289, pruned_loss=0.08584, over 20780364.35 frames. ], batch size: 130, lr: 1.05e-02, grad_scale: 64.0 2023-10-10 01:49:37,747 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=200596.66666666666, ans=0.125 2023-10-10 01:49:42,572 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.63 vs. limit=15.0 2023-10-10 01:49:52,981 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=200643.33333333334, ans=0.0 2023-10-10 01:50:29,466 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.460e+02 1.747e+02 1.954e+02 2.274e+02 3.166e+02, threshold=3.908e+02, percent-clipped=0.0 2023-10-10 01:50:54,216 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=200876.66666666666, ans=0.04949747468305833 2023-10-10 01:50:55,259 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=200876.66666666666, ans=0.2 2023-10-10 01:50:59,910 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.71 vs. 
limit=15.0 2023-10-10 01:51:21,353 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=200970.0, ans=0.1 2023-10-10 01:51:25,800 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=200970.0, ans=0.0 2023-10-10 01:51:36,783 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=201016.66666666666, ans=0.1 2023-10-10 01:51:43,166 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=201016.66666666666, ans=0.125 2023-10-10 01:51:48,192 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=201063.33333333334, ans=0.125 2023-10-10 01:52:09,430 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=201110.0, ans=0.0 2023-10-10 01:52:23,589 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=201156.66666666666, ans=0.125 2023-10-10 01:52:25,595 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=201156.66666666666, ans=0.1 2023-10-10 01:52:33,302 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=201203.33333333334, ans=0.125 2023-10-10 01:52:42,282 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=201250.0, ans=0.2 2023-10-10 01:52:46,821 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.411e+02 1.786e+02 2.055e+02 2.517e+02 4.090e+02, threshold=4.109e+02, percent-clipped=1.0 2023-10-10 01:52:48,851 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=201250.0, ans=0.0 2023-10-10 01:53:04,338 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=201343.33333333334, ans=0.125 2023-10-10 01:53:07,743 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=201343.33333333334, ans=0.0 2023-10-10 01:53:21,184 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=201390.0, ans=0.125 2023-10-10 01:53:26,190 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.41 vs. 
limit=22.5 2023-10-10 01:53:31,273 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=201436.66666666666, ans=0.05 2023-10-10 01:53:40,973 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=201483.33333333334, ans=0.0 2023-10-10 01:53:49,555 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=201530.0, ans=0.0 2023-10-10 01:53:53,984 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=201530.0, ans=0.0 2023-10-10 01:54:01,634 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.74 vs. limit=10.0 2023-10-10 01:54:25,042 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=201670.0, ans=0.0 2023-10-10 01:54:39,306 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.514e+02 1.812e+02 2.033e+02 2.373e+02 3.202e+02, threshold=4.066e+02, percent-clipped=0.0 2023-10-10 01:54:57,939 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.08 vs. limit=15.0 2023-10-10 01:55:01,574 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.33 vs. limit=22.5 2023-10-10 01:55:07,535 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=201856.66666666666, ans=0.1 2023-10-10 01:55:18,030 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=201903.33333333334, ans=0.0 2023-10-10 01:55:26,443 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.32 vs. limit=15.0 2023-10-10 01:55:34,849 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=201950.0, ans=0.125 2023-10-10 01:55:57,339 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.14 vs. 
limit=15.0 2023-10-10 01:55:58,813 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=202043.33333333334, ans=0.125 2023-10-10 01:55:58,843 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=202043.33333333334, ans=0.125 2023-10-10 01:56:07,817 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys.whitening_limit, batch_count=202090.0, ans=6.0 2023-10-10 01:56:29,222 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=202183.33333333334, ans=0.125 2023-10-10 01:56:31,187 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.490e+02 1.800e+02 2.097e+02 2.321e+02 3.365e+02, threshold=4.194e+02, percent-clipped=0.0 2023-10-10 01:56:36,229 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=202230.0, ans=0.125 2023-10-10 01:56:36,316 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=202230.0, ans=0.0 2023-10-10 01:56:38,158 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=202230.0, ans=0.0 2023-10-10 01:56:46,945 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 01:56:47,771 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=202276.66666666666, ans=0.125 2023-10-10 01:56:55,042 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=202276.66666666666, ans=0.2 2023-10-10 01:56:55,943 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=202276.66666666666, ans=10.0 2023-10-10 01:57:13,266 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=202370.0, ans=0.125 2023-10-10 01:57:27,942 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=202463.33333333334, ans=0.125 2023-10-10 01:57:29,825 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=202463.33333333334, ans=0.125 2023-10-10 01:57:32,947 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=202463.33333333334, ans=0.125 2023-10-10 01:57:40,039 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.28 vs. limit=15.0 2023-10-10 01:57:52,774 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.16 vs. 
limit=22.5 2023-10-10 01:57:57,670 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=202556.66666666666, ans=0.125 2023-10-10 01:58:09,248 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=202603.33333333334, ans=0.125 2023-10-10 01:58:12,600 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=202650.0, ans=0.125 2023-10-10 01:58:17,275 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.480e+02 1.821e+02 2.148e+02 2.337e+02 3.245e+02, threshold=4.295e+02, percent-clipped=0.0 2023-10-10 01:58:24,230 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=202696.66666666666, ans=0.07 2023-10-10 01:58:31,140 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=202743.33333333334, ans=0.0 2023-10-10 01:58:33,825 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=202743.33333333334, ans=0.125 2023-10-10 01:58:37,546 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=202743.33333333334, ans=0.1 2023-10-10 01:58:40,874 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=202743.33333333334, ans=0.0 2023-10-10 01:58:52,517 INFO [train.py:1031] (3/4) Epoch 4, batch 2500, loss[loss=0.2626, simple_loss=0.3382, pruned_loss=0.09354, over 16511.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3288, pruned_loss=0.08591, over 23441366.74 frames. ], batch size: 266, lr: 1.04e-02, grad_scale: 32.0 2023-10-10 01:59:03,157 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.71 vs. 
limit=12.0 2023-10-10 01:59:20,498 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=202930.0, ans=0.0 2023-10-10 01:59:22,307 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=202976.66666666666, ans=0.125 2023-10-10 01:59:25,887 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=202976.66666666666, ans=0.0 2023-10-10 01:59:32,058 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=203023.33333333334, ans=0.0 2023-10-10 01:59:36,251 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=203023.33333333334, ans=0.1 2023-10-10 01:59:38,696 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=203023.33333333334, ans=0.125 2023-10-10 01:59:41,547 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=203023.33333333334, ans=0.125 2023-10-10 01:59:50,281 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=203070.0, ans=0.04949747468305833 2023-10-10 01:59:51,397 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=203070.0, ans=0.1 2023-10-10 01:59:53,456 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.64 vs. limit=15.0 2023-10-10 01:59:59,838 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.432e+02 1.781e+02 1.997e+02 2.293e+02 3.596e+02, threshold=3.994e+02, percent-clipped=0.0 2023-10-10 02:00:14,510 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=203210.0, ans=0.125 2023-10-10 02:00:27,937 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=203256.66666666666, ans=0.125 2023-10-10 02:00:27,956 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=203256.66666666666, ans=0.2 2023-10-10 02:00:30,216 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=203256.66666666666, ans=0.0 2023-10-10 02:01:22,304 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=203443.33333333334, ans=0.125 2023-10-10 02:01:29,489 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=203490.0, ans=0.125 2023-10-10 02:01:30,779 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.88 vs. 
limit=10.0 2023-10-10 02:01:31,401 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=203490.0, ans=0.0 2023-10-10 02:01:49,618 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=203583.33333333334, ans=0.0 2023-10-10 02:01:51,162 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.490e+02 1.804e+02 2.005e+02 2.297e+02 3.145e+02, threshold=4.009e+02, percent-clipped=0.0 2023-10-10 02:01:54,561 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=203630.0, ans=0.0 2023-10-10 02:02:29,464 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 02:02:44,615 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.28 vs. limit=6.0 2023-10-10 02:03:07,091 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=203910.0, ans=0.2 2023-10-10 02:03:49,059 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.393e+02 1.689e+02 1.857e+02 2.109e+02 3.521e+02, threshold=3.714e+02, percent-clipped=0.0 2023-10-10 02:03:49,467 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=204050.0, ans=0.95 2023-10-10 02:04:21,295 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=204190.0, ans=0.125 2023-10-10 02:04:54,171 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 02:05:26,247 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=204470.0, ans=0.0 2023-10-10 02:05:27,500 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.46 vs. 
limit=10.0 2023-10-10 02:05:44,864 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.415e+02 1.796e+02 2.076e+02 2.594e+02 3.813e+02, threshold=4.152e+02, percent-clipped=2.0 2023-10-10 02:06:05,790 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=204610.0, ans=0.0 2023-10-10 02:06:10,089 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=204610.0, ans=0.125 2023-10-10 02:06:11,124 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=204610.0, ans=0.1 2023-10-10 02:07:00,205 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=204796.66666666666, ans=0.125 2023-10-10 02:07:09,478 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=204843.33333333334, ans=0.125 2023-10-10 02:07:18,017 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=204890.0, ans=0.0 2023-10-10 02:07:40,488 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=204983.33333333334, ans=0.1 2023-10-10 02:07:42,015 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.418e+02 1.749e+02 1.882e+02 2.081e+02 2.740e+02, threshold=3.765e+02, percent-clipped=0.0 2023-10-10 02:08:13,507 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.29 vs. limit=6.0 2023-10-10 02:08:17,518 INFO [train.py:1031] (3/4) Epoch 4, batch 3000, loss[loss=0.236, simple_loss=0.3189, pruned_loss=0.0766, over 16822.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3276, pruned_loss=0.08537, over 25533158.65 frames. ], batch size: 146, lr: 1.04e-02, grad_scale: 32.0 2023-10-10 02:08:17,850 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=205170.0, ans=0.125 2023-10-10 02:08:23,717 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.33 vs. limit=15.0 2023-10-10 02:08:35,832 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.26 vs. 
limit=6.0 2023-10-10 02:08:43,087 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=205263.33333333334, ans=0.125 2023-10-10 02:08:47,784 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=205263.33333333334, ans=0.125 2023-10-10 02:09:31,466 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.440e+02 1.744e+02 1.941e+02 2.104e+02 3.008e+02, threshold=3.882e+02, percent-clipped=0.0 2023-10-10 02:09:36,760 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=205496.66666666666, ans=0.07 2023-10-10 02:09:36,771 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=205496.66666666666, ans=0.2 2023-10-10 02:09:41,748 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=205496.66666666666, ans=0.0 2023-10-10 02:10:23,825 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=205636.66666666666, ans=0.125 2023-10-10 02:10:36,379 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=205683.33333333334, ans=0.125 2023-10-10 02:11:01,426 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=205823.33333333334, ans=0.0 2023-10-10 02:11:09,105 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.49 vs. limit=10.0 2023-10-10 02:11:11,840 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=205870.0, ans=0.0 2023-10-10 02:11:13,008 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.85 vs. limit=22.5 2023-10-10 02:11:13,135 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.28 vs. limit=6.0 2023-10-10 02:11:13,995 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.34 vs. limit=22.5 2023-10-10 02:11:29,483 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.322e+02 1.706e+02 1.926e+02 2.287e+02 3.827e+02, threshold=3.853e+02, percent-clipped=0.0 2023-10-10 02:11:36,692 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=205963.33333333334, ans=0.05 2023-10-10 02:11:36,707 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=205963.33333333334, ans=0.1 2023-10-10 02:11:45,419 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=206010.0, ans=0.125 2023-10-10 02:11:58,203 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=11.09 vs. 
limit=15.0 2023-10-10 02:12:02,636 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=206056.66666666666, ans=0.0 2023-10-10 02:12:08,721 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=206103.33333333334, ans=0.125 2023-10-10 02:12:13,773 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=206103.33333333334, ans=0.125 2023-10-10 02:12:28,461 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=206150.0, ans=0.125 2023-10-10 02:12:44,781 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.32 vs. limit=6.0 2023-10-10 02:12:47,959 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=206243.33333333334, ans=0.125 2023-10-10 02:12:52,490 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=206243.33333333334, ans=0.1 2023-10-10 02:13:02,057 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 02:13:18,239 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=206336.66666666666, ans=0.125 2023-10-10 02:13:30,737 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.368e+02 1.778e+02 1.961e+02 2.323e+02 3.840e+02, threshold=3.921e+02, percent-clipped=0.0 2023-10-10 02:14:14,849 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=206570.0, ans=10.0 2023-10-10 02:14:41,086 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.58 vs. 
limit=22.5 2023-10-10 02:15:14,984 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=206850.0, ans=0.0 2023-10-10 02:15:19,898 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.528e+02 1.880e+02 2.112e+02 2.409e+02 4.134e+02, threshold=4.224e+02, percent-clipped=1.0 2023-10-10 02:15:28,962 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=206896.66666666666, ans=0.125 2023-10-10 02:15:38,086 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=206943.33333333334, ans=0.05 2023-10-10 02:15:43,851 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=206943.33333333334, ans=0.125 2023-10-10 02:15:52,074 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=206990.0, ans=0.125 2023-10-10 02:15:58,495 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=206990.0, ans=0.125 2023-10-10 02:16:11,796 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=207083.33333333334, ans=0.0 2023-10-10 02:16:18,834 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.71 vs. limit=15.0 2023-10-10 02:17:01,963 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=207270.0, ans=0.95 2023-10-10 02:17:12,056 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.501e+02 1.805e+02 2.015e+02 2.324e+02 3.356e+02, threshold=4.029e+02, percent-clipped=0.0 2023-10-10 02:17:19,051 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=207363.33333333334, ans=0.125 2023-10-10 02:17:28,050 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=207410.0, ans=0.125 2023-10-10 02:17:40,180 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=207456.66666666666, ans=0.125 2023-10-10 02:17:47,557 INFO [train.py:1031] (3/4) Epoch 4, batch 3500, loss[loss=0.2253, simple_loss=0.3057, pruned_loss=0.0724, over 16249.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.3273, pruned_loss=0.08524, over 27149672.91 frames. 
], batch size: 50, lr: 1.03e-02, grad_scale: 16.0 2023-10-10 02:17:51,204 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=207503.33333333334, ans=0.125 2023-10-10 02:17:52,228 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=207503.33333333334, ans=0.1 2023-10-10 02:17:53,376 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=207503.33333333334, ans=0.125 2023-10-10 02:18:19,631 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=207643.33333333334, ans=0.125 2023-10-10 02:18:27,568 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=207643.33333333334, ans=0.125 2023-10-10 02:18:32,408 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.51 vs. limit=10.0 2023-10-10 02:18:33,819 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.91 vs. limit=15.0 2023-10-10 02:18:38,534 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=207690.0, ans=0.125 2023-10-10 02:18:39,392 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=207690.0, ans=0.07 2023-10-10 02:18:46,293 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.34 vs. limit=15.0 2023-10-10 02:19:01,057 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=207783.33333333334, ans=0.0 2023-10-10 02:19:03,122 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.387e+02 1.839e+02 2.013e+02 2.281e+02 3.580e+02, threshold=4.025e+02, percent-clipped=0.0 2023-10-10 02:19:08,701 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=207830.0, ans=0.0 2023-10-10 02:19:37,319 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=207923.33333333334, ans=0.2 2023-10-10 02:19:43,045 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=207923.33333333334, ans=0.1 2023-10-10 02:19:47,492 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=207970.0, ans=0.0 2023-10-10 02:19:57,247 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.72 vs. 
limit=15.0 2023-10-10 02:20:18,783 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=208063.33333333334, ans=0.2 2023-10-10 02:20:39,394 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=208156.66666666666, ans=0.2 2023-10-10 02:20:46,426 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=208203.33333333334, ans=0.125 2023-10-10 02:21:05,402 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.430e+02 1.757e+02 2.049e+02 2.459e+02 4.060e+02, threshold=4.098e+02, percent-clipped=1.0 2023-10-10 02:21:27,143 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=208343.33333333334, ans=0.2 2023-10-10 02:21:34,259 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=208390.0, ans=0.2 2023-10-10 02:21:36,535 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.04 vs. limit=22.5 2023-10-10 02:21:39,326 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=208390.0, ans=0.0 2023-10-10 02:21:52,574 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=208436.66666666666, ans=0.0 2023-10-10 02:22:09,489 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 02:22:13,327 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=208530.0, ans=0.1 2023-10-10 02:23:01,208 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.363e+02 1.679e+02 1.833e+02 2.154e+02 2.925e+02, threshold=3.667e+02, percent-clipped=0.0 2023-10-10 02:23:07,059 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 02:23:08,601 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=208763.33333333334, ans=0.125 2023-10-10 02:23:09,672 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=208763.33333333334, ans=0.1 2023-10-10 02:23:15,460 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=208763.33333333334, ans=0.2 2023-10-10 02:23:31,940 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.73 vs. 
limit=15.0 2023-10-10 02:23:32,522 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=208856.66666666666, ans=0.0 2023-10-10 02:23:40,752 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=208903.33333333334, ans=0.5 2023-10-10 02:23:47,463 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=208903.33333333334, ans=0.125 2023-10-10 02:24:04,882 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.37 vs. limit=22.5 2023-10-10 02:24:05,849 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=208996.66666666666, ans=0.07 2023-10-10 02:24:06,881 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=208996.66666666666, ans=0.1 2023-10-10 02:24:06,979 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=208996.66666666666, ans=0.0 2023-10-10 02:24:17,922 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=209043.33333333334, ans=0.2 2023-10-10 02:24:34,364 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=209090.0, ans=0.125 2023-10-10 02:24:37,327 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=1.97 vs. limit=15.0 2023-10-10 02:24:52,778 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.412e+02 1.699e+02 1.847e+02 2.039e+02 2.847e+02, threshold=3.695e+02, percent-clipped=0.0 2023-10-10 02:25:07,521 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.95 vs. limit=15.0 2023-10-10 02:25:28,781 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=209370.0, ans=0.125 2023-10-10 02:25:48,798 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=209416.66666666666, ans=0.07 2023-10-10 02:25:55,126 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=209463.33333333334, ans=0.0 2023-10-10 02:26:10,561 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=209510.0, ans=0.1 2023-10-10 02:26:15,612 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.00 vs. 
limit=15.0 2023-10-10 02:26:43,622 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.378e+02 1.772e+02 2.028e+02 2.261e+02 3.926e+02, threshold=4.055e+02, percent-clipped=1.0 2023-10-10 02:26:46,893 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=209696.66666666666, ans=0.0 2023-10-10 02:26:55,241 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=209696.66666666666, ans=0.0 2023-10-10 02:27:04,669 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 02:27:05,631 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=209743.33333333334, ans=0.125 2023-10-10 02:27:20,192 INFO [train.py:1031] (3/4) Epoch 4, batch 4000, loss[loss=0.24, simple_loss=0.2872, pruned_loss=0.09635, over 11902.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.3268, pruned_loss=0.08519, over 28387206.46 frames. ], batch size: 440, lr: 1.03e-02, grad_scale: 32.0 2023-10-10 02:27:21,906 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=209836.66666666666, ans=0.125 2023-10-10 02:27:22,090 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.21 vs. limit=15.0 2023-10-10 02:27:25,878 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.34 vs. limit=15.0 2023-10-10 02:27:27,255 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.29 vs. limit=6.0 2023-10-10 02:27:34,667 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.86 vs. limit=15.0 2023-10-10 02:27:40,879 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.67 vs. limit=15.0 2023-10-10 02:27:54,206 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=209930.0, ans=0.1 2023-10-10 02:27:59,484 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.47 vs. limit=15.0 2023-10-10 02:28:05,316 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=210023.33333333334, ans=0.125 2023-10-10 02:28:09,307 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.45 vs. limit=22.5 2023-10-10 02:28:14,298 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=210023.33333333334, ans=0.0 2023-10-10 02:28:16,889 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.57 vs. 
limit=15.0 2023-10-10 02:28:30,981 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=210116.66666666666, ans=0.125 2023-10-10 02:28:30,983 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=210116.66666666666, ans=0.2 2023-10-10 02:28:31,082 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=210116.66666666666, ans=0.0 2023-10-10 02:28:34,521 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=210116.66666666666, ans=0.1 2023-10-10 02:28:37,268 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.465e+02 1.752e+02 2.040e+02 2.309e+02 3.400e+02, threshold=4.080e+02, percent-clipped=0.0 2023-10-10 02:29:23,853 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 02:29:25,081 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.03 vs. limit=6.0 2023-10-10 02:29:34,206 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=210396.66666666666, ans=0.125 2023-10-10 02:30:06,905 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=210490.0, ans=0.0 2023-10-10 02:30:12,612 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=210536.66666666666, ans=0.0 2023-10-10 02:30:26,516 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=210583.33333333334, ans=0.0 2023-10-10 02:30:32,601 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.495e+02 1.866e+02 2.079e+02 2.472e+02 4.433e+02, threshold=4.159e+02, percent-clipped=1.0 2023-10-10 02:30:39,506 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=210630.0, ans=0.0 2023-10-10 02:31:21,147 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.35 vs. 
limit=12.0 2023-10-10 02:31:24,337 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=210770.0, ans=0.125 2023-10-10 02:31:34,449 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 02:31:41,215 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=210816.66666666666, ans=0.05 2023-10-10 02:31:43,318 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=210863.33333333334, ans=0.1 2023-10-10 02:31:52,567 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=210863.33333333334, ans=0.125 2023-10-10 02:32:05,250 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=210910.0, ans=0.0 2023-10-10 02:32:18,197 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.63 vs. limit=10.0 2023-10-10 02:32:27,841 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=211003.33333333334, ans=0.125 2023-10-10 02:32:32,526 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=211050.0, ans=0.125 2023-10-10 02:32:38,952 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.430e+02 1.689e+02 1.894e+02 2.080e+02 3.126e+02, threshold=3.788e+02, percent-clipped=0.0 2023-10-10 02:32:42,152 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=211096.66666666666, ans=0.2 2023-10-10 02:32:44,437 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=211096.66666666666, ans=0.0 2023-10-10 02:32:47,278 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=211096.66666666666, ans=0.2 2023-10-10 02:32:49,764 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=211096.66666666666, ans=0.125 2023-10-10 02:32:53,457 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=211143.33333333334, ans=0.1 2023-10-10 02:33:02,506 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.55 vs. limit=15.0 2023-10-10 02:33:27,007 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=211283.33333333334, ans=0.0 2023-10-10 02:33:29,342 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=211283.33333333334, ans=0.125 2023-10-10 02:33:32,933 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.14 vs. 
limit=12.0 2023-10-10 02:33:42,638 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=211330.0, ans=0.1 2023-10-10 02:33:43,614 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=211330.0, ans=0.125 2023-10-10 02:34:04,467 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=211423.33333333334, ans=0.125 2023-10-10 02:34:11,235 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=211470.0, ans=0.125 2023-10-10 02:34:12,090 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=211470.0, ans=0.0 2023-10-10 02:34:19,348 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 02:34:29,819 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.510e+02 1.902e+02 2.143e+02 2.545e+02 3.766e+02, threshold=4.287e+02, percent-clipped=0.0 2023-10-10 02:34:45,455 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.37 vs. limit=15.0 2023-10-10 02:34:49,026 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=211610.0, ans=0.1 2023-10-10 02:34:53,032 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=211610.0, ans=0.1 2023-10-10 02:34:56,885 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.28 vs. limit=15.0 2023-10-10 02:34:56,934 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.48 vs. limit=15.0 2023-10-10 02:35:00,528 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=211656.66666666666, ans=0.125 2023-10-10 02:35:12,481 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=211703.33333333334, ans=0.05 2023-10-10 02:35:16,131 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=211750.0, ans=0.0 2023-10-10 02:35:41,660 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=211843.33333333334, ans=0.125 2023-10-10 02:35:44,920 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=211843.33333333334, ans=0.0 2023-10-10 02:35:54,839 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=211890.0, ans=0.1 2023-10-10 02:36:11,791 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=211936.66666666666, ans=0.0 2023-10-10 02:36:16,176 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=14.15 vs. 
limit=22.5 2023-10-10 02:36:19,543 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=211983.33333333334, ans=0.125 2023-10-10 02:36:28,554 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.392e+02 1.797e+02 2.105e+02 2.350e+02 3.467e+02, threshold=4.210e+02, percent-clipped=0.0 2023-10-10 02:36:42,846 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=5.22 vs. limit=12.0 2023-10-10 02:37:03,881 INFO [train.py:1031] (3/4) Epoch 4, batch 4500, loss[loss=0.2176, simple_loss=0.2951, pruned_loss=0.07005, over 16512.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3273, pruned_loss=0.08506, over 29402523.52 frames. ], batch size: 50, lr: 1.02e-02, grad_scale: 32.0 2023-10-10 02:37:08,627 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=212170.0, ans=0.1 2023-10-10 02:37:17,753 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.20 vs. limit=15.0 2023-10-10 02:37:20,008 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=212216.66666666666, ans=0.1 2023-10-10 02:37:53,951 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=212356.66666666666, ans=0.0 2023-10-10 02:38:08,067 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.66 vs. limit=6.0 2023-10-10 02:38:14,655 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.462e+02 1.752e+02 1.966e+02 2.322e+02 3.324e+02, threshold=3.932e+02, percent-clipped=0.0 2023-10-10 02:38:23,096 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.37 vs. limit=22.5 2023-10-10 02:38:26,981 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=212543.33333333334, ans=0.1 2023-10-10 02:38:27,026 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=212543.33333333334, ans=0.0 2023-10-10 02:38:34,880 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=212543.33333333334, ans=0.0 2023-10-10 02:38:35,761 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=212543.33333333334, ans=0.125 2023-10-10 02:38:44,858 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=212590.0, ans=0.0 2023-10-10 02:38:49,965 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.82 vs. 
limit=15.0 2023-10-10 02:39:06,304 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=212683.33333333334, ans=0.125 2023-10-10 02:39:06,392 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=212683.33333333334, ans=0.0 2023-10-10 02:39:21,629 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.73 vs. limit=10.0 2023-10-10 02:39:36,307 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=212823.33333333334, ans=0.0 2023-10-10 02:39:57,952 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.527e+02 1.869e+02 2.079e+02 2.382e+02 3.819e+02, threshold=4.157e+02, percent-clipped=0.0 2023-10-10 02:40:00,225 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=212963.33333333334, ans=0.1 2023-10-10 02:40:02,414 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=212963.33333333334, ans=0.125 2023-10-10 02:40:31,850 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=213056.66666666666, ans=0.1 2023-10-10 02:40:37,198 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=213103.33333333334, ans=0.04949747468305833 2023-10-10 02:40:42,471 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=213103.33333333334, ans=0.0 2023-10-10 02:40:45,037 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=213150.0, ans=0.1 2023-10-10 02:41:05,196 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=213243.33333333334, ans=0.125 2023-10-10 02:41:08,554 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 02:41:16,401 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.31 vs. limit=12.0 2023-10-10 02:41:23,583 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=213290.0, ans=0.1 2023-10-10 02:41:35,181 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=213383.33333333334, ans=0.2 2023-10-10 02:41:45,003 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.307e+02 1.697e+02 1.996e+02 2.270e+02 3.294e+02, threshold=3.993e+02, percent-clipped=0.0 2023-10-10 02:42:01,840 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=213476.66666666666, ans=0.07 2023-10-10 02:42:04,333 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.36 vs. 
limit=15.0 2023-10-10 02:42:07,627 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=213523.33333333334, ans=0.125 2023-10-10 02:42:17,826 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=213570.0, ans=0.125 2023-10-10 02:42:21,366 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.05 vs. limit=12.0 2023-10-10 02:42:22,196 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.79 vs. limit=22.5 2023-10-10 02:42:23,661 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=213570.0, ans=0.0 2023-10-10 02:42:45,391 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=213663.33333333334, ans=0.2 2023-10-10 02:42:51,012 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=213663.33333333334, ans=0.0 2023-10-10 02:42:53,431 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=213710.0, ans=0.125 2023-10-10 02:42:56,879 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=213710.0, ans=0.0 2023-10-10 02:43:02,958 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=213710.0, ans=0.1 2023-10-10 02:43:11,518 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=213756.66666666666, ans=0.0 2023-10-10 02:43:16,939 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=213756.66666666666, ans=0.1 2023-10-10 02:43:17,845 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=213803.33333333334, ans=0.0 2023-10-10 02:43:24,892 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=213803.33333333334, ans=0.2 2023-10-10 02:43:30,350 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=213850.0, ans=0.125 2023-10-10 02:43:30,361 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=213850.0, ans=0.0 2023-10-10 02:43:37,908 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.403e+02 1.779e+02 2.094e+02 2.547e+02 3.384e+02, threshold=4.188e+02, percent-clipped=0.0 2023-10-10 02:43:52,164 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=213943.33333333334, ans=0.125 2023-10-10 02:43:59,155 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=213943.33333333334, ans=0.125 2023-10-10 02:44:05,728 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=213990.0, ans=0.025 2023-10-10 02:44:13,301 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=214036.66666666666, ans=0.125 2023-10-10 02:44:39,912 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=214130.0, ans=0.05 2023-10-10 02:44:39,925 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=214130.0, ans=0.125 2023-10-10 02:44:46,179 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=214176.66666666666, ans=0.1 2023-10-10 02:44:53,947 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.52 vs. limit=15.0 2023-10-10 02:44:59,956 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=214223.33333333334, ans=0.2 2023-10-10 02:45:06,670 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=214223.33333333334, ans=0.1 2023-10-10 02:45:08,337 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=214223.33333333334, ans=0.125 2023-10-10 02:45:21,178 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=214316.66666666666, ans=0.05 2023-10-10 02:45:21,407 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.55 vs. limit=15.0 2023-10-10 02:45:21,962 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=214316.66666666666, ans=0.0 2023-10-10 02:45:24,640 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.24 vs. limit=15.0 2023-10-10 02:45:29,599 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.363e+02 1.683e+02 1.835e+02 2.046e+02 3.436e+02, threshold=3.670e+02, percent-clipped=0.0 2023-10-10 02:45:30,820 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=214363.33333333334, ans=0.1 2023-10-10 02:45:40,994 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=214363.33333333334, ans=0.1 2023-10-10 02:45:49,360 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.44 vs. limit=15.0 2023-10-10 02:45:51,628 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=214410.0, ans=0.1 2023-10-10 02:45:58,177 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=214456.66666666666, ans=0.125 2023-10-10 02:46:04,303 INFO [train.py:1031] (3/4) Epoch 4, batch 5000, loss[loss=0.2771, simple_loss=0.3397, pruned_loss=0.1073, over 16063.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3267, pruned_loss=0.08491, over 30171503.86 frames. 
], batch size: 297, lr: 1.02e-02, grad_scale: 32.0 2023-10-10 02:46:09,068 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.75 vs. limit=15.0 2023-10-10 02:46:15,688 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.04 vs. limit=15.0 2023-10-10 02:46:23,259 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=214550.0, ans=0.2 2023-10-10 02:47:19,242 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.443e+02 1.706e+02 1.872e+02 2.127e+02 3.239e+02, threshold=3.744e+02, percent-clipped=0.0 2023-10-10 02:47:20,045 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.33 vs. limit=15.0 2023-10-10 02:47:29,423 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=214830.0, ans=0.125 2023-10-10 02:47:30,339 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=214830.0, ans=0.0 2023-10-10 02:48:02,596 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=214970.0, ans=0.0 2023-10-10 02:48:12,911 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.67 vs. limit=15.0 2023-10-10 02:48:20,186 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=215063.33333333334, ans=0.125 2023-10-10 02:49:09,770 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.51 vs. 
limit=15.0 2023-10-10 02:49:12,038 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.275e+02 1.784e+02 2.019e+02 2.257e+02 2.952e+02, threshold=4.039e+02, percent-clipped=0.0 2023-10-10 02:49:18,365 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=215296.66666666666, ans=0.0 2023-10-10 02:49:25,665 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=215343.33333333334, ans=0.2 2023-10-10 02:49:25,697 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=215343.33333333334, ans=0.0 2023-10-10 02:49:32,016 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=215343.33333333334, ans=0.125 2023-10-10 02:49:55,058 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=215483.33333333334, ans=0.125 2023-10-10 02:50:04,663 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=215483.33333333334, ans=0.125 2023-10-10 02:50:08,264 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=215530.0, ans=0.125 2023-10-10 02:50:48,660 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=215670.0, ans=0.125 2023-10-10 02:51:03,577 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.551e+02 1.821e+02 2.072e+02 2.469e+02 3.778e+02, threshold=4.145e+02, percent-clipped=0.0 2023-10-10 02:51:16,696 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=215810.0, ans=0.2 2023-10-10 02:51:17,083 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.80 vs. limit=6.0 2023-10-10 02:51:22,692 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=215810.0, ans=0.09899494936611666 2023-10-10 02:51:51,024 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=215950.0, ans=0.0 2023-10-10 02:51:54,907 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=215950.0, ans=0.0 2023-10-10 02:52:10,058 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=215996.66666666666, ans=0.2 2023-10-10 02:52:13,310 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.34 vs. 
limit=22.5 2023-10-10 02:52:19,710 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=216043.33333333334, ans=0.07 2023-10-10 02:52:20,707 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=216043.33333333334, ans=0.5 2023-10-10 02:52:23,559 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=216043.33333333334, ans=0.125 2023-10-10 02:52:25,922 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=216043.33333333334, ans=0.125 2023-10-10 02:52:37,906 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=216136.66666666666, ans=0.125 2023-10-10 02:52:55,688 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=216183.33333333334, ans=0.1 2023-10-10 02:52:59,163 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.424e+02 1.706e+02 1.947e+02 2.292e+02 4.215e+02, threshold=3.894e+02, percent-clipped=1.0 2023-10-10 02:53:12,292 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.66 vs. limit=15.0 2023-10-10 02:53:47,590 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=216416.66666666666, ans=0.125 2023-10-10 02:53:51,383 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=216463.33333333334, ans=0.1 2023-10-10 02:53:56,945 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.52 vs. limit=22.5 2023-10-10 02:54:09,619 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=216510.0, ans=0.0 2023-10-10 02:54:45,150 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.425e+02 1.724e+02 1.893e+02 2.092e+02 3.210e+02, threshold=3.786e+02, percent-clipped=0.0 2023-10-10 02:54:56,978 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=216743.33333333334, ans=0.125 2023-10-10 02:55:13,710 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=216790.0, ans=0.0 2023-10-10 02:55:19,781 INFO [train.py:1031] (3/4) Epoch 4, batch 5500, loss[loss=0.2457, simple_loss=0.3239, pruned_loss=0.08381, over 16888.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.3261, pruned_loss=0.08443, over 30755678.15 frames. 
], batch size: 110, lr: 1.01e-02, grad_scale: 32.0 2023-10-10 02:55:25,738 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=216836.66666666666, ans=0.0 2023-10-10 02:55:42,946 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=216930.0, ans=0.0 2023-10-10 02:55:50,379 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=216976.66666666666, ans=0.125 2023-10-10 02:56:15,672 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.76 vs. limit=10.0 2023-10-10 02:56:29,823 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=217116.66666666666, ans=0.1 2023-10-10 02:56:33,203 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.390e+02 1.755e+02 1.980e+02 2.233e+02 3.178e+02, threshold=3.961e+02, percent-clipped=0.0 2023-10-10 02:56:41,500 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=217163.33333333334, ans=0.125 2023-10-10 02:56:50,077 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=217210.0, ans=0.07 2023-10-10 02:56:53,150 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=217210.0, ans=0.0 2023-10-10 02:57:05,229 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=217256.66666666666, ans=0.125 2023-10-10 02:57:06,049 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=217303.33333333334, ans=0.125 2023-10-10 02:57:09,838 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.51 vs. limit=15.0 2023-10-10 02:57:34,011 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=217396.66666666666, ans=0.1 2023-10-10 02:57:39,998 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=217443.33333333334, ans=0.125 2023-10-10 02:57:55,585 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=217490.0, ans=0.0 2023-10-10 02:58:02,880 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=217536.66666666666, ans=0.1 2023-10-10 02:58:22,118 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=217583.33333333334, ans=0.0 2023-10-10 02:58:22,717 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.465e+02 1.939e+02 2.178e+02 2.611e+02 3.893e+02, threshold=4.357e+02, percent-clipped=0.0 2023-10-10 02:58:33,518 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.56 vs. 
limit=15.0 2023-10-10 02:58:54,421 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=217723.33333333334, ans=0.0 2023-10-10 02:58:58,645 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=217770.0, ans=0.0 2023-10-10 02:59:31,872 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=217910.0, ans=0.0 2023-10-10 02:59:52,653 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=217956.66666666666, ans=0.07 2023-10-10 02:59:57,361 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=218003.33333333334, ans=0.125 2023-10-10 02:59:59,290 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=218003.33333333334, ans=0.0 2023-10-10 03:00:03,338 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.29 vs. limit=22.5 2023-10-10 03:00:14,435 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.477e+02 1.721e+02 1.922e+02 2.159e+02 3.016e+02, threshold=3.845e+02, percent-clipped=0.0 2023-10-10 03:00:19,565 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.81 vs. limit=22.5 2023-10-10 03:00:32,336 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=218143.33333333334, ans=0.125 2023-10-10 03:00:51,710 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.48 vs. limit=15.0 2023-10-10 03:01:14,300 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=218330.0, ans=0.125 2023-10-10 03:01:24,012 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.96 vs. limit=15.0 2023-10-10 03:01:32,863 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=218376.66666666666, ans=0.05 2023-10-10 03:01:39,223 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=218423.33333333334, ans=0.2 2023-10-10 03:01:44,246 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=218470.0, ans=0.125 2023-10-10 03:02:05,577 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.324e+02 1.738e+02 1.942e+02 2.193e+02 3.532e+02, threshold=3.884e+02, percent-clipped=0.0 2023-10-10 03:02:18,676 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.26 vs. 
limit=15.0 2023-10-10 03:02:22,442 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=218610.0, ans=0.1 2023-10-10 03:02:46,259 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=218703.33333333334, ans=0.125 2023-10-10 03:02:48,670 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=218703.33333333334, ans=0.2 2023-10-10 03:03:02,954 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=218796.66666666666, ans=0.1 2023-10-10 03:03:12,907 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.11 vs. limit=15.0 2023-10-10 03:03:18,242 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=218843.33333333334, ans=0.0 2023-10-10 03:03:34,513 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=218890.0, ans=0.0 2023-10-10 03:03:42,276 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.73 vs. limit=22.5 2023-10-10 03:03:45,316 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.05 vs. limit=15.0 2023-10-10 03:03:59,387 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.355e+02 1.773e+02 1.967e+02 2.314e+02 3.785e+02, threshold=3.935e+02, percent-clipped=0.0 2023-10-10 03:04:11,160 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=219076.66666666666, ans=0.1 2023-10-10 03:04:14,929 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=219076.66666666666, ans=0.125 2023-10-10 03:04:16,594 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.64 vs. limit=6.0 2023-10-10 03:04:23,092 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=219123.33333333334, ans=0.2 2023-10-10 03:04:33,214 INFO [train.py:1031] (3/4) Epoch 4, batch 6000, loss[loss=0.2434, simple_loss=0.3116, pruned_loss=0.0876, over 15601.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3261, pruned_loss=0.08435, over 31234021.55 frames. 
], batch size: 35, lr: 1.00e-02, grad_scale: 32.0 2023-10-10 03:04:43,149 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=219216.66666666666, ans=0.2 2023-10-10 03:04:58,010 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=219263.33333333334, ans=0.0 2023-10-10 03:05:17,027 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_na.min_abs, batch_count=219356.66666666666, ans=0.02 2023-10-10 03:05:17,076 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=219356.66666666666, ans=0.05 2023-10-10 03:05:28,267 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=219403.33333333334, ans=0.125 2023-10-10 03:05:48,697 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=219450.0, ans=0.2 2023-10-10 03:05:50,307 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.520e+02 1.744e+02 1.931e+02 2.397e+02 3.189e+02, threshold=3.862e+02, percent-clipped=0.0 2023-10-10 03:05:53,700 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=219496.66666666666, ans=0.125 2023-10-10 03:06:15,063 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=219590.0, ans=0.0 2023-10-10 03:06:20,509 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=219590.0, ans=0.1 2023-10-10 03:06:37,594 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=219683.33333333334, ans=0.125 2023-10-10 03:06:44,592 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=219730.0, ans=0.125 2023-10-10 03:06:56,452 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=219776.66666666666, ans=0.125 2023-10-10 03:07:10,788 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=219823.33333333334, ans=0.125 2023-10-10 03:07:13,267 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=219823.33333333334, ans=0.125 2023-10-10 03:07:18,015 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=219870.0, ans=0.0 2023-10-10 03:07:40,119 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.518e+02 1.759e+02 2.070e+02 2.389e+02 3.227e+02, threshold=4.139e+02, percent-clipped=0.0 2023-10-10 03:07:53,057 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=220010.0, ans=0.2 2023-10-10 03:08:11,411 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=220103.33333333334, ans=0.2 2023-10-10 03:08:20,437 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=220103.33333333334, ans=0.125 2023-10-10 03:08:52,893 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=220290.0, ans=0.0 2023-10-10 03:08:53,925 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=220290.0, ans=0.025 2023-10-10 03:09:19,435 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=220383.33333333334, ans=0.2 2023-10-10 03:09:20,632 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=220383.33333333334, ans=0.0 2023-10-10 03:09:23,262 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=220383.33333333334, ans=0.04949747468305833 2023-10-10 03:09:28,131 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.387e+02 1.792e+02 1.935e+02 2.323e+02 3.493e+02, threshold=3.871e+02, percent-clipped=0.0 2023-10-10 03:09:29,983 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=220430.0, ans=0.125 2023-10-10 03:09:56,870 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 03:10:06,086 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=220570.0, ans=0.0 2023-10-10 03:10:24,590 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=220663.33333333334, ans=0.0 2023-10-10 03:10:32,092 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=220663.33333333334, ans=0.2 2023-10-10 03:10:32,147 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=220663.33333333334, ans=0.125 2023-10-10 03:10:40,793 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=220710.0, ans=0.125 2023-10-10 03:10:49,691 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.63 vs. limit=15.0 2023-10-10 03:10:59,090 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.80 vs. limit=12.0 2023-10-10 03:11:03,715 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=220803.33333333334, ans=15.0 2023-10-10 03:11:09,789 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=220803.33333333334, ans=0.125 2023-10-10 03:11:18,650 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=220850.0, ans=0.1 2023-10-10 03:11:23,056 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.78 vs. limit=15.0 2023-10-10 03:11:29,332 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.532e+02 1.874e+02 2.188e+02 2.548e+02 4.224e+02, threshold=4.376e+02, percent-clipped=2.0 2023-10-10 03:11:34,277 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.81 vs. 
limit=15.0 2023-10-10 03:11:38,368 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=220943.33333333334, ans=0.125 2023-10-10 03:12:18,390 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=221083.33333333334, ans=0.2 2023-10-10 03:12:39,124 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=221176.66666666666, ans=0.1 2023-10-10 03:12:50,630 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=221223.33333333334, ans=0.0 2023-10-10 03:13:18,647 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.04 vs. limit=15.0 2023-10-10 03:13:20,870 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.284e+02 1.696e+02 1.860e+02 2.222e+02 3.148e+02, threshold=3.720e+02, percent-clipped=0.0 2023-10-10 03:13:46,226 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=221456.66666666666, ans=0.125 2023-10-10 03:13:50,895 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=221456.66666666666, ans=0.125 2023-10-10 03:13:52,416 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=221456.66666666666, ans=0.125 2023-10-10 03:13:56,997 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=221456.66666666666, ans=0.07 2023-10-10 03:13:59,004 INFO [train.py:1031] (3/4) Epoch 4, batch 6500, loss[loss=0.2449, simple_loss=0.3246, pruned_loss=0.08263, over 16815.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3263, pruned_loss=0.0843, over 31578787.65 frames. ], batch size: 116, lr: 1.00e-02, grad_scale: 16.0 2023-10-10 03:14:37,373 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=221643.33333333334, ans=0.0 2023-10-10 03:14:42,286 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=221643.33333333334, ans=0.0 2023-10-10 03:15:06,739 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=221736.66666666666, ans=0.1 2023-10-10 03:15:18,751 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.35 vs. limit=15.0 2023-10-10 03:15:19,298 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=221783.33333333334, ans=0.0 2023-10-10 03:15:22,194 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=221783.33333333334, ans=0.125 2023-10-10 03:15:30,880 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.532e+02 1.798e+02 2.057e+02 2.350e+02 2.984e+02, threshold=4.115e+02, percent-clipped=0.0 2023-10-10 03:15:38,277 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.22 vs. 
limit=15.0 2023-10-10 03:15:56,939 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=221923.33333333334, ans=0.1 2023-10-10 03:16:00,712 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=221923.33333333334, ans=0.0 2023-10-10 03:16:27,507 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 03:16:48,651 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.13 vs. limit=22.5 2023-10-10 03:16:49,896 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=222156.66666666666, ans=0.125 2023-10-10 03:17:20,013 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.395e+02 1.811e+02 2.033e+02 2.390e+02 3.612e+02, threshold=4.065e+02, percent-clipped=0.0 2023-10-10 03:17:24,498 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=222296.66666666666, ans=0.125 2023-10-10 03:17:42,242 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=222390.0, ans=0.125 2023-10-10 03:17:56,310 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.57 vs. limit=12.0 2023-10-10 03:17:57,017 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=222436.66666666666, ans=0.2 2023-10-10 03:18:01,499 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=14.91 vs. limit=22.5 2023-10-10 03:18:13,086 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=222530.0, ans=0.125 2023-10-10 03:18:34,866 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=222623.33333333334, ans=0.0 2023-10-10 03:18:43,809 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_na.min_abs, batch_count=222670.0, ans=0.02 2023-10-10 03:18:50,191 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.91 vs. 
limit=10.0 2023-10-10 03:18:53,371 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=222670.0, ans=0.1 2023-10-10 03:18:54,309 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=222670.0, ans=0.125 2023-10-10 03:19:11,594 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.302e+02 1.820e+02 2.036e+02 2.562e+02 3.931e+02, threshold=4.073e+02, percent-clipped=0.0 2023-10-10 03:19:11,853 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=222763.33333333334, ans=0.015 2023-10-10 03:19:12,900 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=222763.33333333334, ans=0.125 2023-10-10 03:19:17,750 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=222763.33333333334, ans=0.125 2023-10-10 03:19:27,627 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=222810.0, ans=0.0 2023-10-10 03:19:32,024 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=222856.66666666666, ans=0.125 2023-10-10 03:19:38,021 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.41 vs. limit=15.0 2023-10-10 03:19:41,711 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=222856.66666666666, ans=0.1 2023-10-10 03:19:42,993 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=222856.66666666666, ans=0.125 2023-10-10 03:20:11,441 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=222950.0, ans=0.0 2023-10-10 03:20:17,156 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-10 03:20:30,355 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=223043.33333333334, ans=0.125 2023-10-10 03:20:46,958 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=223090.0, ans=0.95 2023-10-10 03:20:46,984 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=223090.0, ans=0.125 2023-10-10 03:20:52,689 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.03 vs. limit=15.0 2023-10-10 03:21:01,676 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.53 vs. limit=15.0 2023-10-10 03:21:18,836 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.297e+02 1.715e+02 1.972e+02 2.494e+02 3.331e+02, threshold=3.944e+02, percent-clipped=0.0 2023-10-10 03:21:23,635 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.38 vs. 
limit=15.0 2023-10-10 03:21:26,929 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.35 vs. limit=15.0 2023-10-10 03:21:29,763 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=223276.66666666666, ans=0.0 2023-10-10 03:21:31,875 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.51 vs. limit=15.0 2023-10-10 03:21:36,456 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=223276.66666666666, ans=0.125 2023-10-10 03:21:37,777 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.07 vs. limit=15.0 2023-10-10 03:21:44,316 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=223323.33333333334, ans=0.125 2023-10-10 03:22:02,551 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=223416.66666666666, ans=0.125 2023-10-10 03:22:24,835 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=223510.0, ans=0.0 2023-10-10 03:22:40,345 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=223556.66666666666, ans=0.0 2023-10-10 03:22:43,170 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten.whitening_limit, batch_count=223556.66666666666, ans=15.0 2023-10-10 03:23:08,895 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.430e+02 1.723e+02 1.980e+02 2.256e+02 3.852e+02, threshold=3.959e+02, percent-clipped=0.0 2023-10-10 03:23:11,464 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.09 vs. limit=22.5 2023-10-10 03:23:13,923 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=223696.66666666666, ans=0.2 2023-10-10 03:23:17,713 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=223743.33333333334, ans=0.125 2023-10-10 03:23:24,707 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=223743.33333333334, ans=0.0 2023-10-10 03:23:34,099 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=223790.0, ans=0.125 2023-10-10 03:23:37,858 INFO [train.py:1031] (3/4) Epoch 4, batch 7000, loss[loss=0.2481, simple_loss=0.3319, pruned_loss=0.08213, over 16675.00 frames. ], tot_loss[loss=0.247, simple_loss=0.3263, pruned_loss=0.08383, over 31873297.75 frames. ], batch size: 202, lr: 9.95e-03, grad_scale: 16.0 2023-10-10 03:24:19,053 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=223976.66666666666, ans=0.125 2023-10-10 03:24:45,742 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.43 vs. 
limit=22.5 2023-10-10 03:24:48,191 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=224070.0, ans=0.2 2023-10-10 03:25:04,586 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.552e+02 2.036e+02 2.329e+02 2.722e+02 3.424e+02, threshold=4.659e+02, percent-clipped=0.0 2023-10-10 03:25:05,142 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=224163.33333333334, ans=22.5 2023-10-10 03:25:05,193 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.44 vs. limit=10.0 2023-10-10 03:25:07,353 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=224163.33333333334, ans=0.0 2023-10-10 03:25:18,630 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=224210.0, ans=0.1 2023-10-10 03:25:22,894 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=224256.66666666666, ans=0.125 2023-10-10 03:25:29,054 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=224256.66666666666, ans=0.125 2023-10-10 03:25:39,473 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.54 vs. limit=22.5 2023-10-10 03:26:08,960 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=224443.33333333334, ans=0.04949747468305833 2023-10-10 03:26:10,795 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=224443.33333333334, ans=0.125 2023-10-10 03:26:12,472 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=224443.33333333334, ans=0.025 2023-10-10 03:26:15,620 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.36 vs. 
limit=15.0 2023-10-10 03:26:31,669 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=224536.66666666666, ans=0.0 2023-10-10 03:26:39,780 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=224536.66666666666, ans=0.2 2023-10-10 03:26:55,562 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.470e+02 1.783e+02 2.126e+02 2.446e+02 3.555e+02, threshold=4.252e+02, percent-clipped=0.0 2023-10-10 03:26:57,797 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=224630.0, ans=0.1 2023-10-10 03:26:59,210 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=224630.0, ans=0.125 2023-10-10 03:27:09,844 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=224676.66666666666, ans=0.07 2023-10-10 03:28:04,350 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=224863.33333333334, ans=0.1 2023-10-10 03:28:31,207 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=224956.66666666666, ans=0.125 2023-10-10 03:28:38,996 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=225003.33333333334, ans=0.125 2023-10-10 03:28:48,859 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.64 vs. limit=6.0 2023-10-10 03:28:57,972 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=225050.0, ans=0.125 2023-10-10 03:29:02,901 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.443e+02 1.735e+02 1.944e+02 2.216e+02 3.505e+02, threshold=3.889e+02, percent-clipped=0.0 2023-10-10 03:29:07,435 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=225096.66666666666, ans=0.1 2023-10-10 03:29:08,501 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=225096.66666666666, ans=0.0 2023-10-10 03:29:29,805 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 03:29:44,419 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=225283.33333333334, ans=0.1 2023-10-10 03:30:00,160 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=225330.0, ans=0.2 2023-10-10 03:30:13,032 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=225376.66666666666, ans=0.0 2023-10-10 03:30:31,402 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=225470.0, ans=0.125 2023-10-10 03:30:42,133 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=225516.66666666666, ans=0.125 2023-10-10 03:30:58,768 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.424e+02 1.725e+02 1.965e+02 2.229e+02 
3.831e+02, threshold=3.931e+02, percent-clipped=0.0 2023-10-10 03:31:01,261 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.37 vs. limit=15.0 2023-10-10 03:31:02,687 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.98 vs. limit=15.0 2023-10-10 03:31:07,528 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=225610.0, ans=0.1 2023-10-10 03:31:12,869 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.38 vs. limit=6.0 2023-10-10 03:32:19,588 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=225890.0, ans=0.125 2023-10-10 03:32:42,414 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=226030.0, ans=0.0 2023-10-10 03:32:44,503 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=226030.0, ans=0.0 2023-10-10 03:32:45,982 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.563e+02 1.808e+02 1.976e+02 2.266e+02 3.262e+02, threshold=3.952e+02, percent-clipped=0.0 2023-10-10 03:32:48,965 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.29 vs. limit=15.0 2023-10-10 03:33:12,369 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=226123.33333333334, ans=10.0 2023-10-10 03:33:18,303 INFO [train.py:1031] (3/4) Epoch 4, batch 7500, loss[loss=0.2155, simple_loss=0.3009, pruned_loss=0.06503, over 15787.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3262, pruned_loss=0.08399, over 32078599.08 frames. 
], batch size: 43, lr: 9.90e-03, grad_scale: 32.0 2023-10-10 03:33:31,411 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=226216.66666666666, ans=0.125 2023-10-10 03:33:34,891 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=226216.66666666666, ans=0.125 2023-10-10 03:33:58,172 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=226310.0, ans=0.125 2023-10-10 03:33:58,883 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=226310.0, ans=0.0 2023-10-10 03:34:07,093 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 03:34:22,818 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=226403.33333333334, ans=0.125 2023-10-10 03:34:38,008 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.480e+02 1.862e+02 2.261e+02 2.638e+02 3.951e+02, threshold=4.522e+02, percent-clipped=0.0 2023-10-10 03:34:47,712 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=226543.33333333334, ans=0.125 2023-10-10 03:34:48,856 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=226543.33333333334, ans=0.0 2023-10-10 03:35:08,642 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=226636.66666666666, ans=0.2 2023-10-10 03:35:10,553 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=226636.66666666666, ans=0.125 2023-10-10 03:35:13,647 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=226636.66666666666, ans=0.0 2023-10-10 03:36:11,153 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=226823.33333333334, ans=0.0 2023-10-10 03:36:14,958 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=226870.0, ans=0.0 2023-10-10 03:36:34,035 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=226916.66666666666, ans=0.125 2023-10-10 03:36:39,569 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.453e+02 1.749e+02 1.966e+02 2.272e+02 3.303e+02, threshold=3.932e+02, percent-clipped=0.0 2023-10-10 03:36:40,764 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=226963.33333333334, ans=0.0 2023-10-10 03:36:55,312 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=4.02 vs. 
limit=12.0 2023-10-10 03:37:00,258 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=227056.66666666666, ans=0.125 2023-10-10 03:37:12,972 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=227103.33333333334, ans=0.1 2023-10-10 03:37:23,538 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=227150.0, ans=0.125 2023-10-10 03:37:27,392 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=227150.0, ans=0.1 2023-10-10 03:37:27,462 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=227150.0, ans=0.125 2023-10-10 03:37:28,483 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=227150.0, ans=0.015 2023-10-10 03:37:32,807 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=12.94 vs. limit=22.5 2023-10-10 03:37:33,545 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=227196.66666666666, ans=0.2 2023-10-10 03:37:37,656 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=227196.66666666666, ans=0.0 2023-10-10 03:37:38,936 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.90 vs. limit=15.0 2023-10-10 03:37:46,169 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=227243.33333333334, ans=0.2 2023-10-10 03:37:47,607 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=12.79 vs. 
limit=15.0 2023-10-10 03:37:49,207 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=227243.33333333334, ans=0.125 2023-10-10 03:37:51,001 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=227243.33333333334, ans=0.1 2023-10-10 03:37:54,355 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=227290.0, ans=0.0 2023-10-10 03:38:20,044 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=227383.33333333334, ans=0.1 2023-10-10 03:38:30,125 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.439e+02 1.760e+02 1.933e+02 2.192e+02 2.980e+02, threshold=3.867e+02, percent-clipped=0.0 2023-10-10 03:38:35,493 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=227476.66666666666, ans=0.1 2023-10-10 03:38:38,759 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=227476.66666666666, ans=0.0 2023-10-10 03:38:45,685 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=227476.66666666666, ans=0.125 2023-10-10 03:38:50,857 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=227523.33333333334, ans=0.0 2023-10-10 03:38:51,731 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=227523.33333333334, ans=0.2 2023-10-10 03:39:01,693 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.64 vs. limit=15.0 2023-10-10 03:39:18,861 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.40 vs. limit=15.0 2023-10-10 03:39:19,106 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.70 vs. 
limit=15.0 2023-10-10 03:39:32,187 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=227663.33333333334, ans=0.125 2023-10-10 03:39:33,953 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=227663.33333333334, ans=0.0 2023-10-10 03:39:45,151 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=227710.0, ans=0.125 2023-10-10 03:39:50,867 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=227756.66666666666, ans=0.125 2023-10-10 03:39:52,814 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=227756.66666666666, ans=0.125 2023-10-10 03:40:04,667 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=227803.33333333334, ans=0.0 2023-10-10 03:40:04,684 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=227803.33333333334, ans=0.125 2023-10-10 03:40:06,128 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=227803.33333333334, ans=0.0 2023-10-10 03:40:07,166 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=227803.33333333334, ans=0.2 2023-10-10 03:40:10,390 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=227850.0, ans=0.07 2023-10-10 03:40:24,695 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=227896.66666666666, ans=0.2 2023-10-10 03:40:27,007 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.465e+02 1.776e+02 1.999e+02 2.186e+02 3.042e+02, threshold=3.998e+02, percent-clipped=0.0 2023-10-10 03:40:48,874 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=227990.0, ans=0.2 2023-10-10 03:40:51,537 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.49 vs. limit=15.0 2023-10-10 03:40:52,486 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.87 vs. limit=12.0 2023-10-10 03:40:59,220 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=228036.66666666666, ans=0.0 2023-10-10 03:41:22,768 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=228130.0, ans=0.0 2023-10-10 03:41:33,572 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.63 vs. limit=10.0 2023-10-10 03:41:40,111 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=228176.66666666666, ans=0.125 2023-10-10 03:41:53,581 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.42 vs. 
limit=22.5 2023-10-10 03:42:11,190 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=228316.66666666666, ans=0.0 2023-10-10 03:42:20,632 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.453e+02 1.704e+02 1.926e+02 2.230e+02 2.916e+02, threshold=3.853e+02, percent-clipped=0.0 2023-10-10 03:42:31,575 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=228410.0, ans=0.2 2023-10-10 03:42:48,546 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=228456.66666666666, ans=0.1 2023-10-10 03:42:51,981 INFO [train.py:1031] (3/4) Epoch 4, batch 8000, loss[loss=0.2194, simple_loss=0.2803, pruned_loss=0.07922, over 12622.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3252, pruned_loss=0.0832, over 32224269.38 frames. ], batch size: 440, lr: 9.85e-03, grad_scale: 32.0 2023-10-10 03:42:52,211 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=228503.33333333334, ans=0.125 2023-10-10 03:43:06,772 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=228550.0, ans=0.2 2023-10-10 03:43:10,367 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=228550.0, ans=0.125 2023-10-10 03:43:15,533 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=228596.66666666666, ans=0.0 2023-10-10 03:43:18,315 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=228596.66666666666, ans=0.125 2023-10-10 03:43:34,635 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=228690.0, ans=0.1 2023-10-10 03:43:37,726 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=228690.0, ans=0.0 2023-10-10 03:43:38,782 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=4.17 vs. limit=12.0 2023-10-10 03:43:44,140 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=228736.66666666666, ans=0.125 2023-10-10 03:44:00,011 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=228783.33333333334, ans=0.2 2023-10-10 03:44:09,112 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=228830.0, ans=0.125 2023-10-10 03:44:11,504 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.411e+02 1.799e+02 2.032e+02 2.377e+02 3.642e+02, threshold=4.065e+02, percent-clipped=0.0 2023-10-10 03:44:12,919 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.01 vs. 
limit=6.0 2023-10-10 03:44:15,394 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=228876.66666666666, ans=0.125 2023-10-10 03:44:20,574 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=228876.66666666666, ans=0.1 2023-10-10 03:44:25,074 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 03:44:31,673 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=228923.33333333334, ans=0.0 2023-10-10 03:44:58,785 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=229063.33333333334, ans=10.0 2023-10-10 03:45:13,931 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.11 vs. limit=15.0 2023-10-10 03:45:19,820 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.03 vs. limit=15.0 2023-10-10 03:45:26,307 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.88 vs. limit=6.0 2023-10-10 03:45:43,409 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=229203.33333333334, ans=0.1 2023-10-10 03:46:10,748 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.456e+02 1.852e+02 2.105e+02 2.392e+02 3.838e+02, threshold=4.209e+02, percent-clipped=0.0 2023-10-10 03:46:29,092 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=229390.0, ans=0.125 2023-10-10 03:46:40,068 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=229390.0, ans=0.0 2023-10-10 03:46:44,430 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.71 vs. 
limit=22.5 2023-10-10 03:46:45,079 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=229436.66666666666, ans=0.0 2023-10-10 03:46:46,905 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=229436.66666666666, ans=0.125 2023-10-10 03:46:57,683 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=229483.33333333334, ans=0.05 2023-10-10 03:46:59,004 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=229483.33333333334, ans=0.125 2023-10-10 03:47:17,331 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 03:47:18,360 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=229576.66666666666, ans=0.2 2023-10-10 03:47:32,513 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=229623.33333333334, ans=0.05 2023-10-10 03:47:34,560 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=229623.33333333334, ans=0.2 2023-10-10 03:47:37,626 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=229670.0, ans=0.125 2023-10-10 03:47:58,172 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.32 vs. limit=22.5 2023-10-10 03:47:59,568 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=229716.66666666666, ans=0.125 2023-10-10 03:48:06,092 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.358e+02 1.789e+02 1.998e+02 2.469e+02 3.648e+02, threshold=3.996e+02, percent-clipped=0.0 2023-10-10 03:48:34,286 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=229903.33333333334, ans=0.125 2023-10-10 03:48:45,333 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=229903.33333333334, ans=0.1 2023-10-10 03:48:49,326 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=229950.0, ans=0.2 2023-10-10 03:48:56,576 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 03:49:04,607 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.09 vs. limit=15.0 2023-10-10 03:49:28,351 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=230090.0, ans=0.1 2023-10-10 03:49:30,771 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.43 vs. 
limit=15.0 2023-10-10 03:49:44,030 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=230183.33333333334, ans=0.125 2023-10-10 03:49:53,640 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=230230.0, ans=0.1 2023-10-10 03:49:58,675 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.413e+02 1.818e+02 2.057e+02 2.273e+02 3.432e+02, threshold=4.114e+02, percent-clipped=0.0 2023-10-10 03:50:44,941 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=230416.66666666666, ans=0.125 2023-10-10 03:50:48,193 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.55 vs. limit=6.0 2023-10-10 03:50:48,661 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=230416.66666666666, ans=0.0 2023-10-10 03:51:00,632 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=230463.33333333334, ans=0.125 2023-10-10 03:51:00,639 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=230463.33333333334, ans=0.125 2023-10-10 03:51:16,105 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=230556.66666666666, ans=0.0 2023-10-10 03:51:24,491 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=230556.66666666666, ans=0.125 2023-10-10 03:51:27,330 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=230603.33333333334, ans=0.125 2023-10-10 03:51:34,023 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 03:51:35,955 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=230603.33333333334, ans=0.125 2023-10-10 03:51:54,048 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.463e+02 1.747e+02 2.004e+02 2.385e+02 3.225e+02, threshold=4.009e+02, percent-clipped=0.0 2023-10-10 03:52:11,265 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=230743.33333333334, ans=0.125 2023-10-10 03:52:15,948 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=230790.0, ans=0.125 2023-10-10 03:52:17,571 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=230790.0, ans=0.125 2023-10-10 03:52:17,749 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.93 vs. limit=15.0 2023-10-10 03:52:26,927 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.68 vs. limit=15.0 2023-10-10 03:52:28,056 INFO [train.py:1031] (3/4) Epoch 4, batch 8500, loss[loss=0.251, simple_loss=0.3293, pruned_loss=0.08641, over 16840.00 frames. 
], tot_loss[loss=0.2455, simple_loss=0.3251, pruned_loss=0.08294, over 32335833.33 frames. ], batch size: 98, lr: 9.80e-03, grad_scale: 32.0 2023-10-10 03:52:28,392 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=230836.66666666666, ans=0.07 2023-10-10 03:52:29,244 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=230836.66666666666, ans=0.125 2023-10-10 03:53:04,245 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=230976.66666666666, ans=0.0 2023-10-10 03:53:06,171 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=230976.66666666666, ans=0.2 2023-10-10 03:53:36,474 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=231116.66666666666, ans=0.125 2023-10-10 03:53:41,069 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.01 vs. limit=15.0 2023-10-10 03:53:54,988 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.393e+02 2.024e+02 2.239e+02 2.613e+02 4.171e+02, threshold=4.477e+02, percent-clipped=2.0 2023-10-10 03:53:59,818 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=231210.0, ans=0.125 2023-10-10 03:54:05,614 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=231210.0, ans=0.2 2023-10-10 03:54:12,862 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.83 vs. limit=15.0 2023-10-10 03:54:53,082 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=231396.66666666666, ans=0.0 2023-10-10 03:54:54,124 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=231396.66666666666, ans=0.1 2023-10-10 03:54:59,311 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=231396.66666666666, ans=0.1 2023-10-10 03:55:31,278 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=231536.66666666666, ans=0.125 2023-10-10 03:55:32,257 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=231536.66666666666, ans=0.0 2023-10-10 03:55:56,803 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.336e+02 1.708e+02 1.890e+02 2.186e+02 3.030e+02, threshold=3.779e+02, percent-clipped=0.0 2023-10-10 03:55:57,763 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=231630.0, ans=0.0 2023-10-10 03:56:08,188 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=231676.66666666666, ans=0.0 2023-10-10 03:56:31,780 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.63 vs. 
limit=6.0 2023-10-10 03:56:35,058 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=231770.0, ans=0.0 2023-10-10 03:56:37,065 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.19 vs. limit=15.0 2023-10-10 03:56:47,495 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.94 vs. limit=15.0 2023-10-10 03:56:54,204 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=231863.33333333334, ans=0.0 2023-10-10 03:57:11,551 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=231910.0, ans=0.1 2023-10-10 03:57:11,821 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.57 vs. limit=22.5 2023-10-10 03:57:20,433 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=231956.66666666666, ans=0.1 2023-10-10 03:57:58,694 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.355e+02 1.651e+02 1.866e+02 2.101e+02 3.312e+02, threshold=3.732e+02, percent-clipped=0.0 2023-10-10 03:58:20,376 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.62 vs. limit=12.0 2023-10-10 03:58:23,352 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.66 vs. limit=15.0 2023-10-10 03:58:26,879 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=232236.66666666666, ans=0.1 2023-10-10 03:58:29,283 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.90 vs. limit=15.0 2023-10-10 03:58:37,192 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=232283.33333333334, ans=0.0 2023-10-10 03:58:57,129 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=232376.66666666666, ans=0.125 2023-10-10 03:59:03,601 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=232376.66666666666, ans=0.0 2023-10-10 03:59:34,478 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=232516.66666666666, ans=0.125 2023-10-10 03:59:35,441 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 03:59:42,808 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.423e+02 1.765e+02 1.915e+02 2.217e+02 3.216e+02, threshold=3.830e+02, percent-clipped=0.0 2023-10-10 03:59:45,205 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=232563.33333333334, ans=0.125 2023-10-10 04:00:00,895 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.08 vs. 
limit=15.0 2023-10-10 04:00:13,204 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.83 vs. limit=15.0 2023-10-10 04:00:30,683 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=232750.0, ans=0.0 2023-10-10 04:00:34,999 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.65 vs. limit=15.0 2023-10-10 04:00:46,175 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.57 vs. limit=15.0 2023-10-10 04:00:59,439 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=232890.0, ans=0.07 2023-10-10 04:01:00,745 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.86 vs. limit=22.5 2023-10-10 04:01:03,866 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=232936.66666666666, ans=0.125 2023-10-10 04:01:06,921 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=232936.66666666666, ans=0.1 2023-10-10 04:01:23,559 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.63 vs. limit=6.0 2023-10-10 04:01:28,832 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=233030.0, ans=0.125 2023-10-10 04:01:32,394 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=233030.0, ans=0.125 2023-10-10 04:01:33,466 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=233030.0, ans=0.125 2023-10-10 04:01:34,035 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.471e+02 1.886e+02 2.150e+02 2.647e+02 3.755e+02, threshold=4.299e+02, percent-clipped=0.0 2023-10-10 04:01:36,308 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.31 vs. limit=15.0 2023-10-10 04:01:36,961 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=233030.0, ans=0.125 2023-10-10 04:01:55,428 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.13 vs. limit=22.5 2023-10-10 04:01:59,211 INFO [train.py:1031] (3/4) Epoch 4, batch 9000, loss[loss=0.2539, simple_loss=0.3036, pruned_loss=0.1021, over 12305.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3239, pruned_loss=0.08204, over 32458269.59 frames. 
], batch size: 440, lr: 9.75e-03, grad_scale: 16.0 2023-10-10 04:02:11,253 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=233216.66666666666, ans=0.125 2023-10-10 04:02:43,816 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=233356.66666666666, ans=0.125 2023-10-10 04:02:46,858 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=233356.66666666666, ans=15.0 2023-10-10 04:02:49,611 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=233356.66666666666, ans=0.1 2023-10-10 04:02:52,479 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=233403.33333333334, ans=0.0 2023-10-10 04:03:12,069 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=233496.66666666666, ans=0.125 2023-10-10 04:03:14,776 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=233496.66666666666, ans=0.125 2023-10-10 04:03:17,851 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.451e+02 1.810e+02 2.071e+02 2.319e+02 3.313e+02, threshold=4.143e+02, percent-clipped=0.0 2023-10-10 04:03:21,957 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=233543.33333333334, ans=10.0 2023-10-10 04:03:23,254 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.64 vs. limit=15.0 2023-10-10 04:03:33,133 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.75 vs. limit=6.0 2023-10-10 04:04:25,862 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=233823.33333333334, ans=0.04949747468305833 2023-10-10 04:04:38,055 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.95 vs. limit=22.5 2023-10-10 04:04:41,550 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.00 vs. 
limit=10.0 2023-10-10 04:04:43,203 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=233870.0, ans=0.1 2023-10-10 04:05:04,387 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.480e+02 1.852e+02 2.032e+02 2.274e+02 3.789e+02, threshold=4.063e+02, percent-clipped=0.0 2023-10-10 04:05:18,466 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=234056.66666666666, ans=0.125 2023-10-10 04:05:22,166 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=234056.66666666666, ans=0.125 2023-10-10 04:05:24,689 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=234056.66666666666, ans=0.125 2023-10-10 04:05:45,836 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=234150.0, ans=0.125 2023-10-10 04:05:48,163 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=234196.66666666666, ans=0.0 2023-10-10 04:05:49,609 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=234196.66666666666, ans=0.1 2023-10-10 04:05:52,488 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=234196.66666666666, ans=0.0 2023-10-10 04:06:02,047 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=234243.33333333334, ans=0.07 2023-10-10 04:06:05,518 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=234243.33333333334, ans=0.1 2023-10-10 04:06:06,890 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.94 vs. limit=15.0 2023-10-10 04:06:19,781 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 04:06:22,521 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=234336.66666666666, ans=0.0 2023-10-10 04:06:29,064 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.94 vs. 
limit=10.0 2023-10-10 04:06:42,739 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=234430.0, ans=0.07 2023-10-10 04:06:47,909 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.529e+02 1.850e+02 2.060e+02 2.509e+02 3.758e+02, threshold=4.120e+02, percent-clipped=0.0 2023-10-10 04:07:08,626 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=234523.33333333334, ans=0.1 2023-10-10 04:07:29,392 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=234616.66666666666, ans=0.0 2023-10-10 04:07:29,532 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=234616.66666666666, ans=0.125 2023-10-10 04:07:30,699 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=234616.66666666666, ans=0.0 2023-10-10 04:07:46,868 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=234710.0, ans=0.0 2023-10-10 04:07:52,038 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=234710.0, ans=0.0 2023-10-10 04:07:52,067 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=234710.0, ans=0.0 2023-10-10 04:07:54,001 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=234710.0, ans=0.2 2023-10-10 04:08:07,520 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=234756.66666666666, ans=0.125 2023-10-10 04:08:12,154 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=234803.33333333334, ans=0.125 2023-10-10 04:08:13,142 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=234803.33333333334, ans=0.0 2023-10-10 04:08:14,957 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=234803.33333333334, ans=0.025 2023-10-10 04:08:37,161 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.64 vs. limit=22.5 2023-10-10 04:08:47,532 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.493e+02 1.885e+02 2.132e+02 2.388e+02 3.548e+02, threshold=4.263e+02, percent-clipped=0.0 2023-10-10 04:08:54,003 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 04:09:05,867 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=234990.0, ans=0.0 2023-10-10 04:09:16,234 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.30 vs. 
limit=15.0 2023-10-10 04:09:26,316 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 04:09:38,236 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=235130.0, ans=0.125 2023-10-10 04:09:47,303 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=235176.66666666666, ans=0.0 2023-10-10 04:09:58,321 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.01 vs. limit=10.0 2023-10-10 04:09:58,774 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=235223.33333333334, ans=0.0 2023-10-10 04:10:04,284 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=235223.33333333334, ans=0.125 2023-10-10 04:10:05,954 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 04:10:19,126 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=235270.0, ans=0.0 2023-10-10 04:10:39,189 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.530e+02 1.815e+02 1.961e+02 2.330e+02 3.458e+02, threshold=3.921e+02, percent-clipped=0.0 2023-10-10 04:10:43,664 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=235410.0, ans=0.125 2023-10-10 04:10:59,402 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=235456.66666666666, ans=0.125 2023-10-10 04:11:06,806 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=235456.66666666666, ans=0.1 2023-10-10 04:11:08,337 INFO [train.py:1031] (3/4) Epoch 4, batch 9500, loss[loss=0.2802, simple_loss=0.3553, pruned_loss=0.1025, over 16106.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3244, pruned_loss=0.08214, over 32530957.24 frames. ], batch size: 296, lr: 9.70e-03, grad_scale: 32.0 2023-10-10 04:11:09,930 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.40 vs. 
limit=22.5 2023-10-10 04:11:13,503 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=235503.33333333334, ans=0.0 2023-10-10 04:11:17,874 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=235550.0, ans=0.125 2023-10-10 04:11:19,646 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=235550.0, ans=0.125 2023-10-10 04:11:24,416 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=235550.0, ans=10.0 2023-10-10 04:11:53,253 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=235690.0, ans=0.2 2023-10-10 04:11:58,609 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=235690.0, ans=0.125 2023-10-10 04:11:59,383 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=235690.0, ans=0.125 2023-10-10 04:12:05,522 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=235736.66666666666, ans=0.125 2023-10-10 04:12:30,790 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.332e+02 1.840e+02 2.026e+02 2.333e+02 4.133e+02, threshold=4.052e+02, percent-clipped=1.0 2023-10-10 04:12:41,936 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.66 vs. limit=22.5 2023-10-10 04:13:14,794 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.51 vs. limit=10.0 2023-10-10 04:13:15,430 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=236016.66666666666, ans=0.0 2023-10-10 04:13:29,822 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=236110.0, ans=0.125 2023-10-10 04:13:37,109 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=236110.0, ans=0.2 2023-10-10 04:13:59,721 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.87 vs. limit=15.0 2023-10-10 04:14:01,726 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=8.12 vs. limit=15.0 2023-10-10 04:14:12,167 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=236250.0, ans=0.125 2023-10-10 04:14:14,294 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=236250.0, ans=0.2 2023-10-10 04:14:17,380 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.94 vs. 
limit=22.5 2023-10-10 04:14:19,211 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=236296.66666666666, ans=0.1 2023-10-10 04:14:22,622 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.524e+02 1.773e+02 2.111e+02 2.427e+02 4.167e+02, threshold=4.221e+02, percent-clipped=1.0 2023-10-10 04:14:24,244 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.99 vs. limit=15.0 2023-10-10 04:14:27,133 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.37 vs. limit=22.5 2023-10-10 04:15:06,927 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.10 vs. limit=15.0 2023-10-10 04:15:20,419 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=236530.0, ans=0.125 2023-10-10 04:15:25,425 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=236576.66666666666, ans=0.125 2023-10-10 04:15:29,124 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=236576.66666666666, ans=0.125 2023-10-10 04:15:50,564 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=236670.0, ans=0.1 2023-10-10 04:16:06,568 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.75 vs. limit=15.0 2023-10-10 04:16:11,648 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.379e+02 1.787e+02 1.935e+02 2.241e+02 2.810e+02, threshold=3.871e+02, percent-clipped=0.0 2023-10-10 04:16:29,114 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=236856.66666666666, ans=0.0 2023-10-10 04:16:29,409 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.76 vs. 
limit=15.0 2023-10-10 04:16:43,590 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=236903.33333333334, ans=0.125 2023-10-10 04:17:00,281 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=236996.66666666666, ans=0.125 2023-10-10 04:17:03,021 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=236996.66666666666, ans=0.2 2023-10-10 04:17:15,055 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=237043.33333333334, ans=0.1 2023-10-10 04:17:22,550 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=237090.0, ans=0.125 2023-10-10 04:17:47,669 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=237183.33333333334, ans=0.125 2023-10-10 04:17:51,865 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=237183.33333333334, ans=0.0 2023-10-10 04:18:01,123 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.399e+02 1.719e+02 1.865e+02 2.083e+02 3.014e+02, threshold=3.729e+02, percent-clipped=0.0 2023-10-10 04:18:29,607 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=237370.0, ans=0.1 2023-10-10 04:18:31,834 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.07 vs. limit=22.5 2023-10-10 04:18:48,544 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=237463.33333333334, ans=0.1 2023-10-10 04:18:56,703 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.04 vs. limit=12.0 2023-10-10 04:19:11,172 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=237556.66666666666, ans=0.125 2023-10-10 04:19:29,047 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=237603.33333333334, ans=10.0 2023-10-10 04:19:46,838 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=237696.66666666666, ans=0.125 2023-10-10 04:19:47,838 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=237696.66666666666, ans=0.0 2023-10-10 04:19:48,410 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.377e+02 1.767e+02 1.940e+02 2.292e+02 3.148e+02, threshold=3.881e+02, percent-clipped=0.0 2023-10-10 04:20:04,875 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=237790.0, ans=0.0 2023-10-10 04:20:04,911 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=237790.0, ans=0.125 2023-10-10 04:20:12,415 INFO [train.py:1031] (3/4) Epoch 4, batch 10000, loss[loss=0.2414, simple_loss=0.3254, pruned_loss=0.07877, over 16871.00 frames. 
], tot_loss[loss=0.2434, simple_loss=0.3232, pruned_loss=0.08174, over 32573947.51 frames. ], batch size: 130, lr: 9.66e-03, grad_scale: 32.0 2023-10-10 04:20:24,331 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=237883.33333333334, ans=0.0 2023-10-10 04:20:25,113 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=237883.33333333334, ans=0.1 2023-10-10 04:20:28,209 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 04:20:37,707 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=237930.0, ans=0.125 2023-10-10 04:20:37,847 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=237930.0, ans=0.0 2023-10-10 04:20:53,814 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=237976.66666666666, ans=0.05 2023-10-10 04:21:11,048 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=238070.0, ans=0.1 2023-10-10 04:21:38,923 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.450e+02 1.748e+02 1.967e+02 2.205e+02 3.261e+02, threshold=3.934e+02, percent-clipped=0.0 2023-10-10 04:21:39,930 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=238163.33333333334, ans=0.1 2023-10-10 04:21:42,243 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=238210.0, ans=0.125 2023-10-10 04:21:45,984 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=238210.0, ans=0.125 2023-10-10 04:21:46,850 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=238210.0, ans=0.125 2023-10-10 04:21:51,563 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=238210.0, ans=0.0 2023-10-10 04:22:24,152 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=238350.0, ans=0.2 2023-10-10 04:22:25,926 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=238350.0, ans=0.125 2023-10-10 04:22:26,909 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=238350.0, ans=0.1 2023-10-10 04:22:33,178 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=238396.66666666666, ans=0.0 2023-10-10 04:22:47,348 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=238443.33333333334, ans=0.0 2023-10-10 04:22:54,190 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=238490.0, ans=0.2 2023-10-10 04:23:14,714 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=238583.33333333334, ans=0.125 2023-10-10 04:23:25,577 INFO 
[optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.363e+02 1.768e+02 1.963e+02 2.245e+02 3.262e+02, threshold=3.926e+02, percent-clipped=0.0 2023-10-10 04:23:58,031 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=238770.0, ans=0.0 2023-10-10 04:24:04,870 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=238770.0, ans=0.125 2023-10-10 04:24:17,885 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=238863.33333333334, ans=0.035 2023-10-10 04:24:21,970 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=238863.33333333334, ans=0.1 2023-10-10 04:24:43,762 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=238956.66666666666, ans=0.125 2023-10-10 04:25:03,293 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=239050.0, ans=0.125 2023-10-10 04:25:03,343 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=239050.0, ans=0.0 2023-10-10 04:25:04,525 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.54 vs. limit=15.0 2023-10-10 04:25:08,975 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=239050.0, ans=0.125 2023-10-10 04:25:09,311 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.16 vs. limit=6.0 2023-10-10 04:25:09,324 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.99 vs. limit=22.5 2023-10-10 04:25:21,878 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=239096.66666666666, ans=0.04949747468305833 2023-10-10 04:25:22,810 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.450e+02 1.793e+02 1.904e+02 2.266e+02 3.101e+02, threshold=3.808e+02, percent-clipped=0.0 2023-10-10 04:25:29,957 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=1.79 vs. limit=15.0 2023-10-10 04:25:44,283 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=239190.0, ans=0.125 2023-10-10 04:25:49,510 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.97 vs. 
limit=22.5 2023-10-10 04:25:54,709 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=239236.66666666666, ans=0.125 2023-10-10 04:26:09,205 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=239283.33333333334, ans=0.09899494936611666 2023-10-10 04:26:52,182 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=239470.0, ans=0.125 2023-10-10 04:27:02,597 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=239516.66666666666, ans=0.0 2023-10-10 04:27:05,412 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.95 vs. limit=22.5 2023-10-10 04:27:17,528 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=239563.33333333334, ans=0.2 2023-10-10 04:27:19,129 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.291e+02 1.664e+02 1.873e+02 2.084e+02 2.779e+02, threshold=3.747e+02, percent-clipped=0.0 2023-10-10 04:27:49,674 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=239703.33333333334, ans=0.125 2023-10-10 04:28:07,791 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.22 vs. limit=22.5 2023-10-10 04:28:35,942 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=239890.0, ans=0.125 2023-10-10 04:28:53,006 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=239983.33333333334, ans=0.125 2023-10-10 04:29:12,230 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.442e+02 1.877e+02 2.043e+02 2.342e+02 3.951e+02, threshold=4.085e+02, percent-clipped=1.0 2023-10-10 04:29:20,387 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.20 vs. limit=22.5 2023-10-10 04:29:26,881 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=240123.33333333334, ans=0.125 2023-10-10 04:29:35,619 INFO [train.py:1031] (3/4) Epoch 4, batch 10500, loss[loss=0.3335, simple_loss=0.379, pruned_loss=0.144, over 15578.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3233, pruned_loss=0.08166, over 32612407.44 frames. ], batch size: 350, lr: 9.61e-03, grad_scale: 32.0 2023-10-10 04:29:35,961 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=240170.0, ans=0.1 2023-10-10 04:29:37,049 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.84 vs. 
limit=22.5 2023-10-10 04:29:39,210 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=240170.0, ans=0.1 2023-10-10 04:29:47,993 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 04:30:17,055 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=240356.66666666666, ans=0.125 2023-10-10 04:30:47,577 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.41 vs. limit=10.0 2023-10-10 04:31:05,301 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=240496.66666666666, ans=0.0 2023-10-10 04:31:06,775 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.346e+02 1.791e+02 1.942e+02 2.135e+02 3.272e+02, threshold=3.884e+02, percent-clipped=0.0 2023-10-10 04:31:17,997 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.44 vs. limit=15.0 2023-10-10 04:31:57,754 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=240730.0, ans=0.07 2023-10-10 04:32:47,545 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=240916.66666666666, ans=0.125 2023-10-10 04:33:02,143 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.383e+02 1.801e+02 1.986e+02 2.288e+02 3.141e+02, threshold=3.973e+02, percent-clipped=0.0 2023-10-10 04:33:05,418 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=241010.0, ans=0.1 2023-10-10 04:33:06,883 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=241010.0, ans=0.0 2023-10-10 04:33:22,222 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=241056.66666666666, ans=0.04949747468305833 2023-10-10 04:33:35,853 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.95 vs. limit=15.0 2023-10-10 04:33:37,854 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=241103.33333333334, ans=0.125 2023-10-10 04:33:44,342 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=241150.0, ans=0.015 2023-10-10 04:33:45,651 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.66 vs. 
limit=15.0 2023-10-10 04:33:51,116 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=241196.66666666666, ans=0.5 2023-10-10 04:33:52,712 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=241196.66666666666, ans=0.0 2023-10-10 04:33:54,481 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=241196.66666666666, ans=0.2 2023-10-10 04:33:57,983 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=19.88 vs. limit=15.0 2023-10-10 04:34:05,741 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.68 vs. limit=22.5 2023-10-10 04:34:06,395 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=241243.33333333334, ans=0.2 2023-10-10 04:34:24,545 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=241336.66666666666, ans=10.0 2023-10-10 04:34:24,793 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.69 vs. limit=6.0 2023-10-10 04:34:25,384 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=241336.66666666666, ans=0.0 2023-10-10 04:34:28,515 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.43 vs. limit=15.0 2023-10-10 04:34:35,480 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=241383.33333333334, ans=0.1 2023-10-10 04:34:56,225 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.499e+02 1.761e+02 1.994e+02 2.358e+02 4.255e+02, threshold=3.987e+02, percent-clipped=1.0 2023-10-10 04:34:57,468 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=241476.66666666666, ans=0.125 2023-10-10 04:34:58,600 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.63 vs. limit=15.0 2023-10-10 04:35:16,951 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.05 vs. limit=12.0 2023-10-10 04:35:18,558 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=241570.0, ans=0.125 2023-10-10 04:35:28,562 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=241616.66666666666, ans=0.0 2023-10-10 04:35:57,577 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=241710.0, ans=0.07 2023-10-10 04:35:58,728 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.89 vs. 
limit=15.0 2023-10-10 04:36:05,593 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=241756.66666666666, ans=0.125 2023-10-10 04:36:16,934 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=241803.33333333334, ans=0.0 2023-10-10 04:36:44,499 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.50 vs. limit=22.5 2023-10-10 04:36:44,813 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.369e+02 1.867e+02 2.195e+02 2.598e+02 3.792e+02, threshold=4.389e+02, percent-clipped=0.0 2023-10-10 04:36:54,978 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=241943.33333333334, ans=0.125 2023-10-10 04:37:08,658 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=242036.66666666666, ans=0.2 2023-10-10 04:37:19,850 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.01 vs. limit=22.5 2023-10-10 04:37:31,992 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=242130.0, ans=0.125 2023-10-10 04:37:50,436 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=242176.66666666666, ans=0.125 2023-10-10 04:38:24,242 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=242363.33333333334, ans=0.0 2023-10-10 04:38:25,448 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=242363.33333333334, ans=0.0 2023-10-10 04:38:34,113 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.328e+02 1.783e+02 2.042e+02 2.535e+02 4.634e+02, threshold=4.084e+02, percent-clipped=1.0 2023-10-10 04:38:43,972 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=242410.0, ans=0.1 2023-10-10 04:38:48,222 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.83 vs. limit=12.0 2023-10-10 04:38:48,975 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=242456.66666666666, ans=0.0 2023-10-10 04:38:53,093 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.12 vs. limit=15.0 2023-10-10 04:38:57,390 INFO [train.py:1031] (3/4) Epoch 4, batch 11000, loss[loss=0.2832, simple_loss=0.3588, pruned_loss=0.1038, over 16879.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3234, pruned_loss=0.08164, over 32678803.51 frames. 
], batch size: 165, lr: 9.56e-03, grad_scale: 32.0 2023-10-10 04:39:00,322 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=242503.33333333334, ans=0.125 2023-10-10 04:39:06,342 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=242503.33333333334, ans=0.125 2023-10-10 04:39:15,574 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=242550.0, ans=0.125 2023-10-10 04:39:32,322 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=242643.33333333334, ans=0.1 2023-10-10 04:40:15,331 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=242830.0, ans=0.125 2023-10-10 04:40:23,591 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.30 vs. limit=15.0 2023-10-10 04:40:24,654 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=242830.0, ans=0.1 2023-10-10 04:40:25,347 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.407e+02 1.803e+02 2.098e+02 2.590e+02 4.045e+02, threshold=4.197e+02, percent-clipped=0.0 2023-10-10 04:40:56,707 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=242970.0, ans=0.95 2023-10-10 04:41:07,323 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.19 vs. limit=15.0 2023-10-10 04:41:07,940 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=243016.66666666666, ans=10.0 2023-10-10 04:41:08,009 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=243016.66666666666, ans=0.125 2023-10-10 04:41:09,724 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=243016.66666666666, ans=0.2 2023-10-10 04:41:14,647 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.51 vs. limit=15.0 2023-10-10 04:41:22,144 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=243063.33333333334, ans=0.5 2023-10-10 04:41:22,167 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=243063.33333333334, ans=0.2 2023-10-10 04:41:50,447 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.74 vs. 
limit=15.0 2023-10-10 04:42:18,211 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=243296.66666666666, ans=0.0 2023-10-10 04:42:28,877 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.423e+02 1.899e+02 2.243e+02 2.653e+02 3.643e+02, threshold=4.486e+02, percent-clipped=0.0 2023-10-10 04:42:37,837 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=243343.33333333334, ans=10.0 2023-10-10 04:42:39,627 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=243390.0, ans=0.07 2023-10-10 04:42:48,688 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=243390.0, ans=0.2 2023-10-10 04:43:01,471 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=243483.33333333334, ans=0.2 2023-10-10 04:43:01,626 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=243483.33333333334, ans=0.1 2023-10-10 04:43:22,421 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=13.98 vs. limit=22.5 2023-10-10 04:43:24,440 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=243576.66666666666, ans=0.0 2023-10-10 04:43:30,095 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=243576.66666666666, ans=0.125 2023-10-10 04:43:41,956 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.74 vs. 
limit=22.5 2023-10-10 04:43:46,414 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=243670.0, ans=0.025 2023-10-10 04:43:58,189 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=243716.66666666666, ans=0.0 2023-10-10 04:44:19,415 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.482e+02 1.811e+02 2.031e+02 2.306e+02 3.256e+02, threshold=4.062e+02, percent-clipped=0.0 2023-10-10 04:44:21,526 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=243810.0, ans=0.125 2023-10-10 04:44:22,324 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=243810.0, ans=0.2 2023-10-10 04:44:25,885 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=243810.0, ans=0.0 2023-10-10 04:44:36,410 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 04:44:44,954 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=243903.33333333334, ans=0.0 2023-10-10 04:44:53,289 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=243903.33333333334, ans=0.125 2023-10-10 04:45:01,320 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.45 vs. limit=6.0 2023-10-10 04:45:04,861 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=243950.0, ans=0.125 2023-10-10 04:45:16,410 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=243996.66666666666, ans=0.125 2023-10-10 04:45:21,298 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=244043.33333333334, ans=0.035 2023-10-10 04:45:22,741 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=244043.33333333334, ans=0.1 2023-10-10 04:45:23,750 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=244043.33333333334, ans=0.0 2023-10-10 04:45:36,993 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.13 vs. limit=15.0 2023-10-10 04:45:37,118 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.07 vs. limit=10.0 2023-10-10 04:45:45,830 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.71 vs. 
limit=15.0 2023-10-10 04:46:07,480 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=244230.0, ans=0.0 2023-10-10 04:46:11,677 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.445e+02 1.701e+02 1.908e+02 2.179e+02 3.314e+02, threshold=3.816e+02, percent-clipped=0.0 2023-10-10 04:46:13,664 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=244276.66666666666, ans=0.025 2023-10-10 04:46:23,221 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=244323.33333333334, ans=0.125 2023-10-10 04:47:02,788 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=244463.33333333334, ans=0.125 2023-10-10 04:47:05,362 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=244463.33333333334, ans=0.125 2023-10-10 04:47:15,951 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=244510.0, ans=0.125 2023-10-10 04:47:39,668 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=244603.33333333334, ans=0.0 2023-10-10 04:47:51,358 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=244650.0, ans=0.0 2023-10-10 04:48:05,886 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.471e+02 1.827e+02 2.026e+02 2.318e+02 3.857e+02, threshold=4.051e+02, percent-clipped=1.0 2023-10-10 04:48:08,187 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=244743.33333333334, ans=0.0 2023-10-10 04:48:24,086 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=244790.0, ans=0.0 2023-10-10 04:48:27,504 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=244836.66666666666, ans=0.125 2023-10-10 04:48:28,156 INFO [train.py:1031] (3/4) Epoch 4, batch 11500, loss[loss=0.2542, simple_loss=0.3428, pruned_loss=0.08279, over 16934.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3232, pruned_loss=0.08151, over 32708189.43 frames. ], batch size: 165, lr: 9.52e-03, grad_scale: 32.0 2023-10-10 04:48:31,117 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=244836.66666666666, ans=0.125 2023-10-10 04:48:31,234 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=244836.66666666666, ans=0.07 2023-10-10 04:48:34,294 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.21 vs. 
limit=22.5 2023-10-10 04:48:40,019 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=244883.33333333334, ans=0.125 2023-10-10 04:48:47,104 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=244883.33333333334, ans=0.125 2023-10-10 04:48:56,615 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=244930.0, ans=0.125 2023-10-10 04:49:44,699 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.67 vs. limit=6.0 2023-10-10 04:49:59,803 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.442e+02 1.764e+02 1.942e+02 2.126e+02 2.780e+02, threshold=3.885e+02, percent-clipped=0.0 2023-10-10 04:50:13,565 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=245256.66666666666, ans=0.125 2023-10-10 04:50:57,976 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=245396.66666666666, ans=0.125 2023-10-10 04:50:59,713 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=245443.33333333334, ans=0.125 2023-10-10 04:51:01,773 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=245443.33333333334, ans=0.0 2023-10-10 04:51:15,818 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=245490.0, ans=0.125 2023-10-10 04:51:18,746 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.93 vs. limit=15.0 2023-10-10 04:51:53,885 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.395e+02 1.770e+02 1.988e+02 2.245e+02 3.333e+02, threshold=3.976e+02, percent-clipped=0.0 2023-10-10 04:52:00,593 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=245676.66666666666, ans=0.125 2023-10-10 04:52:03,561 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=245676.66666666666, ans=0.1 2023-10-10 04:52:36,231 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=245863.33333333334, ans=0.0 2023-10-10 04:52:39,177 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.90 vs. 
limit=15.0 2023-10-10 04:52:43,663 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=245863.33333333334, ans=0.125 2023-10-10 04:52:46,386 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=245910.0, ans=0.1 2023-10-10 04:53:10,201 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=245956.66666666666, ans=0.125 2023-10-10 04:53:17,114 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=246003.33333333334, ans=0.2 2023-10-10 04:53:23,217 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=246003.33333333334, ans=0.125 2023-10-10 04:53:42,513 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=246096.66666666666, ans=0.125 2023-10-10 04:53:44,865 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.83 vs. limit=15.0 2023-10-10 04:53:53,470 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.528e+02 1.861e+02 2.098e+02 2.398e+02 3.266e+02, threshold=4.196e+02, percent-clipped=0.0 2023-10-10 04:54:03,977 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=246143.33333333334, ans=0.125 2023-10-10 04:54:32,257 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=246283.33333333334, ans=0.125 2023-10-10 04:54:33,270 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.07 vs. limit=15.0 2023-10-10 04:54:53,534 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=246376.66666666666, ans=0.09899494936611666 2023-10-10 04:54:58,249 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.23 vs. 
limit=15.0 2023-10-10 04:54:59,000 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=246376.66666666666, ans=0.125 2023-10-10 04:55:01,924 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=246376.66666666666, ans=0.09899494936611666 2023-10-10 04:55:32,883 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=246516.66666666666, ans=0.09899494936611666 2023-10-10 04:55:48,530 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.403e+02 1.845e+02 2.115e+02 2.356e+02 3.431e+02, threshold=4.230e+02, percent-clipped=0.0 2023-10-10 04:55:51,629 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=246610.0, ans=0.1 2023-10-10 04:56:01,349 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=246610.0, ans=0.125 2023-10-10 04:56:13,032 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=246656.66666666666, ans=0.125 2023-10-10 04:56:13,147 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.33 vs. limit=22.5 2023-10-10 04:56:16,107 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=246703.33333333334, ans=0.0 2023-10-10 04:56:39,626 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=246796.66666666666, ans=0.0 2023-10-10 04:56:47,832 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.30 vs. limit=6.0 2023-10-10 04:56:51,157 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=246843.33333333334, ans=0.2 2023-10-10 04:57:21,796 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.12 vs. limit=6.0 2023-10-10 04:57:22,769 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=246983.33333333334, ans=0.125 2023-10-10 04:57:32,202 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=247030.0, ans=0.2 2023-10-10 04:57:42,166 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=247076.66666666666, ans=0.2 2023-10-10 04:57:42,743 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.408e+02 1.758e+02 1.941e+02 2.126e+02 3.214e+02, threshold=3.882e+02, percent-clipped=0.0 2023-10-10 04:57:45,098 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=247076.66666666666, ans=0.1 2023-10-10 04:57:53,837 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.80 vs. 
limit=6.0 2023-10-10 04:57:56,346 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=247123.33333333334, ans=0.2 2023-10-10 04:58:05,062 INFO [train.py:1031] (3/4) Epoch 4, batch 12000, loss[loss=0.2434, simple_loss=0.3261, pruned_loss=0.08033, over 16847.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3229, pruned_loss=0.08104, over 32727293.08 frames. ], batch size: 175, lr: 9.48e-03, grad_scale: 32.0 2023-10-10 04:58:43,808 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=247310.0, ans=0.0 2023-10-10 04:58:44,759 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=247310.0, ans=0.1 2023-10-10 04:58:54,060 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=247356.66666666666, ans=0.2 2023-10-10 04:59:00,708 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=247356.66666666666, ans=10.0 2023-10-10 04:59:00,830 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=247356.66666666666, ans=0.125 2023-10-10 04:59:10,392 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 04:59:30,875 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=247496.66666666666, ans=10.0 2023-10-10 04:59:31,676 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=247496.66666666666, ans=0.0 2023-10-10 04:59:36,242 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=247543.33333333334, ans=0.2 2023-10-10 04:59:36,506 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.14 vs. limit=15.0 2023-10-10 04:59:36,751 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.331e+02 1.741e+02 1.994e+02 2.295e+02 3.182e+02, threshold=3.989e+02, percent-clipped=0.0 2023-10-10 04:59:37,518 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.38 vs. limit=12.0 2023-10-10 04:59:39,445 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.74 vs. limit=22.5 2023-10-10 05:00:04,821 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=247636.66666666666, ans=0.0 2023-10-10 05:00:05,879 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=247636.66666666666, ans=0.125 2023-10-10 05:00:12,937 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.64 vs. 
limit=15.0 2023-10-10 05:00:13,505 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=247683.33333333334, ans=0.125 2023-10-10 05:00:14,678 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.17 vs. limit=12.0 2023-10-10 05:00:20,041 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=247730.0, ans=0.125 2023-10-10 05:00:20,390 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.88 vs. limit=15.0 2023-10-10 05:00:20,831 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=247730.0, ans=0.125 2023-10-10 05:00:23,569 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=247730.0, ans=0.0 2023-10-10 05:00:26,235 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=247730.0, ans=0.125 2023-10-10 05:00:35,684 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=247776.66666666666, ans=0.0 2023-10-10 05:00:40,920 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=247823.33333333334, ans=0.1 2023-10-10 05:00:53,667 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=247870.0, ans=0.125 2023-10-10 05:01:09,206 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.34 vs. limit=10.0 2023-10-10 05:01:21,171 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 05:01:22,875 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.444e+02 1.735e+02 1.975e+02 2.171e+02 3.178e+02, threshold=3.950e+02, percent-clipped=0.0 2023-10-10 05:01:47,403 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=248103.33333333334, ans=0.125 2023-10-10 05:01:49,810 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=248103.33333333334, ans=0.125 2023-10-10 05:02:01,849 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=248150.0, ans=0.04949747468305833 2023-10-10 05:02:05,092 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=248196.66666666666, ans=0.125 2023-10-10 05:02:07,688 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=16.04 vs. 
limit=22.5 2023-10-10 05:02:16,768 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=248243.33333333334, ans=0.0 2023-10-10 05:02:19,409 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=248243.33333333334, ans=0.125 2023-10-10 05:02:23,393 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=248243.33333333334, ans=0.125 2023-10-10 05:02:23,723 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.38 vs. limit=15.0 2023-10-10 05:02:33,188 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=248290.0, ans=0.2 2023-10-10 05:02:43,310 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=248336.66666666666, ans=0.125 2023-10-10 05:03:04,900 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=248430.0, ans=0.125 2023-10-10 05:03:08,549 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.479e+02 1.797e+02 1.984e+02 2.225e+02 3.698e+02, threshold=3.967e+02, percent-clipped=0.0 2023-10-10 05:03:11,956 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=248476.66666666666, ans=0.2 2023-10-10 05:03:42,057 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=248616.66666666666, ans=0.2 2023-10-10 05:04:06,177 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=248710.0, ans=0.2 2023-10-10 05:04:10,459 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.93 vs. limit=6.0 2023-10-10 05:04:43,256 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=248850.0, ans=0.125 2023-10-10 05:04:53,676 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 05:04:54,602 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=248896.66666666666, ans=0.1 2023-10-10 05:04:57,375 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=248896.66666666666, ans=0.125 2023-10-10 05:04:58,261 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=248896.66666666666, ans=0.0 2023-10-10 05:05:02,953 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.442e+02 1.769e+02 2.035e+02 2.348e+02 3.620e+02, threshold=4.071e+02, percent-clipped=0.0 2023-10-10 05:05:20,666 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.64 vs. limit=15.0 2023-10-10 05:05:31,100 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.44 vs. 
limit=22.5 2023-10-10 05:05:32,578 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 05:05:46,910 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=249130.0, ans=0.125 2023-10-10 05:05:59,697 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.97 vs. limit=15.0 2023-10-10 05:06:08,757 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=249223.33333333334, ans=0.2 2023-10-10 05:06:38,089 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=249316.66666666666, ans=0.0 2023-10-10 05:06:54,259 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.466e+02 1.768e+02 1.991e+02 2.226e+02 3.170e+02, threshold=3.982e+02, percent-clipped=0.0 2023-10-10 05:07:09,906 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=249456.66666666666, ans=0.125 2023-10-10 05:07:18,498 INFO [train.py:1031] (3/4) Epoch 4, batch 12500, loss[loss=0.2605, simple_loss=0.3379, pruned_loss=0.09157, over 16674.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3225, pruned_loss=0.08084, over 32783522.92 frames. ], batch size: 202, lr: 9.43e-03, grad_scale: 32.0 2023-10-10 05:07:41,664 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=249596.66666666666, ans=0.125 2023-10-10 05:07:51,282 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=249643.33333333334, ans=0.0 2023-10-10 05:07:53,984 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=249643.33333333334, ans=0.125 2023-10-10 05:07:59,150 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=249643.33333333334, ans=0.125 2023-10-10 05:08:00,887 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.57 vs. limit=22.5 2023-10-10 05:08:05,693 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=249690.0, ans=0.125 2023-10-10 05:08:12,228 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=249736.66666666666, ans=0.0 2023-10-10 05:08:13,197 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=249736.66666666666, ans=0.2 2023-10-10 05:08:15,730 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=249736.66666666666, ans=0.1 2023-10-10 05:08:19,536 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.73 vs. 
limit=22.5 2023-10-10 05:08:26,139 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=249783.33333333334, ans=0.1 2023-10-10 05:08:37,100 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=249830.0, ans=0.0 2023-10-10 05:08:40,245 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.98 vs. limit=15.0 2023-10-10 05:08:42,199 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=249830.0, ans=0.0 2023-10-10 05:08:44,568 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.549e+02 1.722e+02 1.899e+02 2.216e+02 4.050e+02, threshold=3.797e+02, percent-clipped=1.0 2023-10-10 05:08:52,213 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=249876.66666666666, ans=0.95 2023-10-10 05:09:01,932 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=249923.33333333334, ans=0.125 2023-10-10 05:09:03,009 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=1.77 vs. limit=15.0 2023-10-10 05:09:04,690 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=249970.0, ans=0.125 2023-10-10 05:09:08,822 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=249970.0, ans=0.2 2023-10-10 05:09:10,436 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=249970.0, ans=0.125 2023-10-10 05:09:19,512 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=250016.66666666666, ans=0.0 2023-10-10 05:09:35,018 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=250063.33333333334, ans=0.2 2023-10-10 05:09:43,014 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=250110.0, ans=0.0 2023-10-10 05:09:57,722 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=250156.66666666666, ans=0.125 2023-10-10 05:09:58,781 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=250156.66666666666, ans=0.04949747468305833 2023-10-10 05:10:22,897 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=250250.0, ans=0.0 2023-10-10 05:10:25,935 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=250296.66666666666, ans=0.125 2023-10-10 05:10:34,556 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.401e+02 1.710e+02 2.008e+02 2.335e+02 3.389e+02, threshold=4.015e+02, percent-clipped=0.0 2023-10-10 05:10:46,589 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=250390.0, ans=0.0 2023-10-10 05:11:03,335 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=250436.66666666666, ans=0.125 2023-10-10 05:11:20,213 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=250530.0, ans=0.05 2023-10-10 05:11:21,372 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=250530.0, ans=0.125 2023-10-10 05:11:28,805 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=250576.66666666666, ans=0.2 2023-10-10 05:11:45,130 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.20 vs. limit=15.0 2023-10-10 05:11:51,647 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=250670.0, ans=0.0 2023-10-10 05:11:54,860 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=250670.0, ans=0.05 2023-10-10 05:11:54,931 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=250670.0, ans=0.0 2023-10-10 05:11:56,262 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=250670.0, ans=0.125 2023-10-10 05:11:56,590 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.11 vs. limit=15.0 2023-10-10 05:12:14,690 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=250763.33333333334, ans=0.0 2023-10-10 05:12:20,066 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=250763.33333333334, ans=0.125 2023-10-10 05:12:21,666 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.414e+02 1.810e+02 2.009e+02 2.245e+02 3.158e+02, threshold=4.019e+02, percent-clipped=0.0 2023-10-10 05:12:22,151 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.64 vs. limit=15.0 2023-10-10 05:12:31,399 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=250856.66666666666, ans=0.125 2023-10-10 05:12:33,089 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=250856.66666666666, ans=0.0 2023-10-10 05:12:47,598 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten.whitening_limit, batch_count=250903.33333333334, ans=15.0 2023-10-10 05:12:53,109 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.61 vs. 
limit=12.0 2023-10-10 05:13:13,091 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=251043.33333333334, ans=0.125 2023-10-10 05:13:19,992 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 05:13:24,593 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=251090.0, ans=0.025 2023-10-10 05:14:10,648 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.306e+02 1.750e+02 1.943e+02 2.150e+02 2.998e+02, threshold=3.885e+02, percent-clipped=0.0 2023-10-10 05:14:22,560 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=251323.33333333334, ans=0.1 2023-10-10 05:14:39,675 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.88 vs. limit=15.0 2023-10-10 05:14:46,585 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=251416.66666666666, ans=0.0 2023-10-10 05:14:51,710 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=251416.66666666666, ans=0.125 2023-10-10 05:15:10,483 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.69 vs. limit=15.0 2023-10-10 05:15:11,124 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=251510.0, ans=0.2 2023-10-10 05:15:14,396 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=251510.0, ans=0.0 2023-10-10 05:15:20,409 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=251556.66666666666, ans=0.125 2023-10-10 05:15:28,693 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=251603.33333333334, ans=0.125 2023-10-10 05:15:35,422 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=251603.33333333334, ans=0.125 2023-10-10 05:15:42,295 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=251650.0, ans=0.1 2023-10-10 05:15:43,385 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.76 vs. limit=22.5 2023-10-10 05:15:46,168 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=251650.0, ans=0.125 2023-10-10 05:15:55,804 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.26 vs. 
limit=15.0 2023-10-10 05:15:57,310 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=251743.33333333334, ans=0.125 2023-10-10 05:15:57,909 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.438e+02 1.825e+02 2.034e+02 2.506e+02 3.898e+02, threshold=4.069e+02, percent-clipped=1.0 2023-10-10 05:16:04,974 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=251743.33333333334, ans=0.1 2023-10-10 05:16:19,822 INFO [train.py:1031] (3/4) Epoch 4, batch 13000, loss[loss=0.244, simple_loss=0.3269, pruned_loss=0.0805, over 16931.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.323, pruned_loss=0.08095, over 32778372.37 frames. ], batch size: 77, lr: 9.39e-03, grad_scale: 32.0 2023-10-10 05:16:23,732 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=251836.66666666666, ans=0.125 2023-10-10 05:16:23,736 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=251836.66666666666, ans=0.2 2023-10-10 05:16:47,281 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.50 vs. limit=22.5 2023-10-10 05:17:18,751 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=252023.33333333334, ans=0.125 2023-10-10 05:17:23,461 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.29 vs. limit=15.0 2023-10-10 05:17:30,858 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.85 vs. 
limit=10.0 2023-10-10 05:17:41,297 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=252116.66666666666, ans=0.1 2023-10-10 05:17:42,407 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=252116.66666666666, ans=0.125 2023-10-10 05:17:49,866 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=252163.33333333334, ans=0.1 2023-10-10 05:17:57,802 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.570e+02 1.781e+02 1.954e+02 2.240e+02 3.918e+02, threshold=3.909e+02, percent-clipped=0.0 2023-10-10 05:18:01,289 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=252210.0, ans=0.1 2023-10-10 05:18:20,432 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=252303.33333333334, ans=0.125 2023-10-10 05:18:22,572 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=252303.33333333334, ans=0.125 2023-10-10 05:18:39,025 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=252350.0, ans=0.125 2023-10-10 05:18:40,361 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=252396.66666666666, ans=0.125 2023-10-10 05:18:41,314 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=252396.66666666666, ans=0.04949747468305833 2023-10-10 05:19:13,983 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=252490.0, ans=0.125 2023-10-10 05:19:19,159 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=252536.66666666666, ans=0.125 2023-10-10 05:19:39,766 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=252630.0, ans=0.125 2023-10-10 05:19:46,677 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=252630.0, ans=0.125 2023-10-10 05:19:51,590 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.405e+02 1.785e+02 2.025e+02 2.230e+02 3.366e+02, threshold=4.051e+02, percent-clipped=0.0 2023-10-10 05:20:08,019 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=252723.33333333334, ans=0.125 2023-10-10 05:20:41,007 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=252863.33333333334, ans=0.0 2023-10-10 05:20:41,972 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=252863.33333333334, ans=0.125 2023-10-10 05:20:52,041 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=252910.0, ans=0.1 2023-10-10 05:20:52,111 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, 
batch_count=252910.0, ans=0.1 2023-10-10 05:21:44,886 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.497e+02 1.871e+02 2.080e+02 2.573e+02 4.064e+02, threshold=4.160e+02, percent-clipped=1.0 2023-10-10 05:21:59,152 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=253190.0, ans=0.035 2023-10-10 05:22:03,839 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=253190.0, ans=0.125 2023-10-10 05:22:09,113 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=253236.66666666666, ans=0.0 2023-10-10 05:22:13,235 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.44 vs. limit=22.5 2023-10-10 05:22:15,824 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=253283.33333333334, ans=0.125 2023-10-10 05:22:16,857 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 05:22:26,061 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=253283.33333333334, ans=0.125 2023-10-10 05:22:26,340 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.21 vs. limit=10.0 2023-10-10 05:22:28,359 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.79 vs. limit=6.0 2023-10-10 05:22:53,693 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.87 vs. limit=10.0 2023-10-10 05:23:26,755 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=253563.33333333334, ans=0.0 2023-10-10 05:23:37,071 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.575e+02 2.057e+02 2.336e+02 2.783e+02 4.529e+02, threshold=4.671e+02, percent-clipped=2.0 2023-10-10 05:23:40,164 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=253610.0, ans=0.125 2023-10-10 05:23:40,177 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=253610.0, ans=0.125 2023-10-10 05:24:49,603 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=253936.66666666666, ans=0.0 2023-10-10 05:24:59,855 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=253983.33333333334, ans=0.125 2023-10-10 05:25:00,063 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.76 vs. 
limit=15.0 2023-10-10 05:25:10,797 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=254030.0, ans=0.5 2023-10-10 05:25:17,218 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=254030.0, ans=0.125 2023-10-10 05:25:21,555 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=254076.66666666666, ans=0.0 2023-10-10 05:25:22,254 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.454e+02 1.758e+02 1.911e+02 2.264e+02 2.946e+02, threshold=3.823e+02, percent-clipped=0.0 2023-10-10 05:25:28,926 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.41 vs. limit=12.0 2023-10-10 05:25:42,278 INFO [train.py:1031] (3/4) Epoch 4, batch 13500, loss[loss=0.2395, simple_loss=0.3216, pruned_loss=0.07872, over 16944.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3218, pruned_loss=0.08047, over 32793176.10 frames. ], batch size: 123, lr: 9.35e-03, grad_scale: 16.0 2023-10-10 05:26:51,582 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=254450.0, ans=0.2 2023-10-10 05:27:02,114 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.28 vs. limit=22.5 2023-10-10 05:27:12,799 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.332e+02 1.759e+02 1.982e+02 2.253e+02 2.975e+02, threshold=3.964e+02, percent-clipped=0.0 2023-10-10 05:27:13,422 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.73 vs. limit=6.0 2023-10-10 05:27:20,020 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.57 vs. limit=15.0 2023-10-10 05:27:23,036 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=254590.0, ans=0.035 2023-10-10 05:28:04,596 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=254776.66666666666, ans=0.125 2023-10-10 05:28:08,793 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=254823.33333333334, ans=0.1 2023-10-10 05:28:52,477 INFO [train.py:1031] (3/4) Epoch 5, batch 0, loss[loss=0.2277, simple_loss=0.306, pruned_loss=0.0747, over 16883.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.306, pruned_loss=0.0747, over 16883.00 frames. ], batch size: 72, lr: 8.17e-03, grad_scale: 32.0 2023-10-10 05:28:52,478 INFO [train.py:1054] (3/4) Computing validation loss 2023-10-10 05:29:00,183 INFO [train.py:1063] (3/4) Epoch 5, validation: loss=0.2397, simple_loss=0.3257, pruned_loss=0.07681, over 1020973.00 frames. 
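A note on the recurring [optim.py:471] records above: each one reports the quartiles (min, 25%, median, 75%, max) of recent per-batch gradient norms, and in every such record in this stretch the logged threshold equals Clipping_scale times the median quartile, up to rounding (for the first one, 2.0 * 1.975e+02 = 3.950e+02), with percent-clipped giving the share of recent batches whose norm exceeded that threshold. The following is a minimal sketch of that bookkeeping, assuming a sliding window of per-batch norms; the function name, window size, and exact clipping rule are illustrative, not icefall's actual optim.py code.

```python
# Minimal sketch of the quartile-based clipping summarized in the
# [optim.py:471] records. Assumption: this is NOT icefall's actual
# implementation; the window size, names, and clipping rule are
# illustrative. The one relation taken from the log itself is
# threshold = clipping_scale * median(recent grad norms).
import torch

def clip_gradients(grads, norm_history, clipping_scale=2.0, window=128):
    """grads: list of gradient tensors for one batch.
    norm_history: recent per-batch total grad norms, kept by the caller."""
    norm = torch.sqrt(sum((g.detach() ** 2).sum() for g in grads)).item()
    norm_history.append(norm)
    del norm_history[:-window]  # keep only the most recent `window` batches

    t = torch.tensor(norm_history)
    q = torch.quantile(t, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    threshold = clipping_scale * q[2].item()  # 2.0 * median, as in the log
    if norm > threshold:
        for g in grads:
            g.mul_(threshold / norm)  # rescale this batch's gradients in place

    pct = 100.0 * sum(n > threshold for n in norm_history) / len(norm_history)
    print(f"grad-norm quartiles {q[0]:.3e} {q[1]:.3e} {q[2]:.3e} "
          f"{q[3]:.3e} {q[4]:.3e}, threshold={threshold:.3e}, "
          f"percent-clipped={pct:.1f}")
```

Read this way, records with percent-clipped above 0.0, such as the one at 05:08:44 whose maximum 4.050e+02 exceeds its threshold 3.797e+02, mark intervals where a small fraction of batches had their gradients rescaled down to the threshold rather than applied at full magnitude.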
2023-10-10 05:29:00,183 INFO [train.py:1064] (3/4) Maximum memory allocated so far is 16891MB 2023-10-10 05:29:08,257 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=254893.33333333334, ans=0.125 2023-10-10 05:29:21,258 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=254986.66666666666, ans=0.1 2023-10-10 05:29:28,058 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.422e+02 1.854e+02 2.007e+02 2.304e+02 3.822e+02, threshold=4.014e+02, percent-clipped=0.0 2023-10-10 05:29:30,739 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=254986.66666666666, ans=0.025 2023-10-10 05:29:30,847 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=254986.66666666666, ans=0.125 2023-10-10 05:29:35,461 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=255033.33333333334, ans=0.125 2023-10-10 05:29:48,394 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 05:29:52,392 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.34 vs. limit=15.0 2023-10-10 05:29:56,520 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.89 vs. limit=22.5 2023-10-10 05:30:17,825 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=255220.0, ans=0.125 2023-10-10 05:30:34,861 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=255266.66666666666, ans=0.09899494936611666 2023-10-10 05:30:35,134 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.01 vs. limit=22.5 2023-10-10 05:30:35,989 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=255266.66666666666, ans=0.125 2023-10-10 05:30:39,078 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=255266.66666666666, ans=0.2 2023-10-10 05:30:47,261 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.45 vs. 
limit=6.0 2023-10-10 05:30:50,592 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=255360.0, ans=0.125 2023-10-10 05:31:01,448 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=255406.66666666666, ans=0.0 2023-10-10 05:31:07,593 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=255406.66666666666, ans=0.0 2023-10-10 05:31:10,383 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=255406.66666666666, ans=10.0 2023-10-10 05:31:19,444 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.365e+02 1.663e+02 1.776e+02 2.118e+02 3.146e+02, threshold=3.553e+02, percent-clipped=0.0 2023-10-10 05:31:29,143 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=255500.0, ans=0.1 2023-10-10 05:31:31,210 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=255500.0, ans=0.125 2023-10-10 05:31:39,566 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.55 vs. limit=15.0 2023-10-10 05:32:16,886 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=255733.33333333334, ans=0.1 2023-10-10 05:32:20,059 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=255733.33333333334, ans=0.025 2023-10-10 05:32:48,917 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=255873.33333333334, ans=0.0 2023-10-10 05:32:57,719 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=255873.33333333334, ans=0.2 2023-10-10 05:33:05,570 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.571e+02 1.871e+02 2.164e+02 2.479e+02 3.459e+02, threshold=4.327e+02, percent-clipped=0.0 2023-10-10 05:33:11,827 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=255966.66666666666, ans=0.125 2023-10-10 05:33:24,514 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=256013.33333333334, ans=0.125 2023-10-10 05:33:42,976 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.89 vs. limit=22.5 2023-10-10 05:34:16,141 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=256246.66666666666, ans=0.0 2023-10-10 05:34:46,725 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=256340.0, ans=0.125 2023-10-10 05:34:53,866 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.65 vs. 
limit=15.0 2023-10-10 05:34:57,508 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.492e+02 1.829e+02 2.034e+02 2.392e+02 3.401e+02, threshold=4.069e+02, percent-clipped=0.0 2023-10-10 05:35:15,315 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.19 vs. limit=15.0 2023-10-10 05:35:22,361 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=256526.66666666666, ans=0.07 2023-10-10 05:35:36,735 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=256573.33333333334, ans=0.125 2023-10-10 05:35:40,051 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=256573.33333333334, ans=0.125 2023-10-10 05:35:47,156 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=256620.0, ans=0.125 2023-10-10 05:36:31,002 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.33 vs. limit=12.0 2023-10-10 05:36:32,500 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=256806.66666666666, ans=0.0 2023-10-10 05:36:41,296 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.19 vs. limit=22.5 2023-10-10 05:36:42,557 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.468e+02 1.905e+02 2.404e+02 2.784e+02 3.937e+02, threshold=4.808e+02, percent-clipped=0.0 2023-10-10 05:36:55,091 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=256900.0, ans=0.125 2023-10-10 05:37:02,671 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=256946.66666666666, ans=0.04949747468305833 2023-10-10 05:37:08,900 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=256993.33333333334, ans=0.125 2023-10-10 05:37:11,652 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=256993.33333333334, ans=0.2 2023-10-10 05:37:25,379 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=257040.0, ans=0.125 2023-10-10 05:37:49,930 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=257133.33333333334, ans=0.0 2023-10-10 05:37:57,664 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=257180.0, ans=0.2 2023-10-10 05:38:00,508 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=257180.0, ans=0.1 2023-10-10 05:38:07,247 INFO [train.py:1031] (3/4) Epoch 5, batch 500, loss[loss=0.2406, simple_loss=0.3097, pruned_loss=0.08575, over 16164.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3206, pruned_loss=0.07964, over 7285084.96 frames. 
], batch size: 297, lr: 8.14e-03, grad_scale: 32.0 2023-10-10 05:38:20,257 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=257273.33333333334, ans=0.125 2023-10-10 05:38:21,592 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.41 vs. limit=15.0 2023-10-10 05:38:26,691 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=257273.33333333334, ans=0.125 2023-10-10 05:38:35,911 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.444e+02 1.710e+02 1.903e+02 2.059e+02 2.708e+02, threshold=3.807e+02, percent-clipped=0.0 2023-10-10 05:39:23,624 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=257553.33333333334, ans=0.125 2023-10-10 05:39:47,258 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.50 vs. limit=22.5 2023-10-10 05:39:55,194 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=257693.33333333334, ans=0.125 2023-10-10 05:39:55,444 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.58 vs. limit=22.5 2023-10-10 05:40:00,619 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.37 vs. limit=15.0 2023-10-10 05:40:11,438 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=257740.0, ans=0.125 2023-10-10 05:40:12,402 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 05:40:23,217 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.416e+02 1.783e+02 2.012e+02 2.365e+02 2.915e+02, threshold=4.024e+02, percent-clipped=0.0 2023-10-10 05:40:31,365 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=257833.33333333334, ans=0.1 2023-10-10 05:40:35,734 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.57 vs. 
limit=15.0 2023-10-10 05:40:40,252 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=257880.0, ans=0.125 2023-10-10 05:40:48,639 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=257880.0, ans=0.1 2023-10-10 05:40:57,302 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 05:40:57,339 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=257926.66666666666, ans=0.125 2023-10-10 05:41:19,646 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=258020.0, ans=0.125 2023-10-10 05:41:21,666 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=258020.0, ans=0.0 2023-10-10 05:41:42,573 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=258113.33333333334, ans=0.0 2023-10-10 05:41:51,537 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.17 vs. limit=22.5 2023-10-10 05:41:57,114 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=12.24 vs. limit=15.0 2023-10-10 05:42:02,667 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.51 vs. limit=15.0 2023-10-10 05:42:12,199 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.484e+02 1.825e+02 2.056e+02 2.493e+02 3.427e+02, threshold=4.113e+02, percent-clipped=0.0 2023-10-10 05:42:29,227 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=258300.0, ans=0.125 2023-10-10 05:42:39,318 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=258346.66666666666, ans=0.125 2023-10-10 05:42:58,845 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=258440.0, ans=0.125 2023-10-10 05:43:16,182 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=258533.33333333334, ans=0.125 2023-10-10 05:43:26,313 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=258580.0, ans=0.0 2023-10-10 05:43:44,053 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=258626.66666666666, ans=0.0 2023-10-10 05:43:59,917 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=258720.0, ans=0.1 2023-10-10 05:44:06,352 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.422e+02 1.753e+02 1.933e+02 2.301e+02 3.796e+02, threshold=3.865e+02, percent-clipped=0.0 2023-10-10 05:44:30,257 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.75 vs. 
limit=15.0 2023-10-10 05:44:37,735 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.31 vs. limit=12.0 2023-10-10 05:44:39,007 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.12 vs. limit=15.0 2023-10-10 05:45:01,998 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=258953.33333333334, ans=0.1 2023-10-10 05:45:19,145 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=259000.0, ans=0.0 2023-10-10 05:45:43,973 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.98 vs. limit=15.0 2023-10-10 05:45:51,956 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=259140.0, ans=0.125 2023-10-10 05:46:02,417 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.353e+02 1.729e+02 1.900e+02 2.226e+02 3.301e+02, threshold=3.801e+02, percent-clipped=0.0 2023-10-10 05:46:02,642 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=259186.66666666666, ans=0.125 2023-10-10 05:46:07,165 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.38 vs. limit=15.0 2023-10-10 05:46:14,733 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=259233.33333333334, ans=0.125 2023-10-10 05:46:30,775 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=259326.66666666666, ans=0.1 2023-10-10 05:46:32,358 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=259326.66666666666, ans=0.125 2023-10-10 05:46:35,351 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=259326.66666666666, ans=0.07 2023-10-10 05:46:37,630 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.69 vs. 
limit=10.0 2023-10-10 05:46:40,117 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=259326.66666666666, ans=0.0 2023-10-10 05:47:00,841 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=259420.0, ans=0.125 2023-10-10 05:47:07,335 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=259466.66666666666, ans=0.0 2023-10-10 05:47:11,261 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=259466.66666666666, ans=0.125 2023-10-10 05:47:13,577 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=259513.33333333334, ans=0.125 2023-10-10 05:47:15,497 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=259513.33333333334, ans=0.125 2023-10-10 05:47:20,358 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.11 vs. limit=10.0 2023-10-10 05:47:25,139 INFO [train.py:1031] (3/4) Epoch 5, batch 1000, loss[loss=0.2264, simple_loss=0.3132, pruned_loss=0.06978, over 16950.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3205, pruned_loss=0.07955, over 12906722.74 frames. ], batch size: 130, lr: 8.10e-03, grad_scale: 32.0 2023-10-10 05:47:30,912 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=259560.0, ans=0.05 2023-10-10 05:47:40,370 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=259606.66666666666, ans=0.2 2023-10-10 05:47:42,811 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.23 vs. limit=22.5 2023-10-10 05:47:43,475 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=259606.66666666666, ans=0.09899494936611666 2023-10-10 05:47:52,648 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.434e+02 1.755e+02 2.072e+02 2.523e+02 4.280e+02, threshold=4.144e+02, percent-clipped=3.0 2023-10-10 05:47:53,834 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=259653.33333333334, ans=0.125 2023-10-10 05:48:00,192 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=259700.0, ans=0.1 2023-10-10 05:48:20,560 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=259793.33333333334, ans=0.015 2023-10-10 05:48:45,422 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 05:48:54,219 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=259933.33333333334, ans=0.0 2023-10-10 05:48:54,412 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.51 vs. 
limit=15.0 2023-10-10 05:49:29,033 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=260073.33333333334, ans=0.125 2023-10-10 05:49:35,152 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=260120.0, ans=10.0 2023-10-10 05:49:40,386 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.389e+02 1.785e+02 2.054e+02 2.321e+02 3.377e+02, threshold=4.109e+02, percent-clipped=0.0 2023-10-10 05:49:45,563 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=260166.66666666666, ans=0.2 2023-10-10 05:49:49,511 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=260166.66666666666, ans=0.125 2023-10-10 05:50:03,354 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=260213.33333333334, ans=0.1 2023-10-10 05:50:04,271 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 05:50:21,597 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=260306.66666666666, ans=0.0 2023-10-10 05:50:40,798 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=260353.33333333334, ans=0.1 2023-10-10 05:50:51,345 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=260400.0, ans=0.125 2023-10-10 05:50:57,057 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=260400.0, ans=0.2 2023-10-10 05:51:02,978 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=260446.66666666666, ans=0.125 2023-10-10 05:51:14,228 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=260493.33333333334, ans=0.025 2023-10-10 05:51:19,531 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=260540.0, ans=0.2 2023-10-10 05:51:30,037 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.60 vs. 
limit=15.0 2023-10-10 05:51:34,050 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=260586.66666666666, ans=0.125 2023-10-10 05:51:37,646 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.387e+02 1.584e+02 1.856e+02 2.185e+02 3.806e+02, threshold=3.711e+02, percent-clipped=0.0 2023-10-10 05:51:47,126 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=260633.33333333334, ans=0.1 2023-10-10 05:51:52,586 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=260680.0, ans=0.125 2023-10-10 05:51:58,324 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=260680.0, ans=0.125 2023-10-10 05:52:24,076 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=260820.0, ans=0.0 2023-10-10 05:52:40,482 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.20 vs. limit=12.0 2023-10-10 05:52:48,484 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=260913.33333333334, ans=0.125 2023-10-10 05:52:55,422 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.62 vs. limit=15.0 2023-10-10 05:53:07,367 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.13 vs. limit=15.0 2023-10-10 05:53:11,148 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=261006.66666666666, ans=10.0 2023-10-10 05:53:22,359 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.316e+02 1.722e+02 1.988e+02 2.282e+02 3.021e+02, threshold=3.976e+02, percent-clipped=0.0 2023-10-10 05:53:32,792 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=261100.0, ans=0.1 2023-10-10 05:53:41,152 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=261146.66666666666, ans=0.125 2023-10-10 05:53:42,846 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=261146.66666666666, ans=0.0 2023-10-10 05:53:49,344 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=261193.33333333334, ans=0.2 2023-10-10 05:53:49,460 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=261193.33333333334, ans=0.2 2023-10-10 05:53:53,661 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=261193.33333333334, ans=0.1 2023-10-10 05:54:01,789 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=261240.0, ans=0.2 2023-10-10 05:54:07,619 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=5.97 vs. 
limit=15.0 2023-10-10 05:54:08,271 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=261286.66666666666, ans=0.1 2023-10-10 05:54:32,079 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=261380.0, ans=0.125 2023-10-10 05:54:41,261 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=261380.0, ans=0.125 2023-10-10 05:55:02,941 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=261473.33333333334, ans=0.2 2023-10-10 05:55:09,576 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=261520.0, ans=0.125 2023-10-10 05:55:12,802 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.390e+02 1.774e+02 1.995e+02 2.346e+02 3.424e+02, threshold=3.989e+02, percent-clipped=0.0 2023-10-10 05:55:21,052 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=261566.66666666666, ans=0.125 2023-10-10 05:55:21,103 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=261566.66666666666, ans=0.125 2023-10-10 05:55:31,513 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=261613.33333333334, ans=0.125 2023-10-10 05:55:54,471 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.89 vs. limit=15.0 2023-10-10 05:56:21,428 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_positive, batch_count=261800.0, ans=0.05 2023-10-10 05:56:28,628 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=261846.66666666666, ans=0.0 2023-10-10 05:56:35,856 INFO [train.py:1031] (3/4) Epoch 5, batch 1500, loss[loss=0.2045, simple_loss=0.2928, pruned_loss=0.0581, over 16910.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3183, pruned_loss=0.07816, over 17304268.85 frames. ], batch size: 77, lr: 8.07e-03, grad_scale: 32.0 2023-10-10 05:56:58,146 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=261940.0, ans=0.125 2023-10-10 05:57:00,035 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.17 vs. 
limit=15.0 2023-10-10 05:57:02,896 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=261986.66666666666, ans=0.2 2023-10-10 05:57:05,781 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 05:57:07,812 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.297e+02 1.699e+02 1.927e+02 2.323e+02 3.652e+02, threshold=3.854e+02, percent-clipped=0.0 2023-10-10 05:57:48,706 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=262173.3333333333, ans=0.125 2023-10-10 05:58:00,536 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=262220.0, ans=0.0 2023-10-10 05:58:00,642 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.47 vs. limit=10.0 2023-10-10 05:58:12,177 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=262266.6666666667, ans=10.0 2023-10-10 05:58:25,594 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=3.94 vs. limit=12.0 2023-10-10 05:58:46,497 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=262406.6666666667, ans=0.0 2023-10-10 05:58:49,907 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=262406.6666666667, ans=0.2 2023-10-10 05:58:52,090 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.87 vs. limit=6.0 2023-10-10 05:58:59,333 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=262453.3333333333, ans=0.125 2023-10-10 05:58:59,999 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.428e+02 1.842e+02 2.109e+02 2.479e+02 3.330e+02, threshold=4.218e+02, percent-clipped=0.0 2023-10-10 05:59:00,272 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=262453.3333333333, ans=0.125 2023-10-10 05:59:14,630 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=262500.0, ans=0.125 2023-10-10 05:59:24,869 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=262546.6666666667, ans=0.1 2023-10-10 05:59:50,859 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=262640.0, ans=0.1 2023-10-10 06:00:09,508 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=262733.3333333333, ans=0.2 2023-10-10 06:00:19,803 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=262780.0, ans=0.125 2023-10-10 06:00:25,149 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=262780.0, ans=0.125 2023-10-10 06:00:26,780 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.26 vs. 
limit=15.0 2023-10-10 06:00:27,793 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.35 vs. limit=15.0 2023-10-10 06:00:39,608 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=262873.3333333333, ans=0.07 2023-10-10 06:00:56,943 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.494e+02 1.713e+02 1.891e+02 2.172e+02 4.297e+02, threshold=3.782e+02, percent-clipped=0.0 2023-10-10 06:00:57,263 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=262920.0, ans=0.0 2023-10-10 06:01:02,360 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=262966.6666666667, ans=0.125 2023-10-10 06:01:06,659 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=262966.6666666667, ans=0.2 2023-10-10 06:01:09,728 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=263013.3333333333, ans=0.125 2023-10-10 06:01:12,370 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=263013.3333333333, ans=0.125 2023-10-10 06:01:22,687 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.41 vs. limit=12.0 2023-10-10 06:01:35,121 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.84 vs. limit=15.0 2023-10-10 06:01:51,226 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=263153.3333333333, ans=0.1 2023-10-10 06:01:51,617 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.18 vs. limit=15.0 2023-10-10 06:01:57,759 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.48 vs. limit=10.0 2023-10-10 06:02:00,661 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=25.92 vs. limit=22.5 2023-10-10 06:02:02,436 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=263200.0, ans=0.1 2023-10-10 06:02:03,927 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=263200.0, ans=0.125 2023-10-10 06:02:07,784 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=263200.0, ans=0.1 2023-10-10 06:02:24,969 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=263293.3333333333, ans=0.125 2023-10-10 06:02:26,148 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.16 vs. limit=15.0 2023-10-10 06:02:31,521 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.33 vs. 
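[editor's note] The optim.py lines above report quartiles (min, Q1, median, Q3, max) of recent gradient norms alongside a clipping threshold; in every such line the threshold equals Clipping_scale (2.0) times the median, e.g. 2.0 x 1.891e+02 = 3.782e+02 just above. A sketch of that bookkeeping, assuming a sliding window of recent per-step norms; the window size and the exact clipping mechanics are assumptions.

from collections import deque

class GradNormStats:
    def __init__(self, clipping_scale: float = 2.0, window: int = 400):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=window)  # recent per-step gradient norms
        self.clipped = 0
        self.steps = 0

    def update(self, grad_norm: float) -> float:
        self.norms.append(grad_norm)
        s = sorted(self.norms)
        q = [s[int(p * (len(s) - 1))] for p in (0.0, 0.25, 0.5, 0.75, 1.0)]
        threshold = self.clipping_scale * q[2]  # 2.0 x median, as in the log
        self.steps += 1
        self.clipped += grad_norm > threshold
        print("grad-norm quartiles "
              + " ".join(f"{v:.3e}" for v in q)
              + f", threshold={threshold:.3e}"
              + f", percent-clipped={100.0 * self.clipped / self.steps:.1f}")
        return threshold

Tying the threshold to a running median makes clipping adaptive: percent-clipped stays near 0.0 in steady training and only rises when a step's norm jumps well above the recent typical norm.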
limit=22.5 2023-10-10 06:02:36,744 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=263340.0, ans=10.0 2023-10-10 06:02:50,243 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.368e+02 1.686e+02 1.903e+02 2.219e+02 3.598e+02, threshold=3.806e+02, percent-clipped=1.0 2023-10-10 06:03:02,703 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_na.min_abs, batch_count=263433.3333333333, ans=0.02 2023-10-10 06:03:10,309 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.00 vs. limit=6.0 2023-10-10 06:03:26,751 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=263573.3333333333, ans=0.0 2023-10-10 06:03:32,719 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=263573.3333333333, ans=0.125 2023-10-10 06:03:33,753 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=263573.3333333333, ans=0.2 2023-10-10 06:03:39,189 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=263620.0, ans=0.0 2023-10-10 06:03:45,685 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 06:03:57,715 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=263666.6666666667, ans=0.0 2023-10-10 06:04:04,209 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.19 vs. 
limit=22.5 2023-10-10 06:04:08,299 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=263713.3333333333, ans=0.07 2023-10-10 06:04:16,850 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=263760.0, ans=0.07 2023-10-10 06:04:20,927 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=263806.6666666667, ans=0.125 2023-10-10 06:04:22,681 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=263806.6666666667, ans=0.0 2023-10-10 06:04:38,295 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.439e+02 1.740e+02 1.965e+02 2.370e+02 3.403e+02, threshold=3.930e+02, percent-clipped=0.0 2023-10-10 06:04:42,006 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=263900.0, ans=0.125 2023-10-10 06:05:27,533 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=264040.0, ans=0.0 2023-10-10 06:05:39,762 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=264086.6666666667, ans=0.025 2023-10-10 06:05:43,108 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=264086.6666666667, ans=0.0 2023-10-10 06:05:55,580 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=264133.3333333333, ans=0.1 2023-10-10 06:05:57,350 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=264133.3333333333, ans=0.2 2023-10-10 06:06:13,124 INFO [train.py:1031] (3/4) Epoch 5, batch 2000, loss[loss=0.2466, simple_loss=0.3345, pruned_loss=0.07934, over 16840.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3187, pruned_loss=0.07794, over 20733251.57 frames. ], batch size: 155, lr: 8.03e-03, grad_scale: 32.0 2023-10-10 06:06:18,636 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.90 vs. limit=12.0 2023-10-10 06:06:37,387 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.76 vs. limit=15.0 2023-10-10 06:06:49,725 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=264320.0, ans=0.125 2023-10-10 06:06:50,378 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.382e+02 1.794e+02 2.060e+02 2.220e+02 2.983e+02, threshold=4.121e+02, percent-clipped=0.0 2023-10-10 06:06:56,119 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=264366.6666666667, ans=0.0 2023-10-10 06:06:56,375 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.21 vs. 
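[editor's note] In the train.py batch summaries, loss[...] is the current batch (with its own frame count) and tot_loss[...] is a frame-weighted running average. Two things are checkable directly from the numbers: the printed loss equals 0.5 x simple_loss + pruned_loss (e.g. 0.5 x 0.3345 + 0.07934 = 0.2466 above), and the fractional cumulative frame counts (20733251.57) suggest older batches are exponentially down-weighted rather than simply summed. A sketch under those assumptions; the decay constant is illustrative.

SIMPLE_LOSS_SCALE = 0.5  # matches loss = 0.5*simple_loss + pruned_loss in the log

def combined_loss(simple_loss: float, pruned_loss: float) -> float:
    return SIMPLE_LOSS_SCALE * simple_loss + pruned_loss

class RunningFrameWeightedLoss:
    def __init__(self, decay: float = 0.999):
        self.decay = decay            # assumed down-weighting of older batches
        self.weighted_loss = 0.0
        self.weighted_frames = 0.0

    def update(self, batch_loss: float, batch_frames: float) -> float:
        self.weighted_loss = self.decay * self.weighted_loss + batch_loss * batch_frames
        self.weighted_frames = self.decay * self.weighted_frames + batch_frames
        return self.weighted_loss / self.weighted_frames

tracker = RunningFrameWeightedLoss()
print(combined_loss(0.3345, 0.07934))   # ~0.2466, the batch loss logged above
tracker.update(0.2466, 16840.0)         # frame count from the same summary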
limit=15.0 2023-10-10 06:07:03,030 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=264366.6666666667, ans=0.125 2023-10-10 06:07:29,814 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=264460.0, ans=0.0 2023-10-10 06:07:38,319 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=264506.6666666667, ans=0.015 2023-10-10 06:07:45,822 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=264553.3333333333, ans=0.2 2023-10-10 06:07:57,088 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=264600.0, ans=0.1 2023-10-10 06:08:05,464 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=264646.6666666667, ans=0.125 2023-10-10 06:08:05,664 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.82 vs. limit=15.0 2023-10-10 06:08:17,085 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=264646.6666666667, ans=10.0 2023-10-10 06:08:33,418 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=264693.3333333333, ans=0.0 2023-10-10 06:08:35,960 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=264693.3333333333, ans=0.0 2023-10-10 06:09:07,556 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=264786.6666666667, ans=0.0 2023-10-10 06:09:08,111 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.456e+02 1.650e+02 1.892e+02 2.078e+02 2.911e+02, threshold=3.785e+02, percent-clipped=0.0 2023-10-10 06:09:12,852 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=264833.3333333333, ans=0.125 2023-10-10 06:09:19,697 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.81 vs. limit=15.0 2023-10-10 06:09:19,874 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.15 vs. limit=6.0 2023-10-10 06:09:29,371 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=264880.0, ans=0.125 2023-10-10 06:09:55,488 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.12 vs. limit=15.0 2023-10-10 06:09:57,656 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=264973.3333333333, ans=0.125 2023-10-10 06:10:04,567 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.60 vs. 
limit=15.0 2023-10-10 06:10:05,393 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=265020.0, ans=0.2 2023-10-10 06:10:07,359 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=265020.0, ans=0.1 2023-10-10 06:10:20,996 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=265113.3333333333, ans=0.5 2023-10-10 06:10:47,213 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.15 vs. limit=22.5 2023-10-10 06:10:51,595 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=265206.6666666667, ans=0.125 2023-10-10 06:11:01,404 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.479e+02 1.757e+02 1.978e+02 2.230e+02 3.113e+02, threshold=3.956e+02, percent-clipped=0.0 2023-10-10 06:11:08,075 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=265300.0, ans=0.125 2023-10-10 06:11:09,991 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=265300.0, ans=0.0 2023-10-10 06:11:16,994 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=265346.6666666667, ans=0.125 2023-10-10 06:11:19,805 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=265346.6666666667, ans=0.125 2023-10-10 06:11:33,089 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=265393.3333333333, ans=0.05 2023-10-10 06:11:34,147 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.45 vs. limit=22.5 2023-10-10 06:11:48,367 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.53 vs. 
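[editor's note] The Whitening lines compare a per-module metric against its scheduled limit; the whitening penalty only activates when the metric exceeds the limit, so entries like metric=4.60 vs. limit=15.0 indicate a healthy module. The metric measures how far the channel covariance of the activations is from a scaled identity, computed per channel group. The formula below follows that idea and equals 1.0 for perfectly whitened activations; the exact expression in scaling.py may differ.

import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> torch.Tensor:
    # x: (frames, channels); channels are split into num_groups groups
    frames, channels = x.shape
    per_group = channels // num_groups
    g = x.reshape(frames, num_groups, per_group).transpose(0, 1)
    g = g - g.mean(dim=1, keepdim=True)
    cov = g.transpose(1, 2) @ g / frames          # (groups, c, c) covariance
    diag_mean = cov.diagonal(dim1=1, dim2=2).mean(dim=1)
    # 1.0 when cov is a scaled identity (fully whitened), larger otherwise
    metric = (cov ** 2).sum(dim=(1, 2)) / (diag_mean ** 2 * per_group)
    return metric.mean()

print(whitening_metric(torch.randn(10000, 384)))  # close to 1.0 for white noise

This explains the pattern in the log: whiten_keys entries with num_groups=4 evaluate the metric per attention-head group, while feed-forward and conv whiteners use num_groups=1 over the full channel dimension.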
limit=15.0 2023-10-10 06:11:54,538 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=265486.6666666667, ans=0.2 2023-10-10 06:11:54,602 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=265486.6666666667, ans=0.125 2023-10-10 06:12:03,435 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=265533.3333333333, ans=0.125 2023-10-10 06:12:18,579 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=265626.6666666667, ans=0.0 2023-10-10 06:12:19,444 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=265626.6666666667, ans=0.0 2023-10-10 06:12:40,922 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=265720.0, ans=0.125 2023-10-10 06:12:46,399 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.268e+02 1.735e+02 1.886e+02 2.094e+02 2.828e+02, threshold=3.771e+02, percent-clipped=0.0 2023-10-10 06:12:47,761 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.92 vs. limit=22.5 2023-10-10 06:12:48,618 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=265766.6666666667, ans=0.0 2023-10-10 06:12:50,379 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=265766.6666666667, ans=0.125 2023-10-10 06:12:50,392 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=265766.6666666667, ans=0.125 2023-10-10 06:13:00,898 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=265766.6666666667, ans=0.2 2023-10-10 06:13:09,158 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=265813.3333333333, ans=0.125 2023-10-10 06:13:24,441 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=265906.6666666667, ans=0.1 2023-10-10 06:13:55,694 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.57 vs. limit=22.5 2023-10-10 06:13:56,443 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 06:14:01,001 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=266046.6666666667, ans=0.0 2023-10-10 06:14:07,929 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.06 vs. 
limit=22.5 2023-10-10 06:14:08,511 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=266093.3333333333, ans=0.1 2023-10-10 06:14:21,508 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=266140.0, ans=0.125 2023-10-10 06:14:22,754 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.10 vs. limit=6.0 2023-10-10 06:14:28,401 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=266186.6666666667, ans=0.125 2023-10-10 06:14:33,969 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.488e+02 1.877e+02 2.196e+02 2.532e+02 3.568e+02, threshold=4.392e+02, percent-clipped=0.0 2023-10-10 06:14:36,546 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.74 vs. limit=12.0 2023-10-10 06:14:40,987 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=266233.3333333333, ans=0.2 2023-10-10 06:14:57,262 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=266280.0, ans=0.2 2023-10-10 06:14:57,349 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=266280.0, ans=0.0 2023-10-10 06:15:05,144 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 06:15:17,504 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=266373.3333333333, ans=0.0 2023-10-10 06:15:22,171 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.68 vs. limit=6.0 2023-10-10 06:15:22,365 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=266420.0, ans=10.0 2023-10-10 06:15:57,182 INFO [train.py:1031] (3/4) Epoch 5, batch 2500, loss[loss=0.2034, simple_loss=0.2682, pruned_loss=0.06932, over 12758.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3186, pruned_loss=0.078, over 23415710.45 frames. ], batch size: 440, lr: 8.00e-03, grad_scale: 32.0 2023-10-10 06:16:26,365 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.465e+02 1.774e+02 1.929e+02 2.226e+02 3.327e+02, threshold=3.858e+02, percent-clipped=0.0 2023-10-10 06:16:31,268 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=266700.0, ans=0.125 2023-10-10 06:16:32,483 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=266700.0, ans=0.0 2023-10-10 06:16:43,763 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.02 vs. limit=22.5 2023-10-10 06:16:57,589 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.17 vs. 
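[editor's note] The WithLoss lines report the running sum of an auxiliary penalty attached to the attention-weight modules; loss-sum=0.000e+00, as in every occurrence here, means the penalty never fired over the reporting interval. A generic, clearly hypothetical sketch of such an attached loss, assuming a penalty on attention scores whose magnitude exceeds a limit; the wrapper, the limit, and the collection mechanism are all illustrative, not the scaling.py internals.

import torch
import torch.nn as nn

class PenalizedAttnWeights(nn.Module):
    """Accumulates a penalty on out-of-range attention scores for logging."""
    def __init__(self, limit: float = 25.0):
        super().__init__()
        self.limit = limit
        self.loss_sum = 0.0  # running total reported as "loss-sum"

    def forward(self, attn_scores: torch.Tensor) -> torch.Tensor:
        # penalize only the part of each score beyond +/- limit;
        # for well-behaved modules this is exactly zero
        excess = (attn_scores.abs() - self.limit).clamp(min=0.0)
        penalty = excess.sum()
        self.loss_sum += float(penalty.detach())
        # during training, `penalty` would be added to the total loss
        return attn_scores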
limit=15.0 2023-10-10 06:17:03,148 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.80 vs. limit=15.0 2023-10-10 06:17:08,993 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=266840.0, ans=0.0 2023-10-10 06:17:12,581 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=266886.6666666667, ans=0.125 2023-10-10 06:17:14,312 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=16.73 vs. limit=22.5 2023-10-10 06:17:16,251 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.27 vs. limit=15.0 2023-10-10 06:17:18,109 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=266886.6666666667, ans=0.1 2023-10-10 06:17:46,270 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=267026.6666666667, ans=0.07 2023-10-10 06:17:50,777 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=13.69 vs. limit=15.0 2023-10-10 06:17:55,980 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=267026.6666666667, ans=0.125 2023-10-10 06:18:00,487 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=267073.3333333333, ans=0.2 2023-10-10 06:18:09,950 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=267120.0, ans=0.0 2023-10-10 06:18:13,451 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=267120.0, ans=0.125 2023-10-10 06:18:18,386 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.384e+02 1.774e+02 1.934e+02 2.190e+02 3.199e+02, threshold=3.868e+02, percent-clipped=0.0 2023-10-10 06:18:19,079 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.75 vs. limit=22.5 2023-10-10 06:19:01,827 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.11 vs. limit=15.0 2023-10-10 06:19:10,935 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=267353.3333333333, ans=0.125 2023-10-10 06:19:10,948 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=267353.3333333333, ans=0.125 2023-10-10 06:19:25,400 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.52 vs. 
limit=15.0 2023-10-10 06:19:36,008 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=267446.6666666667, ans=0.125 2023-10-10 06:19:41,120 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=267493.3333333333, ans=0.125 2023-10-10 06:19:42,700 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.39 vs. limit=22.5 2023-10-10 06:19:50,129 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=267493.3333333333, ans=0.125 2023-10-10 06:20:14,384 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.425e+02 1.747e+02 1.976e+02 2.222e+02 3.594e+02, threshold=3.952e+02, percent-clipped=0.0 2023-10-10 06:20:51,249 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.96 vs. limit=22.5 2023-10-10 06:20:57,025 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=267773.3333333333, ans=0.125 2023-10-10 06:21:02,812 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=267773.3333333333, ans=0.04949747468305833 2023-10-10 06:21:05,347 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=267773.3333333333, ans=0.125 2023-10-10 06:21:41,284 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.95 vs. limit=22.5 2023-10-10 06:21:44,076 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=267913.3333333333, ans=0.125 2023-10-10 06:21:49,901 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=267960.0, ans=0.0 2023-10-10 06:22:11,117 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.32 vs. limit=15.0 2023-10-10 06:22:17,176 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.477e+02 1.749e+02 1.930e+02 2.387e+02 3.973e+02, threshold=3.860e+02, percent-clipped=1.0 2023-10-10 06:22:37,937 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=268146.6666666667, ans=0.125 2023-10-10 06:22:45,980 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=268146.6666666667, ans=0.0 2023-10-10 06:22:56,970 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.80 vs. limit=22.5 2023-10-10 06:23:43,527 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.38 vs. 
limit=15.0 2023-10-10 06:23:54,752 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 06:24:12,140 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=268473.3333333333, ans=0.125 2023-10-10 06:24:15,601 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=268473.3333333333, ans=0.125 2023-10-10 06:24:25,474 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=268520.0, ans=0.125 2023-10-10 06:24:26,188 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.411e+02 1.770e+02 1.953e+02 2.195e+02 3.188e+02, threshold=3.906e+02, percent-clipped=0.0 2023-10-10 06:24:26,694 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.27 vs. limit=15.0 2023-10-10 06:24:35,421 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=268566.6666666667, ans=0.125 2023-10-10 06:24:39,032 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=268613.3333333333, ans=0.0 2023-10-10 06:25:06,972 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=16.17 vs. limit=22.5 2023-10-10 06:25:10,942 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=268706.6666666667, ans=0.1 2023-10-10 06:25:25,237 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=268800.0, ans=0.2 2023-10-10 06:25:26,291 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 06:25:34,483 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=268846.6666666667, ans=0.0 2023-10-10 06:25:45,692 INFO [train.py:1031] (3/4) Epoch 5, batch 3000, loss[loss=0.2378, simple_loss=0.2878, pruned_loss=0.09395, over 12407.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3174, pruned_loss=0.07751, over 25489113.55 frames. 
], batch size: 440, lr: 7.96e-03, grad_scale: 16.0 2023-10-10 06:25:54,594 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=268893.3333333333, ans=0.015 2023-10-10 06:26:01,803 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=268940.0, ans=0.2 2023-10-10 06:26:15,805 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.381e+02 1.711e+02 1.905e+02 2.184e+02 3.895e+02, threshold=3.809e+02, percent-clipped=0.0 2023-10-10 06:26:18,111 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=269033.3333333333, ans=0.025 2023-10-10 06:26:24,597 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=269033.3333333333, ans=0.125 2023-10-10 06:26:25,442 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=269033.3333333333, ans=0.1 2023-10-10 06:26:50,282 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.11 vs. limit=6.0 2023-10-10 06:27:37,130 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=269313.3333333333, ans=0.125 2023-10-10 06:27:38,422 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=269360.0, ans=0.1 2023-10-10 06:27:46,118 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=269360.0, ans=0.1 2023-10-10 06:27:53,350 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=269360.0, ans=0.1 2023-10-10 06:28:08,310 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=269453.3333333333, ans=0.125 2023-10-10 06:28:14,875 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=269453.3333333333, ans=0.05 2023-10-10 06:28:16,309 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.479e+02 1.847e+02 2.173e+02 2.461e+02 3.508e+02, threshold=4.345e+02, percent-clipped=0.0 2023-10-10 06:28:21,969 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=269500.0, ans=0.2 2023-10-10 06:28:37,241 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=269546.6666666667, ans=0.125 2023-10-10 06:29:05,530 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=269686.6666666667, ans=0.125 2023-10-10 06:29:15,416 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.18 vs. 
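[editor's note] The lr values in the batch summaries decay smoothly within the epoch (8.07e-03 at batch 1500, 8.03e-03 at batch 2000, 7.96e-03 at batch 3000 just above), consistent with icefall's Eden schedule, which discounts the base learning rate by power-law factors in both the global batch index and the epoch. A sketch of that rule; the base_lr, lr_batches, lr_epochs, and global step below are assumptions chosen to reproduce the logged magnitudes approximately.

def eden_lr(base_lr: float, batch: float, epoch: float,
            lr_batches: float = 7500.0, lr_epochs: float = 1.0) -> float:
    # Eden: independent power-law decay factors in batches and in epochs
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor

# Around an assumed global step of ~45k in epoch 5 this lands near 8.1e-03,
# in line with the values logged here.
print(eden_lr(0.045, 45000, 5))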
limit=15.0 2023-10-10 06:29:29,649 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=269780.0, ans=0.125 2023-10-10 06:29:36,696 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=269826.6666666667, ans=0.125 2023-10-10 06:29:36,807 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.24 vs. limit=12.0 2023-10-10 06:29:38,306 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=269826.6666666667, ans=0.125 2023-10-10 06:29:43,054 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=269826.6666666667, ans=0.0 2023-10-10 06:29:48,344 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.14 vs. limit=10.0 2023-10-10 06:29:49,984 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.73 vs. limit=12.0 2023-10-10 06:29:58,586 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.49 vs. limit=10.0 2023-10-10 06:30:08,000 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.399e+02 1.732e+02 1.995e+02 2.215e+02 3.118e+02, threshold=3.991e+02, percent-clipped=0.0 2023-10-10 06:31:14,555 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=270153.3333333333, ans=0.125 2023-10-10 06:31:54,533 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.57 vs. limit=22.5 2023-10-10 06:31:59,859 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=270340.0, ans=0.0 2023-10-10 06:32:09,359 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=270386.6666666667, ans=0.0 2023-10-10 06:32:10,261 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=270386.6666666667, ans=0.2 2023-10-10 06:32:11,496 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=270386.6666666667, ans=0.0 2023-10-10 06:32:14,114 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.460e+02 1.890e+02 2.285e+02 2.774e+02 4.730e+02, threshold=4.570e+02, percent-clipped=2.0 2023-10-10 06:32:21,396 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=270433.3333333333, ans=0.1 2023-10-10 06:32:35,720 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.89 vs. 
limit=12.0 2023-10-10 06:33:09,188 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=270620.0, ans=0.1 2023-10-10 06:33:09,233 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=270620.0, ans=0.0 2023-10-10 06:33:16,976 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=270666.6666666667, ans=0.125 2023-10-10 06:33:19,726 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=270666.6666666667, ans=0.125 2023-10-10 06:33:48,624 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.69 vs. limit=15.0 2023-10-10 06:33:49,215 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=270760.0, ans=0.2 2023-10-10 06:33:49,417 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.44 vs. limit=15.0 2023-10-10 06:34:12,362 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.442e+02 1.747e+02 1.945e+02 2.257e+02 3.070e+02, threshold=3.889e+02, percent-clipped=0.0 2023-10-10 06:34:14,518 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=270900.0, ans=0.0 2023-10-10 06:34:19,542 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=12.63 vs. limit=22.5 2023-10-10 06:34:27,652 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=270946.6666666667, ans=0.125 2023-10-10 06:34:44,323 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=270993.3333333333, ans=0.125 2023-10-10 06:34:45,254 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=270993.3333333333, ans=0.125 2023-10-10 06:34:49,455 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=271040.0, ans=0.0 2023-10-10 06:34:56,531 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=271040.0, ans=0.0 2023-10-10 06:35:02,691 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=271086.6666666667, ans=0.1 2023-10-10 06:35:18,742 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=271133.3333333333, ans=0.125 2023-10-10 06:35:19,902 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=271133.3333333333, ans=0.1 2023-10-10 06:35:34,972 INFO [train.py:1031] (3/4) Epoch 5, batch 3500, loss[loss=0.2337, simple_loss=0.3177, pruned_loss=0.07486, over 16867.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3172, pruned_loss=0.07751, over 27109237.60 frames. 
], batch size: 155, lr: 7.93e-03, grad_scale: 32.0 2023-10-10 06:35:41,258 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=271226.6666666667, ans=0.125 2023-10-10 06:35:46,022 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.38 vs. limit=15.0 2023-10-10 06:35:59,536 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=271320.0, ans=0.125 2023-10-10 06:36:05,619 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.458e+02 1.690e+02 1.933e+02 2.157e+02 3.510e+02, threshold=3.865e+02, percent-clipped=0.0 2023-10-10 06:36:27,383 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=271413.3333333333, ans=0.0 2023-10-10 06:36:31,674 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=271460.0, ans=0.2 2023-10-10 06:36:36,128 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=271460.0, ans=0.125 2023-10-10 06:37:05,461 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=271553.3333333333, ans=0.1 2023-10-10 06:37:10,937 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=271553.3333333333, ans=0.1 2023-10-10 06:37:18,388 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=271600.0, ans=0.125 2023-10-10 06:37:42,146 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=271693.3333333333, ans=0.1 2023-10-10 06:38:06,278 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=271786.6666666667, ans=0.5 2023-10-10 06:38:09,971 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.394e+02 1.716e+02 1.904e+02 2.245e+02 2.766e+02, threshold=3.809e+02, percent-clipped=0.0 2023-10-10 06:38:32,070 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=271880.0, ans=0.125 2023-10-10 06:39:04,005 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=272020.0, ans=0.125 2023-10-10 06:39:06,319 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=272020.0, ans=0.125 2023-10-10 06:39:28,741 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=272113.3333333333, ans=0.0 2023-10-10 06:39:33,923 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.51 vs. limit=22.5 2023-10-10 06:39:42,570 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.46 vs. 
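[editor's note] Many of the scheduled values above are stochastic skip rates (attention_skip_rate, conv_skip_rate, ff2_skip_rate, ff3_skip_rate, bypass.skip_rate): probabilities of bypassing a sub-module entirely for a given batch, acting as a layerdrop-style regularizer that is typically annealed toward 0.0, which is why most of them read ans=0.0 at this point in training. A minimal sketch of applying such a rate; the module and rate are illustrative placeholders.

import torch
import torch.nn as nn

def maybe_skip(module: nn.Module, x: torch.Tensor, skip_rate: float,
               training: bool = True) -> torch.Tensor:
    # with probability skip_rate, bypass the sub-module (identity pass-through)
    if training and torch.rand(()) < skip_rate:
        return x
    return x + module(x)  # otherwise apply it as a residual branch

ff = nn.Linear(256, 256)  # stand-in for a feed-forward sub-module
y = maybe_skip(ff, torch.randn(8, 256), skip_rate=0.0)  # late training: never skipped

Skipping whole branches early in training cheaply regularizes the deep stack; annealing the rate to zero restores the full network before convergence.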
limit=15.0 2023-10-10 06:40:07,823 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.354e+02 1.870e+02 2.163e+02 2.587e+02 4.217e+02, threshold=4.327e+02, percent-clipped=1.0 2023-10-10 06:40:16,485 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=272300.0, ans=0.0 2023-10-10 06:40:20,196 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=272300.0, ans=0.1 2023-10-10 06:40:23,309 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=272346.6666666667, ans=0.125 2023-10-10 06:40:24,333 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 06:40:44,460 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=272393.3333333333, ans=0.125 2023-10-10 06:40:52,469 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=272440.0, ans=0.0 2023-10-10 06:41:01,342 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=272486.6666666667, ans=0.125 2023-10-10 06:41:05,515 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.71 vs. limit=6.0 2023-10-10 06:41:29,057 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 06:42:05,184 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=272720.0, ans=0.0 2023-10-10 06:42:10,134 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.545e+02 1.775e+02 2.000e+02 2.230e+02 3.104e+02, threshold=4.000e+02, percent-clipped=0.0 2023-10-10 06:42:17,286 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=272766.6666666667, ans=0.2 2023-10-10 06:42:31,340 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.18 vs. limit=6.0 2023-10-10 06:43:32,970 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=273093.3333333333, ans=0.0 2023-10-10 06:43:36,238 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=273093.3333333333, ans=10.0 2023-10-10 06:43:59,573 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.467e+02 1.689e+02 1.990e+02 2.315e+02 4.038e+02, threshold=3.980e+02, percent-clipped=1.0 2023-10-10 06:44:10,413 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=273233.3333333333, ans=0.04949747468305833 2023-10-10 06:44:19,738 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_positive, batch_count=273280.0, ans=0.05 2023-10-10 06:44:22,477 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.44 vs. 
limit=15.0 2023-10-10 06:44:33,359 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=273373.3333333333, ans=0.125 2023-10-10 06:44:40,341 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=273373.3333333333, ans=0.125 2023-10-10 06:45:09,980 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=273513.3333333333, ans=0.1 2023-10-10 06:45:17,393 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=273513.3333333333, ans=0.1 2023-10-10 06:45:20,882 INFO [train.py:1031] (3/4) Epoch 5, batch 4000, loss[loss=0.2572, simple_loss=0.3429, pruned_loss=0.08578, over 16826.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.316, pruned_loss=0.07706, over 28347688.59 frames. ], batch size: 188, lr: 7.89e-03, grad_scale: 32.0 2023-10-10 06:45:55,889 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=5.31 vs. limit=12.0 2023-10-10 06:45:56,265 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.528e+02 1.819e+02 2.014e+02 2.211e+02 2.956e+02, threshold=4.027e+02, percent-clipped=0.0 2023-10-10 06:45:58,836 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=273700.0, ans=0.1 2023-10-10 06:46:36,887 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.00 vs. limit=15.0 2023-10-10 06:46:54,165 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=273933.3333333333, ans=0.1 2023-10-10 06:46:57,818 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=273933.3333333333, ans=0.125 2023-10-10 06:46:57,842 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=273933.3333333333, ans=0.0 2023-10-10 06:47:06,386 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=273980.0, ans=0.2 2023-10-10 06:47:26,922 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=274073.3333333333, ans=0.125 2023-10-10 06:47:27,953 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=274073.3333333333, ans=0.0 2023-10-10 06:47:33,651 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=274073.3333333333, ans=0.0 2023-10-10 06:47:33,652 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=274073.3333333333, ans=0.125 2023-10-10 06:47:38,199 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=274120.0, ans=0.0 2023-10-10 06:47:38,218 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=274120.0, ans=0.09899494936611666 2023-10-10 06:47:50,047 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.474e+02 1.864e+02 
2.132e+02 2.632e+02 3.693e+02, threshold=4.263e+02, percent-clipped=0.0 2023-10-10 06:48:19,573 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=274260.0, ans=0.125 2023-10-10 06:48:47,741 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=274353.3333333333, ans=0.125 2023-10-10 06:49:02,142 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.09 vs. limit=22.5 2023-10-10 06:49:02,214 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.50 vs. limit=10.0 2023-10-10 06:49:03,235 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=274400.0, ans=0.125 2023-10-10 06:49:08,226 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=274400.0, ans=0.125 2023-10-10 06:49:16,404 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.42 vs. limit=15.0 2023-10-10 06:49:19,824 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.28 vs. limit=22.5 2023-10-10 06:49:52,240 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=274586.6666666667, ans=0.125 2023-10-10 06:50:00,843 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.433e+02 1.797e+02 2.056e+02 2.431e+02 3.962e+02, threshold=4.112e+02, percent-clipped=0.0 2023-10-10 06:50:06,095 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=274633.3333333333, ans=0.0 2023-10-10 06:50:14,589 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=274680.0, ans=0.2 2023-10-10 06:50:25,310 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=274726.6666666667, ans=0.2 2023-10-10 06:50:34,224 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.41 vs. limit=15.0 2023-10-10 06:50:47,910 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=274820.0, ans=0.0 2023-10-10 06:51:00,653 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=274866.6666666667, ans=0.125 2023-10-10 06:51:04,547 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=274866.6666666667, ans=0.125 2023-10-10 06:51:08,388 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=274913.3333333333, ans=0.125 2023-10-10 06:51:08,532 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=15.67 vs. 
limit=22.5 2023-10-10 06:51:17,366 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=274913.3333333333, ans=0.125 2023-10-10 06:51:32,110 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=275006.6666666667, ans=0.125 2023-10-10 06:51:36,631 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=275006.6666666667, ans=0.2 2023-10-10 06:51:50,663 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.79 vs. limit=15.0 2023-10-10 06:51:53,465 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.325e+02 1.805e+02 2.049e+02 2.323e+02 3.623e+02, threshold=4.099e+02, percent-clipped=0.0 2023-10-10 06:52:04,710 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.46 vs. limit=22.5 2023-10-10 06:52:20,971 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=275193.3333333333, ans=0.125 2023-10-10 06:52:32,037 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=275240.0, ans=0.2 2023-10-10 06:52:32,917 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=275240.0, ans=0.1 2023-10-10 06:53:02,259 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.98 vs. limit=10.0 2023-10-10 06:53:08,197 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.45 vs. limit=12.0 2023-10-10 06:53:18,949 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=12.59 vs. limit=22.5 2023-10-10 06:53:19,791 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=275426.6666666667, ans=0.125 2023-10-10 06:53:20,554 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=275426.6666666667, ans=0.0 2023-10-10 06:53:42,452 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=275520.0, ans=0.125 2023-10-10 06:53:44,496 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.80 vs. 
limit=15.0 2023-10-10 06:53:48,022 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.501e+02 1.858e+02 2.069e+02 2.331e+02 3.898e+02, threshold=4.139e+02, percent-clipped=0.0 2023-10-10 06:54:12,256 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=275613.3333333333, ans=0.0 2023-10-10 06:54:17,843 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=275613.3333333333, ans=0.1 2023-10-10 06:54:21,348 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=275660.0, ans=0.1 2023-10-10 06:54:24,575 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.whiten.whitening_limit, batch_count=275660.0, ans=12.0 2023-10-10 06:54:27,385 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=275660.0, ans=0.0 2023-10-10 06:54:47,928 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.02 vs. limit=15.0 2023-10-10 06:54:55,875 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.22 vs. limit=15.0 2023-10-10 06:55:04,036 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=275800.0, ans=0.0 2023-10-10 06:55:06,089 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=275800.0, ans=0.0 2023-10-10 06:55:06,947 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=275800.0, ans=0.125 2023-10-10 06:55:12,769 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=275846.6666666667, ans=0.1 2023-10-10 06:55:21,071 INFO [train.py:1031] (3/4) Epoch 5, batch 4500, loss[loss=0.2424, simple_loss=0.327, pruned_loss=0.07889, over 16853.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3163, pruned_loss=0.07671, over 29349112.46 frames. ], batch size: 175, lr: 7.86e-03, grad_scale: 32.0 2023-10-10 06:55:22,016 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=275893.3333333333, ans=0.2 2023-10-10 06:55:38,364 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.65 vs. limit=15.0 2023-10-10 06:55:44,057 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=275986.6666666667, ans=0.2 2023-10-10 06:55:50,345 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.21 vs. 
limit=15.0 2023-10-10 06:55:53,695 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.306e+02 1.732e+02 1.874e+02 2.074e+02 3.499e+02, threshold=3.748e+02, percent-clipped=0.0 2023-10-10 06:56:03,115 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=276033.3333333333, ans=0.125 2023-10-10 06:56:11,334 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.28 vs. limit=15.0 2023-10-10 06:56:15,627 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.06 vs. limit=15.0 2023-10-10 06:56:18,298 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=276126.6666666667, ans=0.125 2023-10-10 06:56:21,258 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=276126.6666666667, ans=0.125 2023-10-10 06:56:21,546 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.77 vs. limit=22.5 2023-10-10 06:56:24,346 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.20 vs. limit=15.0 2023-10-10 06:57:05,174 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=276313.3333333333, ans=0.125 2023-10-10 06:57:06,034 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=276313.3333333333, ans=0.0 2023-10-10 06:57:13,003 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.13 vs. limit=15.0 2023-10-10 06:57:29,120 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=276406.6666666667, ans=0.125 2023-10-10 06:57:30,377 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.39 vs. limit=15.0 2023-10-10 06:57:40,369 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.396e+02 1.729e+02 1.852e+02 2.035e+02 2.827e+02, threshold=3.703e+02, percent-clipped=0.0 2023-10-10 06:57:53,686 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=10.96 vs. 
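limit=12.0

The train.py:1031 records above report the transducer loss decomposed into a simple (linear) part and a pruned part. The logged numbers are internally consistent with the total being 0.5 * simple_loss + pruned_loss; that weight is inferred here from the values themselves, not quoted from the training code. A quick check in Python:

    # Hedged sanity check of the apparent relation tot = 0.5 * simple + pruned,
    # using the values from the "Epoch 5, batch 4500" record above.
    for simple, pruned, tot in [(0.327, 0.07889, 0.2424),    # this batch
                                (0.3163, 0.07671, 0.2348)]:  # running average
        assert abs(0.5 * simple + pruned - tot) < 1e-3
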
2023-10-10 06:58:01,049 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_na.min_abs, batch_count=276546.6666666667, ans=0.02 2023-10-10 06:58:02,551 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=276546.6666666667, ans=0.125 2023-10-10 06:58:07,657 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=276593.3333333333, ans=0.1 2023-10-10 06:58:13,383 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=276640.0, ans=0.125 2023-10-10 06:58:18,628 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=276640.0, ans=0.1 2023-10-10 06:58:24,102 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=276686.6666666667, ans=0.125 2023-10-10 06:58:29,607 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.48 vs. limit=15.0 2023-10-10 06:58:33,971 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.48 vs. limit=22.5 2023-10-10 06:58:35,259 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.83 vs. limit=15.0 2023-10-10 06:58:52,122 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.88 vs. limit=10.0 2023-10-10 06:59:00,652 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=276826.6666666667, ans=0.09899494936611666 2023-10-10 06:59:14,879 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.41 vs. limit=12.0 2023-10-10 06:59:28,779 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.45 vs. limit=15.0 2023-10-10 06:59:29,198 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.391e+02 1.775e+02 1.948e+02 2.176e+02 3.021e+02, threshold=3.896e+02, percent-clipped=0.0 2023-10-10 06:59:39,134 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=276966.6666666667, ans=0.125 2023-10-10 07:00:08,788 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.40 vs. 
limit=22.5 2023-10-10 07:00:10,424 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=277106.6666666667, ans=0.125 2023-10-10 07:00:14,869 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=277153.3333333333, ans=0.2 2023-10-10 07:00:34,308 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=277246.6666666667, ans=0.0 2023-10-10 07:00:38,846 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=277246.6666666667, ans=0.125 2023-10-10 07:00:52,218 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=277293.3333333333, ans=0.0 2023-10-10 07:01:13,184 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.68 vs. limit=22.5 2023-10-10 07:01:19,626 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.507e+02 1.781e+02 2.030e+02 2.372e+02 3.639e+02, threshold=4.060e+02, percent-clipped=0.0 2023-10-10 07:01:32,279 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=277433.3333333333, ans=0.0 2023-10-10 07:01:50,766 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=277526.6666666667, ans=0.125 2023-10-10 07:02:09,720 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=277620.0, ans=0.1 2023-10-10 07:02:26,008 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=277713.3333333333, ans=0.95 2023-10-10 07:02:44,731 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.95 vs. limit=22.5 2023-10-10 07:02:55,214 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=277806.6666666667, ans=0.125 2023-10-10 07:02:59,304 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=277806.6666666667, ans=0.125 2023-10-10 07:03:10,411 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.421e+02 1.863e+02 2.076e+02 2.543e+02 3.586e+02, threshold=4.152e+02, percent-clipped=0.0 2023-10-10 07:03:15,600 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=277900.0, ans=0.1 2023-10-10 07:03:24,105 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=277900.0, ans=0.0 2023-10-10 07:03:34,308 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=277946.6666666667, ans=0.125 2023-10-10 07:03:47,587 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.90 vs. 
limit=15.0 2023-10-10 07:03:51,071 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=278040.0, ans=0.125 2023-10-10 07:03:56,292 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.14 vs. limit=8.0 2023-10-10 07:04:22,445 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=278133.3333333333, ans=0.0 2023-10-10 07:04:23,389 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.67 vs. limit=22.5 2023-10-10 07:04:35,587 INFO [train.py:1031] (3/4) Epoch 5, batch 5000, loss[loss=0.2355, simple_loss=0.3217, pruned_loss=0.07466, over 15331.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3161, pruned_loss=0.07676, over 30125729.21 frames. ], batch size: 35, lr: 7.83e-03, grad_scale: 64.0 2023-10-10 07:05:07,544 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.407e+02 1.790e+02 1.994e+02 2.169e+02 2.825e+02, threshold=3.987e+02, percent-clipped=0.0 2023-10-10 07:05:13,666 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.06 vs. limit=15.0 2023-10-10 07:05:39,667 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=278506.6666666667, ans=0.125 2023-10-10 07:06:07,663 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=278600.0, ans=0.125 2023-10-10 07:06:46,746 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=278740.0, ans=0.125 2023-10-10 07:06:48,210 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=278740.0, ans=0.2 2023-10-10 07:06:56,503 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=278786.6666666667, ans=0.125 2023-10-10 07:07:00,908 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.502e+02 1.803e+02 2.071e+02 2.479e+02 3.574e+02, threshold=4.143e+02, percent-clipped=0.0 2023-10-10 07:07:03,152 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=278833.3333333333, ans=0.07 2023-10-10 07:07:06,620 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=278833.3333333333, ans=0.125 2023-10-10 07:07:13,882 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=278880.0, ans=0.1 2023-10-10 07:07:13,968 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=278880.0, ans=0.125 2023-10-10 07:07:16,132 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=278880.0, ans=0.125 2023-10-10 07:07:21,376 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=278880.0, ans=0.0 2023-10-10 07:07:26,083 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, 
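batch_count=278926.6666666667, ans=0.0

The grad_scale field in the "Epoch 5, batch 5000" record above has doubled from 32.0 to 64.0 (and is back at 32.0 by batch 5500), which is the signature of dynamic loss scaling in fp16 training: the scale grows after a run of overflow-free steps and is halved when an overflow is detected. Below is a minimal sketch of that mechanism using PyTorch's stock GradScaler; this is generic AMP usage with a toy model, not the actual loop in train.py:

    import torch

    model = torch.nn.Linear(80, 500).cuda()  # toy stand-in, not the zipformer
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    scaler = torch.cuda.amp.GradScaler(init_scale=32.0)

    for step in range(1000):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = model(torch.randn(16, 80, device="cuda")).pow(2).mean()
        scaler.scale(loss).backward()  # backward runs on the scaled loss
        scaler.step(optimizer)         # unscales grads; skips the step on inf/nan
        scaler.update()                # periodically doubles the scale, halves it on overflow
        # scaler.get_scale() shows jumps such as 32.0 -> 64.0, as in the log
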
2023-10-10 07:07:33,242 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=278926.6666666667, ans=0.2 2023-10-10 07:07:54,760 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=5.529e-03 2023-10-10 07:07:58,646 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_positive, batch_count=279066.6666666667, ans=0.05 2023-10-10 07:08:16,197 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=279113.3333333333, ans=0.1 2023-10-10 07:08:24,279 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=279160.0, ans=0.1 2023-10-10 07:08:32,272 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=279206.6666666667, ans=0.0 2023-10-10 07:08:41,370 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=279253.3333333333, ans=0.125 2023-10-10 07:08:43,135 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=279253.3333333333, ans=0.1 2023-10-10 07:08:48,067 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.360e+02 1.727e+02 1.897e+02 2.147e+02 3.329e+02, threshold=3.795e+02, percent-clipped=0.0 2023-10-10 07:09:03,566 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.28 vs. limit=10.0 2023-10-10 07:09:07,777 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.27 vs. limit=22.5 2023-10-10 07:09:26,204 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=279440.0, ans=0.125 2023-10-10 07:09:30,387 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.51 vs. limit=15.0 2023-10-10 07:09:50,595 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=279533.3333333333, ans=0.5 2023-10-10 07:09:52,561 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=279533.3333333333, ans=0.0 2023-10-10 07:10:04,009 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=279580.0, ans=0.0 2023-10-10 07:10:04,979 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=279580.0, ans=0.0 2023-10-10 07:10:09,503 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=279580.0, ans=0.0 2023-10-10 07:10:16,917 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.40 vs. 
limit=22.5 2023-10-10 07:10:26,998 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=279673.3333333333, ans=0.125 2023-10-10 07:10:29,854 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=279673.3333333333, ans=0.125 2023-10-10 07:10:45,359 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.398e+02 1.694e+02 1.853e+02 2.096e+02 2.977e+02, threshold=3.706e+02, percent-clipped=0.0 2023-10-10 07:10:46,304 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=279766.6666666667, ans=0.0 2023-10-10 07:10:53,861 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=279766.6666666667, ans=0.125 2023-10-10 07:11:05,158 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=279813.3333333333, ans=0.125 2023-10-10 07:11:07,700 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=279813.3333333333, ans=0.2 2023-10-10 07:11:25,534 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=279906.6666666667, ans=0.0 2023-10-10 07:11:26,346 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=279906.6666666667, ans=0.0 2023-10-10 07:11:30,516 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=279953.3333333333, ans=0.0 2023-10-10 07:11:36,772 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=279953.3333333333, ans=0.125 2023-10-10 07:11:40,354 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=279953.3333333333, ans=0.125 2023-10-10 07:12:12,084 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=280093.3333333333, ans=0.1 2023-10-10 07:12:33,172 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.78 vs. limit=22.5 2023-10-10 07:12:34,618 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.whiten.whitening_limit, batch_count=280186.6666666667, ans=12.0 2023-10-10 07:12:36,808 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.405e+02 1.702e+02 1.883e+02 2.175e+02 4.677e+02, threshold=3.765e+02, percent-clipped=2.0 2023-10-10 07:12:42,131 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=280233.3333333333, ans=0.0 2023-10-10 07:12:49,104 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=280280.0, ans=0.125 2023-10-10 07:13:02,563 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=1.99 vs. 
limit=15.0 2023-10-10 07:13:13,151 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=280373.3333333333, ans=0.1 2023-10-10 07:13:13,178 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=280373.3333333333, ans=0.125 2023-10-10 07:13:22,579 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=280420.0, ans=0.125 2023-10-10 07:13:48,871 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=280513.3333333333, ans=0.035 2023-10-10 07:13:51,502 INFO [train.py:1031] (3/4) Epoch 5, batch 5500, loss[loss=0.2233, simple_loss=0.3033, pruned_loss=0.07167, over 16477.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3156, pruned_loss=0.07638, over 30710504.82 frames. ], batch size: 50, lr: 7.80e-03, grad_scale: 32.0 2023-10-10 07:14:02,159 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=280606.6666666667, ans=0.125 2023-10-10 07:14:21,525 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.457e+02 1.771e+02 1.983e+02 2.415e+02 3.797e+02, threshold=3.966e+02, percent-clipped=1.0 2023-10-10 07:14:29,499 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=280700.0, ans=10.0 2023-10-10 07:14:29,552 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=280700.0, ans=0.1 2023-10-10 07:14:50,328 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=280793.3333333333, ans=0.0 2023-10-10 07:15:09,502 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.00 vs. limit=6.0 2023-10-10 07:15:30,029 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=280980.0, ans=0.125 2023-10-10 07:15:37,142 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.94 vs. limit=15.0 2023-10-10 07:15:39,002 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=281026.6666666667, ans=0.1 2023-10-10 07:16:08,486 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.357e+02 1.755e+02 1.958e+02 2.183e+02 3.078e+02, threshold=3.917e+02, percent-clipped=0.0 2023-10-10 07:16:09,699 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=281166.6666666667, ans=0.0 2023-10-10 07:16:17,276 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.27 vs. 
limit=15.0 2023-10-10 07:16:32,165 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=281260.0, ans=0.125 2023-10-10 07:17:02,653 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=281353.3333333333, ans=0.0 2023-10-10 07:17:18,285 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.42 vs. limit=15.0 2023-10-10 07:17:23,702 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=281446.6666666667, ans=0.0 2023-10-10 07:17:25,199 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=281446.6666666667, ans=0.125 2023-10-10 07:17:32,012 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=281493.3333333333, ans=0.1 2023-10-10 07:17:32,040 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=281493.3333333333, ans=0.0 2023-10-10 07:17:41,974 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=281540.0, ans=0.0 2023-10-10 07:17:51,356 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=281586.6666666667, ans=0.125 2023-10-10 07:18:01,101 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.404e+02 1.735e+02 1.947e+02 2.338e+02 3.500e+02, threshold=3.895e+02, percent-clipped=0.0 2023-10-10 07:18:07,986 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=281633.3333333333, ans=0.0 2023-10-10 07:18:14,956 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=281680.0, ans=0.0 2023-10-10 07:18:17,157 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=281680.0, ans=0.0 2023-10-10 07:18:27,612 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.min_positive, batch_count=281726.6666666667, ans=0.025 2023-10-10 07:18:28,673 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.91 vs. limit=15.0 2023-10-10 07:18:30,151 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=281726.6666666667, ans=0.125 2023-10-10 07:18:38,630 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.32 vs. limit=22.5 2023-10-10 07:18:51,582 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.45 vs. 
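limit=22.5

The scaling.py:199 ScheduledFloat records that dominate this log print the current value (ans=...) of a hyperparameter, such as a dropout probability, skip rate or bypass scale, as a function of batch_count. A plausible minimal reading is a piecewise-linear schedule over (batch_count, value) breakpoints; the function and the breakpoints below are illustrative assumptions, not the actual scaling.py implementation:

    def scheduled_float(batch_count, points):
        """Piecewise-linear lookup; points is a sorted list of (batch_count, value)."""
        x0, y0 = points[0]
        if batch_count <= x0:
            return y0
        for x1, y1 in points[1:]:
            if batch_count <= x1:
                return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)
            x0, y0 = x1, y1
        return y0  # constant after the last breakpoint

    # e.g. a dropout that decays from 0.3 to 0.1 over the first 20k batches and
    # then stays flat, which is consistent with the many ans=0.1 entries above:
    assert scheduled_float(275240.0, [(0.0, 0.3), (20000.0, 0.1)]) == 0.1
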
2023-10-10 07:18:55,728 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=281820.0, ans=0.125 2023-10-10 07:19:35,345 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=282006.6666666667, ans=0.125 2023-10-10 07:19:40,320 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=282006.6666666667, ans=0.1 2023-10-10 07:19:47,315 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.54 vs. limit=22.5 2023-10-10 07:19:50,877 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=282053.3333333333, ans=0.0 2023-10-10 07:19:55,738 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.450e+02 1.844e+02 2.046e+02 2.399e+02 3.521e+02, threshold=4.092e+02, percent-clipped=0.0 2023-10-10 07:19:57,248 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=1.99 vs. limit=15.0 2023-10-10 07:20:03,381 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.62 vs. limit=15.0 2023-10-10 07:20:09,471 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=282146.6666666667, ans=0.125 2023-10-10 07:20:29,252 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=282193.3333333333, ans=0.0 2023-10-10 07:20:52,803 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=282333.3333333333, ans=0.125 2023-10-10 07:20:54,585 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=282333.3333333333, ans=0.125 2023-10-10 07:20:58,182 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=282333.3333333333, ans=0.125 2023-10-10 07:20:58,215 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=282333.3333333333, ans=0.125 2023-10-10 07:21:05,477 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=282380.0, ans=0.125 2023-10-10 07:21:25,733 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.41 vs. limit=15.0 2023-10-10 07:21:30,423 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.94 vs. 
limit=15.0 2023-10-10 07:21:49,768 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.435e+02 1.741e+02 1.906e+02 2.171e+02 3.169e+02, threshold=3.812e+02, percent-clipped=0.0 2023-10-10 07:22:21,486 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=282706.6666666667, ans=0.125 2023-10-10 07:22:25,141 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=282706.6666666667, ans=0.035 2023-10-10 07:22:32,209 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=282753.3333333333, ans=0.125 2023-10-10 07:22:33,700 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=282753.3333333333, ans=0.125 2023-10-10 07:22:41,769 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=282753.3333333333, ans=0.0 2023-10-10 07:22:50,824 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=282800.0, ans=0.125 2023-10-10 07:22:52,629 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=282800.0, ans=0.1 2023-10-10 07:22:59,631 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.85 vs. limit=22.5 2023-10-10 07:23:01,810 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.19 vs. limit=10.0 2023-10-10 07:23:06,549 INFO [train.py:1031] (3/4) Epoch 5, batch 6000, loss[loss=0.2644, simple_loss=0.3465, pruned_loss=0.09112, over 16839.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3158, pruned_loss=0.07636, over 31210626.43 frames. ], batch size: 175, lr: 7.77e-03, grad_scale: 32.0 2023-10-10 07:23:07,670 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=282893.3333333333, ans=0.125 2023-10-10 07:23:16,916 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 07:23:23,969 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.27 vs. limit=15.0 2023-10-10 07:23:39,736 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.421e+02 1.785e+02 1.973e+02 2.403e+02 3.873e+02, threshold=3.946e+02, percent-clipped=1.0 2023-10-10 07:23:42,307 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=283033.3333333333, ans=0.125 2023-10-10 07:24:06,733 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=283126.6666666667, ans=0.2 2023-10-10 07:24:11,090 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.19 vs. 
limit=15.0 2023-10-10 07:24:40,236 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=283266.6666666667, ans=0.2 2023-10-10 07:24:46,578 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=283313.3333333333, ans=0.0 2023-10-10 07:24:47,518 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=283313.3333333333, ans=0.125 2023-10-10 07:24:50,927 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=283313.3333333333, ans=0.0 2023-10-10 07:24:54,613 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.60 vs. limit=22.5 2023-10-10 07:25:00,399 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=283360.0, ans=0.2 2023-10-10 07:25:18,303 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=283453.3333333333, ans=0.09899494936611666 2023-10-10 07:25:21,213 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=283453.3333333333, ans=0.125 2023-10-10 07:25:28,976 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.372e+02 1.722e+02 1.896e+02 2.244e+02 3.122e+02, threshold=3.792e+02, percent-clipped=0.0 2023-10-10 07:25:31,763 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=283500.0, ans=0.1 2023-10-10 07:26:12,643 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.01 vs. limit=22.5 2023-10-10 07:26:18,163 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.07 vs. limit=22.5 2023-10-10 07:26:23,786 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=283733.3333333333, ans=0.125 2023-10-10 07:26:24,026 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.49 vs. 
limit=15.0 2023-10-10 07:26:29,742 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=283733.3333333333, ans=0.0 2023-10-10 07:26:36,077 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=283780.0, ans=0.2 2023-10-10 07:26:42,259 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=283780.0, ans=0.2 2023-10-10 07:27:02,633 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=283873.3333333333, ans=0.125 2023-10-10 07:27:19,243 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.475e+02 1.812e+02 2.050e+02 2.352e+02 3.885e+02, threshold=4.099e+02, percent-clipped=1.0 2023-10-10 07:27:20,869 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=283966.6666666667, ans=0.0 2023-10-10 07:27:22,056 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=283966.6666666667, ans=0.125 2023-10-10 07:27:51,092 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.31 vs. limit=10.0 2023-10-10 07:27:52,655 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=284106.6666666667, ans=0.0 2023-10-10 07:27:54,982 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=284106.6666666667, ans=0.0 2023-10-10 07:28:04,177 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=284153.3333333333, ans=0.0 2023-10-10 07:28:06,338 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=284153.3333333333, ans=0.1 2023-10-10 07:28:20,780 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=284200.0, ans=0.125 2023-10-10 07:28:22,678 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=284200.0, ans=0.125 2023-10-10 07:28:23,558 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=284246.6666666667, ans=0.125 2023-10-10 07:28:33,168 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.69 vs. 
limit=15.0 2023-10-10 07:28:33,747 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 07:28:40,641 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=284293.3333333333, ans=0.125 2023-10-10 07:28:45,475 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=284340.0, ans=0.2 2023-10-10 07:29:05,484 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 07:29:07,175 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.515e+02 1.849e+02 2.039e+02 2.474e+02 4.305e+02, threshold=4.078e+02, percent-clipped=1.0 2023-10-10 07:29:13,893 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=284433.3333333333, ans=0.2 2023-10-10 07:29:22,006 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=284480.0, ans=0.125 2023-10-10 07:29:22,356 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.71 vs. limit=15.0 2023-10-10 07:29:32,902 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=284526.6666666667, ans=0.0 2023-10-10 07:29:49,705 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=284573.3333333333, ans=0.0 2023-10-10 07:29:58,477 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=284620.0, ans=0.1 2023-10-10 07:30:10,020 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=284666.6666666667, ans=0.125 2023-10-10 07:30:20,529 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=284713.3333333333, ans=0.0 2023-10-10 07:30:22,674 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=284713.3333333333, ans=0.2 2023-10-10 07:30:24,629 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=284713.3333333333, ans=0.0 2023-10-10 07:30:29,702 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.18 vs. limit=15.0 2023-10-10 07:30:44,653 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.28 vs. 
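limit=15.0

The scaling.py:979 Whitening records compare a per-module whiteness statistic of the activations against a limit (for example metric=7.28 vs. limit=15.0 immediately above). A standard way to quantify whiteness, shown here only to illustrate the idea and not claimed to be the exact metric in scaling.py, is the dispersion of the eigenvalues of the feature covariance: the statistic is 1.0 for perfectly white features and grows as the spectrum spreads.

    import torch

    def whitening_metric(x):
        """x: (num_frames, num_channels); returns d * tr(C^2) / tr(C)^2 >= 1."""
        x = x - x.mean(dim=0)
        cov = (x.T @ x) / x.shape[0]
        d = cov.shape[0]
        return (d * (cov @ cov).diagonal().sum() / cov.diagonal().sum() ** 2).item()

    white = torch.randn(10000, 256)
    print(whitening_metric(white))                               # close to 1.0
    print(whitening_metric(white * torch.logspace(-1, 1, 256)))  # noticeably larger
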
2023-10-10 07:30:52,921 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=284853.3333333333, ans=0.125 2023-10-10 07:31:05,048 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.284e+02 1.687e+02 1.799e+02 1.991e+02 2.851e+02, threshold=3.597e+02, percent-clipped=0.0 2023-10-10 07:31:08,134 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=284900.0, ans=0.07 2023-10-10 07:31:08,344 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.68 vs. limit=10.0 2023-10-10 07:31:10,132 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=284900.0, ans=0.125 2023-10-10 07:31:23,010 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=284946.6666666667, ans=0.125 2023-10-10 07:31:30,562 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=284993.3333333333, ans=0.125 2023-10-10 07:31:48,816 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=285086.6666666667, ans=0.1 2023-10-10 07:31:53,101 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=285086.6666666667, ans=0.125 2023-10-10 07:32:07,560 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=285133.3333333333, ans=0.125 2023-10-10 07:32:22,908 INFO [train.py:1031] (3/4) Epoch 5, batch 6500, loss[loss=0.2299, simple_loss=0.3195, pruned_loss=0.07014, over 16907.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3162, pruned_loss=0.07656, over 31524873.58 frames. 
], batch size: 77, lr: 7.73e-03, grad_scale: 32.0 2023-10-10 07:32:31,711 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=285226.6666666667, ans=0.0 2023-10-10 07:32:50,019 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=285320.0, ans=0.125 2023-10-10 07:33:01,431 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.472e+02 1.830e+02 2.204e+02 2.593e+02 4.019e+02, threshold=4.408e+02, percent-clipped=5.0 2023-10-10 07:33:21,708 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=285413.3333333333, ans=0.1 2023-10-10 07:33:21,742 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=285413.3333333333, ans=0.125 2023-10-10 07:33:35,758 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=285460.0, ans=0.125 2023-10-10 07:34:00,037 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=285553.3333333333, ans=0.0 2023-10-10 07:34:35,376 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=285740.0, ans=0.125 2023-10-10 07:34:48,268 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=285786.6666666667, ans=0.05 2023-10-10 07:34:53,147 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=285833.3333333333, ans=0.0 2023-10-10 07:34:53,980 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.349e+02 1.660e+02 1.921e+02 2.185e+02 4.191e+02, threshold=3.841e+02, percent-clipped=0.0 2023-10-10 07:34:55,683 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=285833.3333333333, ans=0.125 2023-10-10 07:35:09,240 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.63 vs. limit=15.0 2023-10-10 07:35:19,726 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=285926.6666666667, ans=0.07 2023-10-10 07:35:31,293 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.81 vs. limit=15.0 2023-10-10 07:35:32,055 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=285973.3333333333, ans=0.125 2023-10-10 07:35:39,075 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=286020.0, ans=0.0 2023-10-10 07:35:47,047 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=286066.6666666667, ans=0.1 2023-10-10 07:36:00,436 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.76 vs. limit=15.0 2023-10-10 07:36:03,289 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.09 vs. 
limit=15.0 2023-10-10 07:36:09,063 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=286160.0, ans=0.0 2023-10-10 07:36:10,903 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=286160.0, ans=0.1 2023-10-10 07:36:17,338 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=286206.6666666667, ans=0.0 2023-10-10 07:36:38,205 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff3.min_abs, batch_count=286253.3333333333, ans=0.2 2023-10-10 07:36:39,778 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.453e+02 1.724e+02 1.910e+02 2.174e+02 3.138e+02, threshold=3.820e+02, percent-clipped=0.0 2023-10-10 07:36:50,770 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=286346.6666666667, ans=0.0 2023-10-10 07:36:55,864 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.60 vs. limit=6.0 2023-10-10 07:37:06,029 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 07:37:13,644 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=286393.3333333333, ans=0.0 2023-10-10 07:37:16,327 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=286440.0, ans=0.1 2023-10-10 07:37:37,770 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=286533.3333333333, ans=0.0 2023-10-10 07:37:38,733 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=286533.3333333333, ans=0.125 2023-10-10 07:37:52,380 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 07:38:46,430 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.294e+02 1.794e+02 2.069e+02 2.480e+02 4.665e+02, threshold=4.138e+02, percent-clipped=2.0 2023-10-10 07:38:59,845 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=286813.3333333333, ans=0.2 2023-10-10 07:39:17,110 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.03 vs. limit=15.0 2023-10-10 07:39:27,978 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.73 vs. 
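limit=22.5

The optim.py:471 records summarize the distribution of recently observed gradient norms: five quantiles (apparently min, 25%, median, 75%, max), a clipping threshold, and the percentage of recent updates that exceeded it. In every such record here the threshold equals Clipping_scale (2.0) times the middle value, up to rounding in the last digit, e.g. 2 x 1.910e+02 = 3.820e+02 a few lines up, which suggests clipping at twice the median of the tracked norms. A sketch of that bookkeeping is below; the window size and the exact rule are assumptions for illustration, not a transcription of optim.py:

    import torch

    def clip_by_tracked_norms(params, norm_history, clipping_scale=2.0, window=128):
        """Track recent grad norms and clip to clipping_scale * their median."""
        grads = [p.grad for p in params if p.grad is not None]
        total_norm = torch.norm(torch.stack([g.norm() for g in grads]))
        norm_history.append(total_norm.item())
        del norm_history[:-window]  # keep only the most recent window of norms
        hist = torch.tensor(norm_history)
        quartiles = torch.quantile(hist, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = clipping_scale * quartiles[2].item()  # 2.0 * median, as in the log
        clipped = total_norm.item() > threshold
        if clipped:  # rescale gradients in place so their total norm equals threshold
            for g in grads:
                g.mul_(threshold / total_norm)
        return quartiles, threshold, clipped
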
2023-10-10 07:39:50,268 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=287000.0, ans=0.0 2023-10-10 07:39:55,680 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=287046.6666666667, ans=0.0 2023-10-10 07:40:23,575 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=287140.0, ans=0.2 2023-10-10 07:40:24,690 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=287140.0, ans=0.04949747468305833 2023-10-10 07:40:35,176 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 07:40:37,151 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.10 vs. limit=22.5 2023-10-10 07:40:37,636 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.397e+02 1.752e+02 2.177e+02 2.583e+02 3.713e+02, threshold=4.354e+02, percent-clipped=0.0 2023-10-10 07:40:48,085 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_na.min_abs, batch_count=287280.0, ans=0.02 2023-10-10 07:41:08,656 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.15 vs. limit=12.0 2023-10-10 07:41:37,324 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.61 vs. limit=6.0 2023-10-10 07:41:51,277 INFO [train.py:1031] (3/4) Epoch 5, batch 7000, loss[loss=0.2488, simple_loss=0.3267, pruned_loss=0.0854, over 16811.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3163, pruned_loss=0.07625, over 31826033.99 frames. ], batch size: 175, lr: 7.70e-03, grad_scale: 32.0 2023-10-10 07:41:51,487 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=287560.0, ans=0.025 2023-10-10 07:41:54,414 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.73 vs. limit=15.0 2023-10-10 07:42:06,096 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=287606.6666666667, ans=0.125 2023-10-10 07:42:24,506 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.49 vs. limit=15.0 2023-10-10 07:42:28,432 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.539e+02 1.830e+02 2.100e+02 2.328e+02 3.721e+02, threshold=4.200e+02, percent-clipped=0.0 2023-10-10 07:42:32,282 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=287700.0, ans=0.125 2023-10-10 07:42:38,041 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.35 vs. 
limit=15.0 2023-10-10 07:42:40,688 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=287746.6666666667, ans=0.125 2023-10-10 07:42:44,562 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=287746.6666666667, ans=0.125 2023-10-10 07:42:58,773 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=287793.3333333333, ans=0.2 2023-10-10 07:43:04,947 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=287840.0, ans=0.1 2023-10-10 07:43:16,117 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=287886.6666666667, ans=0.1 2023-10-10 07:43:27,649 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.73 vs. limit=12.0 2023-10-10 07:43:30,931 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=287980.0, ans=0.1 2023-10-10 07:43:35,005 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.01 vs. limit=22.5 2023-10-10 07:43:48,949 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.37 vs. limit=15.0 2023-10-10 07:44:09,743 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=288120.0, ans=0.2 2023-10-10 07:44:09,829 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=288120.0, ans=0.0 2023-10-10 07:44:15,480 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.448e+02 1.756e+02 2.098e+02 2.489e+02 3.991e+02, threshold=4.195e+02, percent-clipped=0.0 2023-10-10 07:44:41,656 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=288260.0, ans=0.0 2023-10-10 07:44:54,725 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.41 vs. 
limit=12.0 2023-10-10 07:46:15,431 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.400e+02 1.834e+02 2.018e+02 2.506e+02 4.228e+02, threshold=4.035e+02, percent-clipped=1.0 2023-10-10 07:46:36,483 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=288680.0, ans=0.125 2023-10-10 07:46:39,526 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=288680.0, ans=0.125 2023-10-10 07:47:01,205 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=288773.3333333333, ans=0.0 2023-10-10 07:47:17,269 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=288866.6666666667, ans=0.04949747468305833 2023-10-10 07:47:21,420 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=288866.6666666667, ans=0.125 2023-10-10 07:47:23,530 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.53 vs. limit=15.0 2023-10-10 07:47:28,217 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=288913.3333333333, ans=0.0 2023-10-10 07:47:58,079 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=289006.6666666667, ans=0.0 2023-10-10 07:48:15,170 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.379e+02 1.697e+02 1.903e+02 2.208e+02 3.102e+02, threshold=3.806e+02, percent-clipped=0.0 2023-10-10 07:48:50,771 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.77 vs. limit=22.5 2023-10-10 07:48:58,093 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=289240.0, ans=0.0 2023-10-10 07:49:10,544 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=289333.3333333333, ans=0.1 2023-10-10 07:49:11,692 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=289333.3333333333, ans=0.1 2023-10-10 07:49:12,631 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=289333.3333333333, ans=0.125 2023-10-10 07:49:21,736 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.29 vs. limit=15.0 2023-10-10 07:49:32,266 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.69 vs. 
limit=6.0 2023-10-10 07:49:37,653 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=289426.6666666667, ans=0.125 2023-10-10 07:49:51,049 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=289473.3333333333, ans=0.0 2023-10-10 07:50:02,488 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=289520.0, ans=0.125 2023-10-10 07:50:04,524 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.367e+02 1.737e+02 1.948e+02 2.143e+02 3.215e+02, threshold=3.897e+02, percent-clipped=0.0 2023-10-10 07:50:21,154 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.76 vs. limit=10.0 2023-10-10 07:50:27,192 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=289660.0, ans=0.1 2023-10-10 07:50:48,748 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=289753.3333333333, ans=0.2 2023-10-10 07:51:09,259 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_na.min_abs, batch_count=289846.6666666667, ans=0.02 2023-10-10 07:51:14,745 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.14 vs. limit=12.0 2023-10-10 07:51:18,481 INFO [train.py:1031] (3/4) Epoch 5, batch 7500, loss[loss=0.2136, simple_loss=0.2876, pruned_loss=0.0698, over 16586.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.316, pruned_loss=0.07624, over 32017168.73 frames. ], batch size: 56, lr: 7.67e-03, grad_scale: 32.0 2023-10-10 07:51:20,659 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=289893.3333333333, ans=0.0 2023-10-10 07:51:27,357 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.16 vs. 
limit=22.5 2023-10-10 07:51:34,968 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=289940.0, ans=0.125 2023-10-10 07:51:39,292 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=289986.6666666667, ans=0.125 2023-10-10 07:51:47,583 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten.whitening_limit, batch_count=289986.6666666667, ans=15.0 2023-10-10 07:51:50,342 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.384e+02 1.794e+02 1.999e+02 2.279e+02 3.344e+02, threshold=3.997e+02, percent-clipped=0.0 2023-10-10 07:51:50,673 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=290033.3333333333, ans=0.1 2023-10-10 07:52:00,618 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=290080.0, ans=0.1 2023-10-10 07:52:25,326 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=290173.3333333333, ans=0.125 2023-10-10 07:52:35,249 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=290220.0, ans=0.2 2023-10-10 07:52:41,828 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=290220.0, ans=0.1 2023-10-10 07:52:48,024 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=290266.6666666667, ans=0.0 2023-10-10 07:52:59,436 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=290313.3333333333, ans=0.125 2023-10-10 07:53:07,976 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=290360.0, ans=10.0 2023-10-10 07:53:21,993 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=290406.6666666667, ans=0.125 2023-10-10 07:53:23,339 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=290406.6666666667, ans=0.2 2023-10-10 07:53:48,228 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.270e+02 1.710e+02 1.882e+02 2.087e+02 2.889e+02, threshold=3.764e+02, percent-clipped=0.0 2023-10-10 07:53:49,242 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=290500.0, ans=0.125 2023-10-10 07:54:19,729 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=290593.3333333333, ans=0.025 2023-10-10 07:54:24,358 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 07:54:56,561 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=290780.0, ans=0.125 2023-10-10 07:55:00,474 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.64 vs. 
limit=15.0 2023-10-10 07:55:34,988 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=290920.0, ans=0.125 2023-10-10 07:55:37,688 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=290920.0, ans=0.0 2023-10-10 07:55:38,033 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.72 vs. limit=15.0 2023-10-10 07:55:42,403 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.433e+02 1.742e+02 1.964e+02 2.310e+02 3.096e+02, threshold=3.928e+02, percent-clipped=0.0 2023-10-10 07:55:57,878 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.64 vs. limit=15.0 2023-10-10 07:56:33,208 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=291200.0, ans=0.07 2023-10-10 07:56:56,256 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=291293.3333333333, ans=0.1 2023-10-10 07:57:01,784 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.95 vs. limit=22.5 2023-10-10 07:57:01,975 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.02 vs. limit=15.0 2023-10-10 07:57:02,678 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=291293.3333333333, ans=0.125 2023-10-10 07:57:31,566 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.426e+02 1.758e+02 1.955e+02 2.122e+02 3.376e+02, threshold=3.911e+02, percent-clipped=0.0 2023-10-10 07:57:41,237 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=291433.3333333333, ans=0.125 2023-10-10 07:57:57,587 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=291526.6666666667, ans=0.0 2023-10-10 07:57:58,706 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.53 vs. limit=22.5 2023-10-10 07:58:02,571 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.60 vs. limit=15.0 2023-10-10 07:58:07,969 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=291573.3333333333, ans=0.035 2023-10-10 07:58:11,985 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.40 vs. limit=6.0 2023-10-10 07:58:15,538 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=291620.0, ans=0.125 2023-10-10 07:58:23,674 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.83 vs. 
limit=10.0 2023-10-10 07:58:37,715 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 07:58:37,914 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.65 vs. limit=6.0 2023-10-10 07:58:42,445 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=291713.3333333333, ans=0.0 2023-10-10 07:59:24,893 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.359e+02 1.735e+02 1.905e+02 2.268e+02 3.127e+02, threshold=3.809e+02, percent-clipped=0.0 2023-10-10 07:59:40,143 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.15 vs. limit=15.0 2023-10-10 07:59:53,140 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=291993.3333333333, ans=0.1 2023-10-10 08:00:00,782 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=292040.0, ans=0.2 2023-10-10 08:00:22,329 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=292133.3333333333, ans=0.2 2023-10-10 08:00:26,313 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=292133.3333333333, ans=0.1 2023-10-10 08:00:45,262 INFO [train.py:1031] (3/4) Epoch 5, batch 8000, loss[loss=0.2339, simple_loss=0.3075, pruned_loss=0.08018, over 16059.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3152, pruned_loss=0.07552, over 32192474.62 frames. ], batch size: 296, lr: 7.64e-03, grad_scale: 64.0 2023-10-10 08:00:53,719 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=292226.6666666667, ans=0.0 2023-10-10 08:01:02,152 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.93 vs. limit=15.0 2023-10-10 08:01:16,671 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.332e+02 1.612e+02 1.755e+02 2.017e+02 2.749e+02, threshold=3.510e+02, percent-clipped=0.0 2023-10-10 08:01:23,063 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.26 vs. 
limit=15.0 2023-10-10 08:01:29,136 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=292413.3333333333, ans=0.125 2023-10-10 08:01:30,599 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=292413.3333333333, ans=0.125 2023-10-10 08:01:42,238 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=292460.0, ans=10.0 2023-10-10 08:02:02,573 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=292553.3333333333, ans=0.125 2023-10-10 08:02:04,603 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=292553.3333333333, ans=0.125 2023-10-10 08:02:11,301 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=292600.0, ans=0.2 2023-10-10 08:02:17,583 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=292600.0, ans=0.125 2023-10-10 08:02:27,251 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=292646.6666666667, ans=0.125 2023-10-10 08:02:39,468 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=292693.3333333333, ans=0.125 2023-10-10 08:03:03,099 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.335e+02 1.804e+02 2.084e+02 2.499e+02 4.118e+02, threshold=4.169e+02, percent-clipped=4.0 2023-10-10 08:03:31,628 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.79 vs. limit=10.0 2023-10-10 08:03:33,444 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=292926.6666666667, ans=0.125 2023-10-10 08:03:51,033 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=292973.3333333333, ans=0.1 2023-10-10 08:03:58,033 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=292973.3333333333, ans=0.125 2023-10-10 08:04:09,121 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=293020.0, ans=0.2 2023-10-10 08:04:28,438 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.73 vs. limit=6.0 2023-10-10 08:04:29,048 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 08:04:44,177 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.70 vs. 
limit=15.0 2023-10-10 08:04:44,790 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=293160.0, ans=0.1 2023-10-10 08:05:02,358 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=293206.6666666667, ans=0.05 2023-10-10 08:05:10,168 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=293253.3333333333, ans=0.125 2023-10-10 08:05:10,423 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.61 vs. limit=15.0 2023-10-10 08:05:14,450 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=293300.0, ans=0.125 2023-10-10 08:05:14,989 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.368e+02 1.807e+02 2.104e+02 2.432e+02 3.171e+02, threshold=4.208e+02, percent-clipped=0.0 2023-10-10 08:05:18,993 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=293300.0, ans=0.2 2023-10-10 08:05:47,812 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=293440.0, ans=0.125 2023-10-10 08:05:57,884 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=293486.6666666667, ans=0.1 2023-10-10 08:06:10,570 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=293533.3333333333, ans=0.125 2023-10-10 08:06:39,729 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=293626.6666666667, ans=0.09899494936611666 2023-10-10 08:06:46,763 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=293673.3333333333, ans=0.0 2023-10-10 08:06:47,566 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=293673.3333333333, ans=0.035 2023-10-10 08:06:56,830 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=1.97 vs. 
limit=15.0 2023-10-10 08:07:04,636 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=293766.6666666667, ans=0.0 2023-10-10 08:07:06,961 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.485e+02 1.725e+02 1.905e+02 2.247e+02 3.014e+02, threshold=3.810e+02, percent-clipped=0.0 2023-10-10 08:07:08,289 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=293766.6666666667, ans=0.125 2023-10-10 08:07:09,030 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=293766.6666666667, ans=0.95 2023-10-10 08:07:27,958 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=293860.0, ans=0.1 2023-10-10 08:07:39,415 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=293906.6666666667, ans=0.125 2023-10-10 08:07:46,825 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=293906.6666666667, ans=0.125 2023-10-10 08:08:14,390 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=294046.6666666667, ans=0.0 2023-10-10 08:08:40,682 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 08:09:01,133 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.464e+02 1.759e+02 1.939e+02 2.509e+02 3.597e+02, threshold=3.877e+02, percent-clipped=0.0 2023-10-10 08:09:01,513 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=294233.3333333333, ans=0.04949747468305833 2023-10-10 08:09:27,986 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=294326.6666666667, ans=0.125 2023-10-10 08:09:39,256 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=294373.3333333333, ans=0.125 2023-10-10 08:09:50,576 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=294420.0, ans=0.2 2023-10-10 08:10:02,611 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=294466.6666666667, ans=0.2 2023-10-10 08:10:11,269 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=294513.3333333333, ans=0.125 2023-10-10 08:10:19,468 INFO [train.py:1031] (3/4) Epoch 5, batch 8500, loss[loss=0.2642, simple_loss=0.3405, pruned_loss=0.09393, over 16600.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3151, pruned_loss=0.07526, over 32302951.19 frames. 
], batch size: 219, lr: 7.61e-03, grad_scale: 32.0 2023-10-10 08:10:19,746 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=294560.0, ans=0.125 2023-10-10 08:10:28,501 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=294560.0, ans=0.125 2023-10-10 08:10:55,016 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.409e+02 1.854e+02 2.059e+02 2.413e+02 3.671e+02, threshold=4.118e+02, percent-clipped=0.0 2023-10-10 08:11:15,275 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.07 vs. limit=22.5 2023-10-10 08:11:24,768 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=294793.3333333333, ans=0.0 2023-10-10 08:11:36,977 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.67 vs. limit=15.0 2023-10-10 08:11:43,159 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=294886.6666666667, ans=0.2 2023-10-10 08:11:45,034 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=294886.6666666667, ans=0.125 2023-10-10 08:12:21,646 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.73 vs. limit=15.0 2023-10-10 08:12:28,817 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.93 vs. limit=15.0 2023-10-10 08:12:37,726 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.70 vs. limit=15.0 2023-10-10 08:12:47,651 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=26.92 vs. limit=22.5 2023-10-10 08:12:48,271 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=295120.0, ans=0.04949747468305833 2023-10-10 08:12:51,841 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.285e+02 1.730e+02 2.024e+02 2.456e+02 4.460e+02, threshold=4.048e+02, percent-clipped=2.0 2023-10-10 08:12:53,069 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=295166.6666666667, ans=0.0 2023-10-10 08:12:53,997 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=295166.6666666667, ans=0.125 2023-10-10 08:13:01,040 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=295213.3333333333, ans=0.125 2023-10-10 08:13:01,333 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.00 vs. 
limit=15.0 2023-10-10 08:13:03,711 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=295213.3333333333, ans=0.0 2023-10-10 08:13:03,802 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=295213.3333333333, ans=0.2 2023-10-10 08:13:44,070 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=295353.3333333333, ans=0.07 2023-10-10 08:14:12,488 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=295446.6666666667, ans=0.125 2023-10-10 08:14:19,690 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.575e-02 2023-10-10 08:14:26,222 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=295540.0, ans=0.0 2023-10-10 08:14:48,918 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=295586.6666666667, ans=0.125 2023-10-10 08:14:48,990 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=295586.6666666667, ans=0.125 2023-10-10 08:14:53,653 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.377e+02 1.694e+02 1.867e+02 2.096e+02 3.661e+02, threshold=3.734e+02, percent-clipped=0.0 2023-10-10 08:15:18,667 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=295726.6666666667, ans=0.0 2023-10-10 08:15:39,602 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=295773.3333333333, ans=0.1 2023-10-10 08:15:39,748 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.03 vs. limit=15.0 2023-10-10 08:15:51,387 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=295866.6666666667, ans=0.0 2023-10-10 08:16:00,294 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=295866.6666666667, ans=0.0 2023-10-10 08:16:17,175 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=295960.0, ans=0.125 2023-10-10 08:16:29,519 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=296006.6666666667, ans=10.0 2023-10-10 08:16:29,722 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.06 vs. limit=22.5 2023-10-10 08:16:48,598 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.181e+02 1.705e+02 1.934e+02 2.556e+02 3.710e+02, threshold=3.868e+02, percent-clipped=0.0 2023-10-10 08:17:02,934 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=296146.6666666667, ans=0.125 2023-10-10 08:17:07,809 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.20 vs. 
limit=10.0 2023-10-10 08:17:12,338 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=296193.3333333333, ans=0.0 2023-10-10 08:17:12,563 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=15.43 vs. limit=22.5 2023-10-10 08:17:22,229 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=296240.0, ans=0.2 2023-10-10 08:17:38,257 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=296286.6666666667, ans=0.2 2023-10-10 08:17:44,747 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=4.04 vs. limit=12.0 2023-10-10 08:17:59,877 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=296380.0, ans=0.0 2023-10-10 08:18:00,910 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=296380.0, ans=0.1 2023-10-10 08:18:07,439 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.84 vs. limit=22.5 2023-10-10 08:18:30,435 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=296520.0, ans=0.125 2023-10-10 08:18:36,537 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=296566.6666666667, ans=0.125 2023-10-10 08:18:37,253 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.503e+02 1.745e+02 1.930e+02 2.121e+02 3.550e+02, threshold=3.859e+02, percent-clipped=0.0 2023-10-10 08:18:48,414 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-10 08:19:01,653 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=296660.0, ans=0.0 2023-10-10 08:19:05,910 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=296706.6666666667, ans=0.125 2023-10-10 08:19:07,068 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.53 vs. 
limit=15.0 2023-10-10 08:19:25,616 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=296753.3333333333, ans=0.125 2023-10-10 08:19:25,669 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=296753.3333333333, ans=0.0 2023-10-10 08:19:29,829 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=296800.0, ans=0.0 2023-10-10 08:19:39,604 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=296846.6666666667, ans=0.0 2023-10-10 08:19:40,496 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=296846.6666666667, ans=0.1 2023-10-10 08:19:40,929 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.whiten.whitening_limit, batch_count=296846.6666666667, ans=15.0 2023-10-10 08:19:51,023 INFO [train.py:1031] (3/4) Epoch 5, batch 9000, loss[loss=0.2403, simple_loss=0.3191, pruned_loss=0.08073, over 15488.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3143, pruned_loss=0.07475, over 32419640.04 frames. ], batch size: 35, lr: 7.58e-03, grad_scale: 16.0 2023-10-10 08:20:03,415 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.52 vs. limit=15.0 2023-10-10 08:20:07,079 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=296940.0, ans=0.2 2023-10-10 08:20:11,759 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=296986.6666666667, ans=0.2 2023-10-10 08:20:12,830 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=296986.6666666667, ans=0.125 2023-10-10 08:20:21,719 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.08 vs. limit=6.0 2023-10-10 08:20:24,012 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.78 vs. 
limit=15.0 2023-10-10 08:20:26,241 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.447e+02 1.828e+02 1.977e+02 2.231e+02 3.067e+02, threshold=3.954e+02, percent-clipped=0.0 2023-10-10 08:20:32,100 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=297080.0, ans=0.125 2023-10-10 08:20:35,038 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=297080.0, ans=0.125 2023-10-10 08:20:36,058 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=297080.0, ans=0.1 2023-10-10 08:20:45,138 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=297126.6666666667, ans=0.2 2023-10-10 08:20:49,827 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=297126.6666666667, ans=0.125 2023-10-10 08:20:55,605 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=297173.3333333333, ans=0.0 2023-10-10 08:21:02,795 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.66 vs. limit=15.0 2023-10-10 08:21:20,624 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=297266.6666666667, ans=0.125 2023-10-10 08:21:23,988 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=297313.3333333333, ans=0.125 2023-10-10 08:21:31,799 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.88 vs. limit=22.5 2023-10-10 08:21:33,404 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=297313.3333333333, ans=0.0 2023-10-10 08:21:44,046 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=297406.6666666667, ans=0.125 2023-10-10 08:21:45,880 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=297406.6666666667, ans=0.125 2023-10-10 08:21:47,437 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=297406.6666666667, ans=0.125 2023-10-10 08:22:05,200 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=297500.0, ans=0.125 2023-10-10 08:22:05,261 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=297500.0, ans=0.0 2023-10-10 08:22:08,455 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.63 vs. 
limit=15.0 2023-10-10 08:22:08,799 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.408e+02 1.704e+02 1.862e+02 2.145e+02 3.676e+02, threshold=3.723e+02, percent-clipped=0.0 2023-10-10 08:22:26,191 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=297593.3333333333, ans=0.2 2023-10-10 08:22:32,680 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.41 vs. limit=22.5 2023-10-10 08:23:08,710 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.25 vs. limit=15.0 2023-10-10 08:23:22,445 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=297826.6666666667, ans=0.125 2023-10-10 08:23:26,961 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=297873.3333333333, ans=0.2 2023-10-10 08:23:27,996 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=297873.3333333333, ans=0.0 2023-10-10 08:23:28,900 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=297873.3333333333, ans=0.04949747468305833 2023-10-10 08:23:29,016 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=297873.3333333333, ans=0.0 2023-10-10 08:23:48,023 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=297966.6666666667, ans=0.1 2023-10-10 08:23:50,024 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=297966.6666666667, ans=0.125 2023-10-10 08:23:51,982 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.393e+02 1.899e+02 2.023e+02 2.346e+02 3.303e+02, threshold=4.046e+02, percent-clipped=0.0 2023-10-10 08:24:09,055 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.07 vs. limit=15.0 2023-10-10 08:24:22,830 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=298106.6666666667, ans=0.125 2023-10-10 08:24:35,700 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.04 vs. limit=6.0 2023-10-10 08:25:18,045 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=1.97 vs. 
limit=15.0 2023-10-10 08:25:30,137 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=298386.6666666667, ans=0.125 2023-10-10 08:25:34,235 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=298433.3333333333, ans=0.95 2023-10-10 08:25:37,046 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.367e+02 1.800e+02 1.994e+02 2.234e+02 3.160e+02, threshold=3.987e+02, percent-clipped=0.0 2023-10-10 08:25:44,340 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=298433.3333333333, ans=0.125 2023-10-10 08:25:44,425 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=298433.3333333333, ans=0.0 2023-10-10 08:25:50,983 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.11 vs. limit=15.0 2023-10-10 08:26:27,376 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_na.min_abs, batch_count=298620.0, ans=0.02 2023-10-10 08:26:36,702 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.34 vs. limit=15.0 2023-10-10 08:26:43,888 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=298666.6666666667, ans=0.0 2023-10-10 08:27:14,854 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=298806.6666666667, ans=0.125 2023-10-10 08:27:15,140 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.64 vs. limit=6.0 2023-10-10 08:27:26,148 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=298853.3333333333, ans=0.1 2023-10-10 08:27:39,944 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.428e+02 1.875e+02 2.129e+02 2.516e+02 3.765e+02, threshold=4.257e+02, percent-clipped=0.0 2023-10-10 08:28:00,845 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=298993.3333333333, ans=0.125 2023-10-10 08:28:02,311 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=298993.3333333333, ans=0.0 2023-10-10 08:28:03,406 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=298993.3333333333, ans=0.125 2023-10-10 08:28:09,562 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.24 vs. 
limit=15.0 2023-10-10 08:28:14,485 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=299040.0, ans=0.125 2023-10-10 08:28:14,506 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=299040.0, ans=0.1 2023-10-10 08:28:15,658 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=299040.0, ans=0.125 2023-10-10 08:28:38,800 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=299133.3333333333, ans=0.1 2023-10-10 08:28:49,032 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.12 vs. limit=15.0 2023-10-10 08:28:52,011 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=299180.0, ans=0.2 2023-10-10 08:28:57,000 INFO [train.py:1031] (3/4) Epoch 5, batch 9500, loss[loss=0.2213, simple_loss=0.3016, pruned_loss=0.07046, over 16534.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3149, pruned_loss=0.07492, over 32504954.86 frames. ], batch size: 56, lr: 7.55e-03, grad_scale: 32.0 2023-10-10 08:29:00,490 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.41 vs. limit=6.0 2023-10-10 08:29:13,480 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=299273.3333333333, ans=0.1 2023-10-10 08:29:25,144 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=299320.0, ans=0.0 2023-10-10 08:29:27,544 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=299320.0, ans=0.125 2023-10-10 08:29:31,644 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.334e+02 1.714e+02 1.907e+02 2.140e+02 2.989e+02, threshold=3.815e+02, percent-clipped=0.0 2023-10-10 08:29:32,327 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.42 vs. limit=15.0 2023-10-10 08:29:35,192 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=299366.6666666667, ans=0.0 2023-10-10 08:30:02,556 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.19 vs. 
limit=15.0 2023-10-10 08:30:12,933 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=299553.3333333333, ans=0.1 2023-10-10 08:30:51,304 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=299693.3333333333, ans=0.0 2023-10-10 08:30:52,590 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=299693.3333333333, ans=0.0 2023-10-10 08:30:53,924 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=299693.3333333333, ans=0.125 2023-10-10 08:30:58,426 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=299740.0, ans=0.2 2023-10-10 08:31:04,871 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=299740.0, ans=0.125 2023-10-10 08:31:09,037 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=299786.6666666667, ans=0.0 2023-10-10 08:31:15,587 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.01 vs. limit=15.0 2023-10-10 08:31:20,092 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=299833.3333333333, ans=0.0 2023-10-10 08:31:22,597 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.440e+02 1.772e+02 1.968e+02 2.304e+02 4.066e+02, threshold=3.936e+02, percent-clipped=1.0 2023-10-10 08:31:25,844 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=299833.3333333333, ans=0.0 2023-10-10 08:31:43,054 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=299926.6666666667, ans=0.125 2023-10-10 08:31:44,072 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=299926.6666666667, ans=0.2 2023-10-10 08:31:53,150 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=299973.3333333333, ans=0.125 2023-10-10 08:31:55,754 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=299973.3333333333, ans=0.125 2023-10-10 08:31:59,717 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=299973.3333333333, ans=0.1 2023-10-10 08:32:02,681 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=299973.3333333333, ans=0.1 2023-10-10 08:32:14,913 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=300020.0, ans=0.0 2023-10-10 08:32:18,044 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.35 vs. limit=22.5 2023-10-10 08:32:20,458 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.48 vs. 
limit=15.0 2023-10-10 08:32:38,510 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=300160.0, ans=0.1 2023-10-10 08:32:38,778 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.71 vs. limit=15.0 2023-10-10 08:33:09,070 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.90 vs. limit=15.0 2023-10-10 08:33:13,049 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.437e+02 1.653e+02 1.883e+02 2.134e+02 3.924e+02, threshold=3.765e+02, percent-clipped=0.0 2023-10-10 08:33:17,432 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=300300.0, ans=0.0 2023-10-10 08:33:19,242 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=300346.6666666667, ans=0.125 2023-10-10 08:33:24,859 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=300346.6666666667, ans=0.0 2023-10-10 08:33:24,982 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.46 vs. limit=6.0 2023-10-10 08:33:28,861 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.72 vs. limit=12.0 2023-10-10 08:33:47,394 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=300440.0, ans=0.0 2023-10-10 08:33:50,790 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=300440.0, ans=0.125 2023-10-10 08:33:58,893 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=300486.6666666667, ans=0.025 2023-10-10 08:34:01,558 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=300486.6666666667, ans=0.125 2023-10-10 08:34:05,057 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=300533.3333333333, ans=0.125 2023-10-10 08:34:19,865 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=300580.0, ans=15.0 2023-10-10 08:34:26,331 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=300580.0, ans=0.125 2023-10-10 08:34:32,391 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=300626.6666666667, ans=0.05 2023-10-10 08:34:58,555 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.22 vs. 
limit=22.5 2023-10-10 08:35:02,337 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.431e+02 1.736e+02 1.970e+02 2.215e+02 3.451e+02, threshold=3.940e+02, percent-clipped=0.0 2023-10-10 08:35:20,975 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.60 vs. limit=15.0 2023-10-10 08:35:33,462 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=300906.6666666667, ans=0.09899494936611666 2023-10-10 08:35:46,159 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=300953.3333333333, ans=0.125 2023-10-10 08:35:56,686 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=301000.0, ans=0.0 2023-10-10 08:36:03,975 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=301046.6666666667, ans=0.07 2023-10-10 08:36:04,885 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=301046.6666666667, ans=0.125 2023-10-10 08:36:06,002 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=301046.6666666667, ans=0.0 2023-10-10 08:36:27,755 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer_na.min_abs, batch_count=301140.0, ans=0.02 2023-10-10 08:36:34,568 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.15 vs. limit=15.0 2023-10-10 08:36:49,849 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.355e+02 1.695e+02 1.862e+02 2.114e+02 3.045e+02, threshold=3.724e+02, percent-clipped=0.0 2023-10-10 08:36:52,302 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.16 vs. limit=15.0 2023-10-10 08:37:17,086 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=301373.3333333333, ans=0.125 2023-10-10 08:37:20,526 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=301373.3333333333, ans=0.0 2023-10-10 08:37:46,320 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=301466.6666666667, ans=0.125 2023-10-10 08:37:51,765 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=301513.3333333333, ans=0.2 2023-10-10 08:37:55,185 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=301513.3333333333, ans=0.125 2023-10-10 08:37:58,958 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.581e-02 2023-10-10 08:38:00,550 INFO [train.py:1031] (3/4) Epoch 5, batch 10000, loss[loss=0.2311, simple_loss=0.3157, pruned_loss=0.07323, over 16919.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3142, pruned_loss=0.07466, over 32578651.23 frames. 
], batch size: 93, lr: 7.52e-03, grad_scale: 32.0 2023-10-10 08:38:13,256 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=301606.6666666667, ans=0.0 2023-10-10 08:38:17,706 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=301606.6666666667, ans=0.1 2023-10-10 08:38:20,413 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 08:38:22,411 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.21 vs. limit=15.0 2023-10-10 08:38:33,954 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.379e+02 1.726e+02 1.926e+02 2.197e+02 3.835e+02, threshold=3.852e+02, percent-clipped=1.0 2023-10-10 08:38:53,909 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=301793.3333333333, ans=0.2 2023-10-10 08:39:05,208 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=301840.0, ans=0.125 2023-10-10 08:39:11,808 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=301886.6666666667, ans=0.07 2023-10-10 08:39:21,700 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=301886.6666666667, ans=0.2 2023-10-10 08:39:29,352 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=301933.3333333333, ans=0.2 2023-10-10 08:39:31,061 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=301933.3333333333, ans=0.1 2023-10-10 08:39:39,752 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.04 vs. limit=15.0 2023-10-10 08:39:40,746 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=301980.0, ans=0.125 2023-10-10 08:39:54,945 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=302026.6666666667, ans=0.125 2023-10-10 08:39:58,706 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=302026.6666666667, ans=0.1 2023-10-10 08:40:03,634 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.34 vs. 
limit=15.0 2023-10-10 08:40:04,306 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=302073.3333333333, ans=0.125 2023-10-10 08:40:17,742 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=302120.0, ans=0.09899494936611666 2023-10-10 08:40:23,903 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=302166.6666666667, ans=0.125 2023-10-10 08:40:25,055 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=302166.6666666667, ans=0.1 2023-10-10 08:40:26,409 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.444e+02 1.892e+02 2.164e+02 2.636e+02 3.771e+02, threshold=4.328e+02, percent-clipped=0.0 2023-10-10 08:40:36,287 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=302213.3333333333, ans=0.05 2023-10-10 08:40:59,559 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=302306.6666666667, ans=0.0 2023-10-10 08:41:08,069 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 08:41:19,407 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.37 vs. limit=5.0 2023-10-10 08:41:19,919 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=302400.0, ans=0.2 2023-10-10 08:41:36,642 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=302493.3333333333, ans=0.1 2023-10-10 08:41:44,564 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=302493.3333333333, ans=0.125 2023-10-10 08:42:09,450 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=302586.6666666667, ans=0.0 2023-10-10 08:42:15,668 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=302633.3333333333, ans=0.0 2023-10-10 08:42:16,106 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.76 vs. 
limit=12.0 2023-10-10 08:42:18,117 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.450e+02 1.749e+02 2.002e+02 2.228e+02 3.571e+02, threshold=4.004e+02, percent-clipped=0.0 2023-10-10 08:42:26,620 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=302680.0, ans=0.125 2023-10-10 08:42:27,450 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=302680.0, ans=0.2 2023-10-10 08:42:32,391 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=302680.0, ans=0.1 2023-10-10 08:42:37,333 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 08:42:37,636 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=1.82 vs. limit=15.0 2023-10-10 08:42:38,994 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=302726.6666666667, ans=0.125 2023-10-10 08:42:44,296 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=302726.6666666667, ans=0.125 2023-10-10 08:43:11,379 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=302866.6666666667, ans=0.125 2023-10-10 08:43:32,077 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.09 vs. limit=15.0 2023-10-10 08:43:42,645 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=303006.6666666667, ans=0.2 2023-10-10 08:43:53,921 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=16.79 vs. limit=22.5 2023-10-10 08:44:04,731 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=303100.0, ans=0.2 2023-10-10 08:44:09,301 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.401e+02 1.754e+02 1.917e+02 2.185e+02 3.651e+02, threshold=3.833e+02, percent-clipped=0.0 2023-10-10 08:44:16,092 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=303146.6666666667, ans=0.125 2023-10-10 08:44:27,314 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=303146.6666666667, ans=0.2 2023-10-10 08:44:35,109 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.71 vs. limit=15.0 2023-10-10 08:44:40,820 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.66 vs. 
limit=6.0 2023-10-10 08:44:43,378 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=303240.0, ans=0.0 2023-10-10 08:44:52,923 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=303286.6666666667, ans=0.04949747468305833 2023-10-10 08:45:21,213 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=303380.0, ans=0.0 2023-10-10 08:45:26,502 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 08:45:27,530 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=303426.6666666667, ans=0.125 2023-10-10 08:45:29,255 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.34 vs. limit=22.5 2023-10-10 08:45:33,156 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.67 vs. limit=10.0 2023-10-10 08:45:36,287 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=303473.3333333333, ans=15.0 2023-10-10 08:45:37,929 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=303473.3333333333, ans=0.125 2023-10-10 08:45:41,423 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=303473.3333333333, ans=0.0 2023-10-10 08:45:42,418 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=303473.3333333333, ans=0.125 2023-10-10 08:45:58,912 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.65 vs. limit=5.0 2023-10-10 08:46:04,162 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.338e+02 1.745e+02 1.944e+02 2.297e+02 2.983e+02, threshold=3.888e+02, percent-clipped=0.0 2023-10-10 08:46:26,421 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=303660.0, ans=0.125 2023-10-10 08:46:28,832 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.02 vs. limit=15.0 2023-10-10 08:46:37,999 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=303706.6666666667, ans=0.1 2023-10-10 08:46:41,670 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 08:46:48,118 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.01 vs. limit=15.0 2023-10-10 08:46:54,089 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.67 vs. limit=10.0 2023-10-10 08:47:17,276 INFO [train.py:1031] (3/4) Epoch 5, batch 10500, loss[loss=0.2218, simple_loss=0.3081, pruned_loss=0.06775, over 16963.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3146, pruned_loss=0.07465, over 32659276.88 frames. 
], batch size: 123, lr: 7.50e-03, grad_scale: 32.0 2023-10-10 08:47:22,118 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=303893.3333333333, ans=0.125 2023-10-10 08:47:52,670 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.452e+02 1.710e+02 2.026e+02 2.351e+02 3.946e+02, threshold=4.051e+02, percent-clipped=1.0 2023-10-10 08:47:54,254 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.02 vs. limit=22.5 2023-10-10 08:47:55,826 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=304033.3333333333, ans=0.125 2023-10-10 08:47:58,580 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.79 vs. limit=15.0 2023-10-10 08:48:12,862 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=304126.6666666667, ans=0.0 2023-10-10 08:48:18,859 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=304126.6666666667, ans=0.2 2023-10-10 08:48:27,550 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=304173.3333333333, ans=0.125 2023-10-10 08:48:29,204 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=304173.3333333333, ans=0.0 2023-10-10 08:48:33,734 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=304220.0, ans=0.0 2023-10-10 08:48:53,726 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=304266.6666666667, ans=0.125 2023-10-10 08:49:18,476 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=304360.0, ans=0.0 2023-10-10 08:49:19,348 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=304360.0, ans=0.125 2023-10-10 08:49:38,577 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=304453.3333333333, ans=0.125 2023-10-10 08:49:47,140 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=304500.0, ans=0.2 2023-10-10 08:49:47,962 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=304500.0, ans=0.125 2023-10-10 08:49:51,490 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.401e+02 1.765e+02 2.005e+02 2.243e+02 3.603e+02, threshold=4.010e+02, percent-clipped=0.0 2023-10-10 08:49:57,194 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=304500.0, ans=0.125 2023-10-10 08:50:03,436 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=304546.6666666667, ans=0.125 2023-10-10 08:50:06,792 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=304546.6666666667, ans=0.0 2023-10-10 08:50:07,013 INFO [scaling.py:979] (3/4) Whitening: 
name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.37 vs. limit=15.0 2023-10-10 08:50:24,607 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=304640.0, ans=0.2 2023-10-10 08:50:28,777 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=304640.0, ans=0.0 2023-10-10 08:51:09,753 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.77 vs. limit=22.5 2023-10-10 08:51:19,498 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=304826.6666666667, ans=0.0 2023-10-10 08:51:46,912 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.429e+02 1.820e+02 2.072e+02 2.371e+02 3.456e+02, threshold=4.144e+02, percent-clipped=0.0 2023-10-10 08:51:55,236 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=305013.3333333333, ans=0.125 2023-10-10 08:52:04,990 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 08:52:44,147 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=305200.0, ans=0.125 2023-10-10 08:52:46,645 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=305200.0, ans=0.02 2023-10-10 08:53:06,879 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=305293.3333333333, ans=0.125 2023-10-10 08:53:13,865 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=305340.0, ans=0.125 2023-10-10 08:53:20,491 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=305386.6666666667, ans=0.0 2023-10-10 08:53:23,693 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=305386.6666666667, ans=0.1 2023-10-10 08:53:26,826 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=305386.6666666667, ans=0.125 2023-10-10 08:53:34,234 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.440e+02 1.816e+02 2.097e+02 2.281e+02 3.481e+02, threshold=4.195e+02, percent-clipped=0.0 2023-10-10 08:53:37,211 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.43 vs. limit=15.0 2023-10-10 08:53:37,836 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=305433.3333333333, ans=0.125 2023-10-10 08:53:38,203 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.34 vs. limit=10.0 2023-10-10 08:53:56,822 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.57 vs. 
limit=15.0 2023-10-10 08:54:05,490 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.10 vs. limit=15.0 2023-10-10 08:54:09,501 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.68 vs. limit=15.0 2023-10-10 08:54:15,191 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=305620.0, ans=0.125 2023-10-10 08:54:20,004 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=305620.0, ans=0.125 2023-10-10 08:54:30,661 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=305666.6666666667, ans=0.05 2023-10-10 08:54:34,845 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=305666.6666666667, ans=0.125 2023-10-10 08:54:43,762 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=305713.3333333333, ans=0.07 2023-10-10 08:54:46,770 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 08:54:52,421 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.11 vs. limit=10.0 2023-10-10 08:55:02,881 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=305806.6666666667, ans=0.0 2023-10-10 08:55:03,884 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=305806.6666666667, ans=0.0 2023-10-10 08:55:14,635 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=305853.3333333333, ans=0.0 2023-10-10 08:55:23,879 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.279e+02 1.686e+02 1.922e+02 2.175e+02 3.150e+02, threshold=3.843e+02, percent-clipped=0.0 2023-10-10 08:55:31,038 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=305946.6666666667, ans=0.125 2023-10-10 08:56:03,484 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=306086.6666666667, ans=0.125 2023-10-10 08:56:07,446 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.50 vs. limit=15.0 2023-10-10 08:56:09,441 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.76 vs. limit=15.0 2023-10-10 08:56:12,518 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.40 vs. limit=15.0 2023-10-10 08:56:35,319 INFO [train.py:1031] (3/4) Epoch 5, batch 11000, loss[loss=0.2219, simple_loss=0.3156, pruned_loss=0.06408, over 16905.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3143, pruned_loss=0.07463, over 32657907.12 frames. 
], batch size: 138, lr: 7.47e-03, grad_scale: 32.0 2023-10-10 08:56:39,705 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.72 vs. limit=12.0 2023-10-10 08:57:01,576 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=306320.0, ans=0.07 2023-10-10 08:57:03,533 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=306320.0, ans=0.1 2023-10-10 08:57:04,370 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=306366.6666666667, ans=0.2 2023-10-10 08:57:10,620 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.389e+02 1.787e+02 2.089e+02 2.494e+02 3.681e+02, threshold=4.178e+02, percent-clipped=0.0 2023-10-10 08:57:13,650 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=306366.6666666667, ans=0.125 2023-10-10 08:57:15,499 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=306413.3333333333, ans=0.0 2023-10-10 08:57:21,009 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=306413.3333333333, ans=0.125 2023-10-10 08:57:23,239 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=306413.3333333333, ans=0.125 2023-10-10 08:57:32,915 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=306460.0, ans=0.125 2023-10-10 08:57:52,903 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.88 vs. 
limit=10.0 2023-10-10 08:57:54,225 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=306553.3333333333, ans=0.015 2023-10-10 08:57:57,332 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=306553.3333333333, ans=0.125 2023-10-10 08:58:16,714 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=306646.6666666667, ans=0.1 2023-10-10 08:58:16,741 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=306646.6666666667, ans=0.04949747468305833 2023-10-10 08:58:26,856 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=306693.3333333333, ans=0.0 2023-10-10 08:58:41,198 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=306740.0, ans=0.125 2023-10-10 08:59:06,468 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.304e+02 1.714e+02 1.986e+02 2.320e+02 3.498e+02, threshold=3.972e+02, percent-clipped=0.0 2023-10-10 08:59:07,034 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=306833.3333333333, ans=0.1 2023-10-10 08:59:28,366 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=306926.6666666667, ans=0.0 2023-10-10 08:59:29,434 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=306926.6666666667, ans=0.0 2023-10-10 08:59:31,239 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=306926.6666666667, ans=0.0 2023-10-10 08:59:39,946 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=306973.3333333333, ans=0.125 2023-10-10 09:00:20,006 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=307113.3333333333, ans=0.125 2023-10-10 09:00:29,991 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=307160.0, ans=0.2 2023-10-10 09:00:31,305 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.52 vs. limit=15.0 2023-10-10 09:00:58,836 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.346e+02 1.655e+02 1.862e+02 2.211e+02 3.402e+02, threshold=3.723e+02, percent-clipped=0.0 2023-10-10 09:01:02,942 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=307300.0, ans=0.2 2023-10-10 09:01:06,970 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.96 vs. limit=6.0 2023-10-10 09:01:18,589 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.02 vs. 
limit=15.0 2023-10-10 09:01:21,091 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=307393.3333333333, ans=0.0 2023-10-10 09:01:27,467 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-10 09:01:29,701 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=307440.0, ans=0.0 2023-10-10 09:01:49,222 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=307486.6666666667, ans=0.1 2023-10-10 09:01:49,605 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.96 vs. limit=6.0 2023-10-10 09:02:19,338 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=307626.6666666667, ans=0.2 2023-10-10 09:02:19,349 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=307626.6666666667, ans=0.1 2023-10-10 09:02:27,288 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=307673.3333333333, ans=0.125 2023-10-10 09:02:28,254 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=307673.3333333333, ans=0.125 2023-10-10 09:02:38,230 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 09:02:51,534 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 09:02:52,154 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.375e+02 1.667e+02 1.872e+02 2.190e+02 3.334e+02, threshold=3.745e+02, percent-clipped=0.0 2023-10-10 09:03:08,909 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=307860.0, ans=0.125 2023-10-10 09:03:11,101 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=307860.0, ans=0.0 2023-10-10 09:03:12,788 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=307860.0, ans=0.125 2023-10-10 09:03:13,895 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.49 vs. 
limit=15.0 2023-10-10 09:03:36,764 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=307953.3333333333, ans=0.2 2023-10-10 09:03:44,936 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=308000.0, ans=0.0 2023-10-10 09:04:05,418 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=308093.3333333333, ans=0.0 2023-10-10 09:04:16,419 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=308140.0, ans=0.1 2023-10-10 09:04:25,521 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=308140.0, ans=0.125 2023-10-10 09:04:27,454 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=308186.6666666667, ans=0.125 2023-10-10 09:04:43,979 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.385e+02 1.841e+02 2.114e+02 2.376e+02 2.991e+02, threshold=4.229e+02, percent-clipped=0.0 2023-10-10 09:05:55,099 INFO [train.py:1031] (3/4) Epoch 5, batch 11500, loss[loss=0.2545, simple_loss=0.3358, pruned_loss=0.08661, over 15506.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3138, pruned_loss=0.07439, over 32685010.92 frames. ], batch size: 35, lr: 7.44e-03, grad_scale: 32.0 2023-10-10 09:06:22,245 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=308653.3333333333, ans=0.0 2023-10-10 09:06:30,796 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.478e+02 1.867e+02 1.980e+02 2.225e+02 2.951e+02, threshold=3.960e+02, percent-clipped=0.0 2023-10-10 09:06:32,058 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=308700.0, ans=0.125 2023-10-10 09:06:40,248 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=308746.6666666667, ans=0.0 2023-10-10 09:06:43,146 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=308746.6666666667, ans=0.125 2023-10-10 09:06:43,491 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.16 vs. limit=22.5 2023-10-10 09:06:47,626 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 09:06:50,006 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=308793.3333333333, ans=0.125 2023-10-10 09:07:07,376 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=308840.0, ans=0.125 2023-10-10 09:07:07,379 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=308840.0, ans=0.125 2023-10-10 09:07:08,603 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.78 vs. limit=15.0 2023-10-10 09:07:10,585 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.11 vs. 
limit=15.0 2023-10-10 09:07:15,765 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=308886.6666666667, ans=0.125 2023-10-10 09:07:20,005 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=13.68 vs. limit=22.5 2023-10-10 09:07:21,937 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=7.25 vs. limit=15.0 2023-10-10 09:07:31,670 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=308933.3333333333, ans=0.0 2023-10-10 09:07:44,519 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=308980.0, ans=0.125 2023-10-10 09:07:45,735 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.02 vs. limit=10.0 2023-10-10 09:08:23,040 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=309120.0, ans=0.2 2023-10-10 09:08:26,747 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.87 vs. limit=15.0 2023-10-10 09:08:30,574 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=309166.6666666667, ans=0.125 2023-10-10 09:08:32,656 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.394e+02 1.687e+02 1.838e+02 2.019e+02 3.151e+02, threshold=3.676e+02, percent-clipped=0.0 2023-10-10 09:08:37,282 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=309166.6666666667, ans=0.0 2023-10-10 09:08:42,106 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=309213.3333333333, ans=0.1 2023-10-10 09:08:47,614 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff2.min_abs, batch_count=309260.0, ans=0.1 2023-10-10 09:08:51,144 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=309260.0, ans=0.0 2023-10-10 09:08:51,148 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=309260.0, ans=0.0 2023-10-10 09:09:17,068 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=309353.3333333333, ans=0.07 2023-10-10 09:09:17,983 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=309353.3333333333, ans=0.0 2023-10-10 09:09:39,036 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.79 vs. 
limit=5.0 2023-10-10 09:09:43,485 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=309493.3333333333, ans=0.0 2023-10-10 09:10:14,148 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=309586.6666666667, ans=0.125 2023-10-10 09:10:15,115 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=309633.3333333333, ans=0.125 2023-10-10 09:10:19,589 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.462e+02 1.801e+02 2.107e+02 2.332e+02 3.479e+02, threshold=4.215e+02, percent-clipped=0.0 2023-10-10 09:10:22,155 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=309633.3333333333, ans=0.125 2023-10-10 09:10:51,663 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=309726.6666666667, ans=0.125 2023-10-10 09:11:07,895 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 09:11:20,178 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.03 vs. limit=6.0 2023-10-10 09:11:36,503 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=309913.3333333333, ans=0.2 2023-10-10 09:11:37,552 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.92 vs. limit=6.0 2023-10-10 09:11:40,565 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.74 vs. limit=15.0 2023-10-10 09:11:45,006 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.47 vs. limit=12.0 2023-10-10 09:11:51,658 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=309960.0, ans=0.1 2023-10-10 09:11:53,471 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=309960.0, ans=0.0 2023-10-10 09:12:02,015 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=310006.6666666667, ans=0.125 2023-10-10 09:12:23,034 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=310100.0, ans=0.07 2023-10-10 09:12:27,150 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.350e+02 1.695e+02 1.871e+02 2.202e+02 3.745e+02, threshold=3.741e+02, percent-clipped=0.0 2023-10-10 09:12:51,759 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=310193.3333333333, ans=0.125 2023-10-10 09:12:59,506 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=310240.0, ans=0.2 2023-10-10 09:13:04,963 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.47 vs. 
limit=10.0 2023-10-10 09:13:30,066 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=310380.0, ans=0.125 2023-10-10 09:13:38,932 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=310380.0, ans=0.125 2023-10-10 09:13:42,740 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=310426.6666666667, ans=0.125 2023-10-10 09:13:49,706 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=310426.6666666667, ans=0.2 2023-10-10 09:13:54,546 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=310473.3333333333, ans=0.125 2023-10-10 09:14:06,930 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=310520.0, ans=0.05 2023-10-10 09:14:21,788 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.415e+02 1.731e+02 1.956e+02 2.274e+02 4.291e+02, threshold=3.913e+02, percent-clipped=3.0 2023-10-10 09:14:22,087 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=310566.6666666667, ans=0.125 2023-10-10 09:14:32,878 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=310613.3333333333, ans=0.125 2023-10-10 09:14:39,096 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=310660.0, ans=0.125 2023-10-10 09:14:41,887 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.44 vs. limit=10.0 2023-10-10 09:14:57,800 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=310706.6666666667, ans=0.125 2023-10-10 09:15:05,131 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=310753.3333333333, ans=0.1 2023-10-10 09:15:07,308 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.20 vs. limit=22.5 2023-10-10 09:15:13,285 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=310800.0, ans=0.1 2023-10-10 09:15:29,450 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=310846.6666666667, ans=0.125 2023-10-10 09:15:32,689 INFO [train.py:1031] (3/4) Epoch 5, batch 12000, loss[loss=0.2515, simple_loss=0.334, pruned_loss=0.08453, over 16977.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3134, pruned_loss=0.07378, over 32716258.07 frames. ], batch size: 117, lr: 7.41e-03, grad_scale: 32.0 2023-10-10 09:15:36,467 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.88 vs. 
limit=15.0 2023-10-10 09:15:55,678 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=310986.6666666667, ans=0.125 2023-10-10 09:16:07,878 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=311033.3333333333, ans=0.1 2023-10-10 09:16:09,340 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.332e+02 1.693e+02 1.865e+02 2.201e+02 3.007e+02, threshold=3.730e+02, percent-clipped=0.0 2023-10-10 09:16:14,754 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=311033.3333333333, ans=0.0 2023-10-10 09:16:18,022 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=311080.0, ans=0.125 2023-10-10 09:16:18,296 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.34 vs. limit=22.5 2023-10-10 09:16:37,098 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.16 vs. limit=15.0 2023-10-10 09:16:56,443 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.91 vs. limit=6.0 2023-10-10 09:17:00,731 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=311220.0, ans=0.0 2023-10-10 09:17:52,130 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=311453.3333333333, ans=0.125 2023-10-10 09:18:02,695 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.396e+02 1.809e+02 2.102e+02 2.599e+02 4.198e+02, threshold=4.205e+02, percent-clipped=2.0 2023-10-10 09:18:03,907 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=311500.0, ans=0.125 2023-10-10 09:18:05,663 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=311500.0, ans=0.0 2023-10-10 09:18:16,289 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=311546.6666666667, ans=0.2 2023-10-10 09:18:32,654 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 09:18:36,680 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=311686.6666666667, ans=0.0 2023-10-10 09:18:44,522 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=311686.6666666667, ans=0.125 2023-10-10 09:18:51,185 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=311733.3333333333, ans=0.0 2023-10-10 09:18:52,392 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.36 vs. limit=22.5 2023-10-10 09:18:59,057 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=13.49 vs. 
limit=15.0 2023-10-10 09:19:14,861 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=311826.6666666667, ans=10.0 2023-10-10 09:19:34,628 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=311920.0, ans=0.125 2023-10-10 09:19:44,933 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.386e+02 1.795e+02 1.983e+02 2.315e+02 3.227e+02, threshold=3.967e+02, percent-clipped=0.0 2023-10-10 09:19:50,340 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.17 vs. limit=10.0 2023-10-10 09:19:54,438 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 09:19:55,390 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=312013.3333333333, ans=0.125 2023-10-10 09:20:11,223 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.27 vs. limit=22.5 2023-10-10 09:20:25,996 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=312153.3333333333, ans=0.0 2023-10-10 09:20:45,254 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.17 vs. limit=12.0 2023-10-10 09:21:00,923 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.23 vs. limit=22.5 2023-10-10 09:21:02,562 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=312293.3333333333, ans=0.2 2023-10-10 09:21:30,821 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=312433.3333333333, ans=0.0 2023-10-10 09:21:31,404 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.474e+02 1.829e+02 2.024e+02 2.357e+02 3.379e+02, threshold=4.047e+02, percent-clipped=0.0 2023-10-10 09:21:31,767 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=312433.3333333333, ans=0.2 2023-10-10 09:21:48,252 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=312526.6666666667, ans=0.125 2023-10-10 09:21:50,890 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=312526.6666666667, ans=0.125 2023-10-10 09:22:10,868 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.35 vs. limit=15.0 2023-10-10 09:22:34,288 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.84 vs. limit=15.0 2023-10-10 09:22:38,322 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=312713.3333333333, ans=0.07 2023-10-10 09:22:40,672 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.37 vs. 
limit=15.0 2023-10-10 09:22:46,238 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.04 vs. limit=15.0 2023-10-10 09:22:53,024 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.53 vs. limit=12.0 2023-10-10 09:23:19,495 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.482e+02 1.843e+02 1.953e+02 2.196e+02 2.912e+02, threshold=3.906e+02, percent-clipped=0.0 2023-10-10 09:23:23,909 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=312900.0, ans=0.125 2023-10-10 09:23:31,212 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=312946.6666666667, ans=0.0 2023-10-10 09:23:39,497 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.30 vs. limit=22.5 2023-10-10 09:23:52,072 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=313040.0, ans=0.0 2023-10-10 09:23:57,628 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=313040.0, ans=0.0 2023-10-10 09:24:01,219 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.74 vs. limit=15.0 2023-10-10 09:24:01,227 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.65 vs. limit=15.0 2023-10-10 09:24:04,230 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=313086.6666666667, ans=0.0 2023-10-10 09:24:07,306 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=313086.6666666667, ans=0.125 2023-10-10 09:24:26,202 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.57 vs. limit=15.0 2023-10-10 09:24:33,940 INFO [train.py:1031] (3/4) Epoch 5, batch 12500, loss[loss=0.2189, simple_loss=0.306, pruned_loss=0.06596, over 16935.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3132, pruned_loss=0.07377, over 32754818.97 frames. ], batch size: 77, lr: 7.39e-03, grad_scale: 32.0 2023-10-10 09:24:37,536 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.42 vs. limit=12.0 2023-10-10 09:24:45,805 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=313273.3333333333, ans=0.125 2023-10-10 09:25:10,494 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.417e+02 1.744e+02 1.955e+02 2.361e+02 3.239e+02, threshold=3.909e+02, percent-clipped=0.0 2023-10-10 09:25:11,672 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=313366.6666666667, ans=0.0 2023-10-10 09:25:12,867 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.23 vs. 
limit=15.0 2023-10-10 09:25:20,773 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.91 vs. limit=6.0 2023-10-10 09:25:23,498 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=313413.3333333333, ans=0.125 2023-10-10 09:25:31,180 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=313460.0, ans=0.0 2023-10-10 09:25:34,034 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=313460.0, ans=0.2 2023-10-10 09:25:48,778 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.56 vs. limit=12.0 2023-10-10 09:25:54,108 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.05 vs. limit=15.0 2023-10-10 09:25:59,417 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=313600.0, ans=0.125 2023-10-10 09:26:16,071 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=313646.6666666667, ans=0.0 2023-10-10 09:26:23,070 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=313693.3333333333, ans=0.2 2023-10-10 09:26:43,858 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=313786.6666666667, ans=0.125 2023-10-10 09:26:51,729 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=313786.6666666667, ans=0.0 2023-10-10 09:26:58,646 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=313833.3333333333, ans=0.0 2023-10-10 09:27:00,439 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.305e+02 1.697e+02 1.906e+02 2.139e+02 3.055e+02, threshold=3.813e+02, percent-clipped=0.0 2023-10-10 09:27:05,613 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=313880.0, ans=0.2 2023-10-10 09:27:30,602 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=313973.3333333333, ans=0.125 2023-10-10 09:27:33,236 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=1.89 vs. limit=15.0 2023-10-10 09:27:52,241 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=314066.6666666667, ans=0.0 2023-10-10 09:27:54,702 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=314066.6666666667, ans=0.0 2023-10-10 09:27:59,197 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.17 vs. limit=15.0 2023-10-10 09:28:20,992 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.31 vs. 
limit=22.5 2023-10-10 09:28:24,350 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=314206.6666666667, ans=0.125 2023-10-10 09:28:40,037 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=314253.3333333333, ans=0.125 2023-10-10 09:28:49,762 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.470e+02 1.756e+02 2.008e+02 2.268e+02 4.172e+02, threshold=4.016e+02, percent-clipped=2.0 2023-10-10 09:29:10,148 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=314393.3333333333, ans=0.2 2023-10-10 09:29:20,958 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=314440.0, ans=0.125 2023-10-10 09:29:22,974 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=314440.0, ans=0.2 2023-10-10 09:29:32,461 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 09:30:05,448 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=314626.6666666667, ans=0.1 2023-10-10 09:30:18,630 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=314673.3333333333, ans=0.125 2023-10-10 09:30:28,685 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=314720.0, ans=0.125 2023-10-10 09:30:34,855 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=314766.6666666667, ans=0.125 2023-10-10 09:30:35,380 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.484e+02 1.754e+02 1.941e+02 2.289e+02 4.469e+02, threshold=3.883e+02, percent-clipped=1.0 2023-10-10 09:30:51,031 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=314813.3333333333, ans=0.0 2023-10-10 09:31:22,947 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=314953.3333333333, ans=0.0 2023-10-10 09:31:29,022 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.22 vs. limit=22.5 2023-10-10 09:31:29,798 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.89 vs. limit=15.0 2023-10-10 09:31:29,890 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.03 vs. 
limit=15.0 2023-10-10 09:31:32,534 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=315000.0, ans=0.125 2023-10-10 09:31:57,144 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=315093.3333333333, ans=0.2 2023-10-10 09:32:08,246 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=315140.0, ans=0.0 2023-10-10 09:32:10,100 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=315186.6666666667, ans=0.1 2023-10-10 09:32:10,112 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=315186.6666666667, ans=0.1 2023-10-10 09:32:11,591 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=315186.6666666667, ans=0.125 2023-10-10 09:32:17,936 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.72 vs. limit=15.0 2023-10-10 09:32:26,747 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.470e+02 1.699e+02 1.925e+02 2.188e+02 3.283e+02, threshold=3.849e+02, percent-clipped=0.0 2023-10-10 09:32:43,109 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=315326.6666666667, ans=0.125 2023-10-10 09:32:56,206 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=315373.3333333333, ans=0.125 2023-10-10 09:32:59,985 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=315373.3333333333, ans=0.1 2023-10-10 09:33:02,453 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=315420.0, ans=0.125 2023-10-10 09:33:05,843 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=315420.0, ans=0.0 2023-10-10 09:33:14,343 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=315466.6666666667, ans=0.09899494936611666 2023-10-10 09:33:16,574 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=315466.6666666667, ans=0.0 2023-10-10 09:33:20,144 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.91 vs. limit=15.0 2023-10-10 09:33:22,799 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=315466.6666666667, ans=0.125 2023-10-10 09:33:34,874 INFO [train.py:1031] (3/4) Epoch 5, batch 13000, loss[loss=0.2156, simple_loss=0.3051, pruned_loss=0.06303, over 16427.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3137, pruned_loss=0.07366, over 32790665.42 frames. ], batch size: 50, lr: 7.36e-03, grad_scale: 32.0 2023-10-10 09:33:47,014 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.99 vs. 
limit=6.0 2023-10-10 09:33:47,769 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=315606.6666666667, ans=0.0 2023-10-10 09:33:50,056 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=315606.6666666667, ans=0.125 2023-10-10 09:33:52,233 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=315606.6666666667, ans=0.0 2023-10-10 09:34:03,781 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.95 vs. limit=10.0 2023-10-10 09:34:19,715 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.417e+02 1.805e+02 1.942e+02 2.223e+02 3.286e+02, threshold=3.883e+02, percent-clipped=0.0 2023-10-10 09:34:22,936 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=315700.0, ans=0.125 2023-10-10 09:34:32,419 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=315746.6666666667, ans=0.2 2023-10-10 09:34:49,729 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=315840.0, ans=0.04949747468305833 2023-10-10 09:35:44,663 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=316073.3333333333, ans=0.09899494936611666 2023-10-10 09:35:55,750 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.72 vs. limit=15.0 2023-10-10 09:35:59,992 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=316120.0, ans=0.0 2023-10-10 09:36:00,293 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=3.89 vs. limit=12.0 2023-10-10 09:36:10,598 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.431e+02 1.786e+02 2.080e+02 2.370e+02 3.823e+02, threshold=4.159e+02, percent-clipped=0.0 2023-10-10 09:36:23,150 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.58 vs. limit=15.0 2023-10-10 09:36:36,003 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.60 vs. limit=10.0 2023-10-10 09:36:45,418 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.42 vs. 
limit=22.5 2023-10-10 09:36:45,837 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 09:36:55,634 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=316353.3333333333, ans=0.125 2023-10-10 09:37:03,114 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=316400.0, ans=0.125 2023-10-10 09:37:07,370 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=316400.0, ans=0.125 2023-10-10 09:37:25,227 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 09:37:46,681 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=316586.6666666667, ans=0.2 2023-10-10 09:37:59,180 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.52 vs. limit=22.5 2023-10-10 09:38:04,441 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.477e+02 1.704e+02 1.903e+02 2.308e+02 2.940e+02, threshold=3.806e+02, percent-clipped=0.0 2023-10-10 09:38:07,953 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 09:38:16,036 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.95 vs. limit=6.0 2023-10-10 09:39:04,349 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=316866.6666666667, ans=0.0 2023-10-10 09:39:05,260 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=316913.3333333333, ans=0.125 2023-10-10 09:39:26,468 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=316960.0, ans=0.0 2023-10-10 09:39:44,971 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=317053.3333333333, ans=0.2 2023-10-10 09:39:57,756 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.506e+02 1.923e+02 2.163e+02 2.420e+02 3.729e+02, threshold=4.325e+02, percent-clipped=0.0 2023-10-10 09:40:02,707 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=317146.6666666667, ans=0.2 2023-10-10 09:40:22,776 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.70 vs. 
limit=15.0 2023-10-10 09:40:25,400 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=317240.0, ans=0.04949747468305833 2023-10-10 09:40:27,366 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=317240.0, ans=0.1 2023-10-10 09:40:45,470 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=317333.3333333333, ans=0.2 2023-10-10 09:40:49,388 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.43 vs. limit=15.0 2023-10-10 09:40:52,173 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=317333.3333333333, ans=0.125 2023-10-10 09:40:53,048 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=317333.3333333333, ans=0.125 2023-10-10 09:40:57,346 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=317380.0, ans=10.0 2023-10-10 09:41:06,230 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=317426.6666666667, ans=0.125 2023-10-10 09:41:08,478 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=317426.6666666667, ans=0.125 2023-10-10 09:41:09,822 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=317426.6666666667, ans=0.125 2023-10-10 09:41:27,779 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=317473.3333333333, ans=0.1 2023-10-10 09:41:33,618 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=317520.0, ans=0.125 2023-10-10 09:41:41,685 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=317566.6666666667, ans=0.0 2023-10-10 09:41:45,521 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=317566.6666666667, ans=0.125 2023-10-10 09:41:46,169 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.501e+02 1.829e+02 2.127e+02 2.596e+02 4.066e+02, threshold=4.254e+02, percent-clipped=0.0 2023-10-10 09:41:50,859 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=317613.3333333333, ans=0.125 2023-10-10 09:42:01,480 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=317660.0, ans=0.125 2023-10-10 09:42:02,836 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=317660.0, ans=0.125 2023-10-10 09:42:03,801 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=317660.0, ans=0.1 2023-10-10 09:42:04,023 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.11 vs. 
limit=15.0 2023-10-10 09:42:12,757 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=317706.6666666667, ans=0.0 2023-10-10 09:42:26,362 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=317753.3333333333, ans=0.125 2023-10-10 09:42:35,995 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.17 vs. limit=22.5 2023-10-10 09:42:41,794 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=317846.6666666667, ans=0.0 2023-10-10 09:42:50,445 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=317846.6666666667, ans=0.125 2023-10-10 09:42:52,200 INFO [train.py:1031] (3/4) Epoch 5, batch 13500, loss[loss=0.2502, simple_loss=0.3361, pruned_loss=0.08212, over 16569.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3129, pruned_loss=0.07333, over 32812814.75 frames. ], batch size: 219, lr: 7.33e-03, grad_scale: 32.0 2023-10-10 09:42:52,567 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=317893.3333333333, ans=0.125 2023-10-10 09:42:54,269 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.35 vs. limit=15.0 2023-10-10 09:43:05,303 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=317940.0, ans=0.0 2023-10-10 09:43:26,108 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=318033.3333333333, ans=0.125 2023-10-10 09:43:31,323 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.332e+02 1.700e+02 1.878e+02 2.171e+02 3.207e+02, threshold=3.756e+02, percent-clipped=0.0 2023-10-10 09:43:41,446 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=318080.0, ans=0.125 2023-10-10 09:43:49,660 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=318126.6666666667, ans=0.0 2023-10-10 09:43:52,894 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=318126.6666666667, ans=0.1 2023-10-10 09:43:54,893 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 09:43:59,518 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=318173.3333333333, ans=0.0 2023-10-10 09:44:02,489 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.98 vs. limit=15.0 2023-10-10 09:44:06,037 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=318220.0, ans=0.0 2023-10-10 09:44:21,385 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=15.19 vs. 
limit=15.0 2023-10-10 09:44:39,286 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=13.67 vs. limit=22.5 2023-10-10 09:44:46,534 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.25 vs. limit=15.0 2023-10-10 09:44:56,708 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=318406.6666666667, ans=0.125 2023-10-10 09:45:02,040 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.57 vs. limit=10.0 2023-10-10 09:45:14,633 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.412e+02 1.804e+02 1.958e+02 2.191e+02 3.104e+02, threshold=3.916e+02, percent-clipped=0.0 2023-10-10 09:45:24,111 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.50 vs. limit=15.0 2023-10-10 09:45:24,222 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=318546.6666666667, ans=22.5 2023-10-10 09:45:30,800 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 09:45:31,188 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=12.26 vs. limit=15.0 2023-10-10 09:45:59,552 INFO [train.py:1031] (3/4) Epoch 6, batch 0, loss[loss=0.1826, simple_loss=0.275, pruned_loss=0.04505, over 16898.00 frames. ], tot_loss[loss=0.1826, simple_loss=0.275, pruned_loss=0.04505, over 16898.00 frames. ], batch size: 138, lr: 6.59e-03, grad_scale: 32.0 2023-10-10 09:45:59,553 INFO [train.py:1054] (3/4) Computing validation loss 2023-10-10 09:46:07,217 INFO [train.py:1063] (3/4) Epoch 6, validation: loss=0.2342, simple_loss=0.321, pruned_loss=0.07365, over 1020973.00 frames. 
2023-10-10 09:46:07,218 INFO [train.py:1064] (3/4) Maximum memory allocated so far is 16891MB 2023-10-10 09:46:25,808 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=318663.3333333333, ans=0.125 2023-10-10 09:46:36,133 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=318710.0, ans=0.125 2023-10-10 09:46:43,033 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=318756.6666666667, ans=0.0 2023-10-10 09:46:53,210 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=318803.3333333333, ans=0.1 2023-10-10 09:46:58,038 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=318803.3333333333, ans=0.0 2023-10-10 09:47:05,203 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=318850.0, ans=0.0 2023-10-10 09:47:27,834 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=318943.3333333333, ans=0.035 2023-10-10 09:47:27,853 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=318943.3333333333, ans=0.1 2023-10-10 09:47:32,678 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=318943.3333333333, ans=0.2 2023-10-10 09:47:40,060 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.320e+02 1.657e+02 1.844e+02 2.142e+02 3.757e+02, threshold=3.688e+02, percent-clipped=0.0 2023-10-10 09:47:43,341 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=318990.0, ans=0.125 2023-10-10 09:47:48,904 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=319036.6666666667, ans=0.0 2023-10-10 09:47:55,264 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=319036.6666666667, ans=0.125 2023-10-10 09:47:58,115 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=319036.6666666667, ans=0.125 2023-10-10 09:48:04,385 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=319083.3333333333, ans=0.2 2023-10-10 09:48:09,410 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.04 vs. limit=6.0 2023-10-10 09:48:44,881 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=319270.0, ans=0.125 2023-10-10 09:48:57,674 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=319316.6666666667, ans=0.2 2023-10-10 09:49:02,462 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=319316.6666666667, ans=0.125 2023-10-10 09:49:14,969 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.29 vs. 
limit=15.0 2023-10-10 09:49:27,283 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.317e+02 1.653e+02 1.834e+02 2.114e+02 3.524e+02, threshold=3.667e+02, percent-clipped=0.0 2023-10-10 09:49:45,354 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=319503.3333333333, ans=0.0 2023-10-10 09:49:47,852 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 09:49:51,268 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=319550.0, ans=0.125 2023-10-10 09:49:56,343 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.06 vs. limit=12.0 2023-10-10 09:49:58,972 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=319596.6666666667, ans=0.07 2023-10-10 09:50:02,093 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=319596.6666666667, ans=0.125 2023-10-10 09:50:07,403 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=319643.3333333333, ans=0.2 2023-10-10 09:50:15,980 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=319643.3333333333, ans=0.125 2023-10-10 09:50:25,085 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.22 vs. limit=15.0 2023-10-10 09:50:28,065 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=319690.0, ans=0.1 2023-10-10 09:50:47,042 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=319783.3333333333, ans=0.125 2023-10-10 09:50:49,111 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.79 vs. 
limit=15.0 2023-10-10 09:50:56,099 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=319783.3333333333, ans=0.125 2023-10-10 09:51:02,627 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=319830.0, ans=0.0 2023-10-10 09:51:19,559 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.375e+02 1.707e+02 1.930e+02 2.182e+02 3.631e+02, threshold=3.860e+02, percent-clipped=0.0 2023-10-10 09:51:23,820 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=319923.3333333333, ans=0.125 2023-10-10 09:51:42,725 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=320016.6666666667, ans=0.0 2023-10-10 09:51:48,271 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=320016.6666666667, ans=0.2 2023-10-10 09:51:50,902 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=320063.3333333333, ans=0.125 2023-10-10 09:51:58,007 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=320063.3333333333, ans=0.125 2023-10-10 09:52:04,463 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=320110.0, ans=0.125 2023-10-10 09:52:15,216 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.99 vs. limit=22.5 2023-10-10 09:52:22,307 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=320156.6666666667, ans=0.07 2023-10-10 09:52:29,100 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=320203.3333333333, ans=0.1 2023-10-10 09:52:36,180 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=320250.0, ans=0.125 2023-10-10 09:52:58,010 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=320343.3333333333, ans=0.1 2023-10-10 09:53:02,853 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.05 vs. limit=6.0 2023-10-10 09:53:06,762 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.443e+02 1.707e+02 1.948e+02 2.239e+02 3.046e+02, threshold=3.895e+02, percent-clipped=0.0 2023-10-10 09:53:11,083 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=320390.0, ans=0.0 2023-10-10 09:53:11,116 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=320390.0, ans=0.0 2023-10-10 09:53:13,961 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.46 vs. 
limit=22.5 2023-10-10 09:53:17,717 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=320436.6666666667, ans=0.0 2023-10-10 09:53:20,570 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=320436.6666666667, ans=0.2 2023-10-10 09:53:45,813 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=320530.0, ans=0.1 2023-10-10 09:54:32,277 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 09:54:56,025 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.88 vs. limit=12.0 2023-10-10 09:54:57,876 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=320856.6666666667, ans=0.95 2023-10-10 09:54:58,585 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.470e+02 1.721e+02 1.871e+02 1.985e+02 2.940e+02, threshold=3.741e+02, percent-clipped=0.0 2023-10-10 09:55:04,942 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=320856.6666666667, ans=0.125 2023-10-10 09:55:10,399 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=320903.3333333333, ans=0.0 2023-10-10 09:55:21,733 INFO [train.py:1031] (3/4) Epoch 6, batch 500, loss[loss=0.2141, simple_loss=0.2993, pruned_loss=0.06443, over 16500.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3115, pruned_loss=0.07209, over 7318585.81 frames. ], batch size: 50, lr: 6.56e-03, grad_scale: 32.0 2023-10-10 09:55:21,973 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=320950.0, ans=0.0 2023-10-10 09:55:23,088 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=320950.0, ans=0.125 2023-10-10 09:55:26,032 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.91 vs. limit=10.0 2023-10-10 09:55:30,926 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.94 vs. limit=15.0 2023-10-10 09:55:31,804 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.53 vs. limit=15.0 2023-10-10 09:55:32,627 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=320996.6666666667, ans=0.125 2023-10-10 09:56:09,243 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.63 vs. limit=22.5 2023-10-10 09:56:11,330 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=321136.6666666667, ans=0.125 2023-10-10 09:56:23,753 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.28 vs. 
limit=15.0 2023-10-10 09:56:42,548 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=321276.6666666667, ans=0.125 2023-10-10 09:56:49,914 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.482e+02 1.832e+02 1.980e+02 2.406e+02 3.202e+02, threshold=3.960e+02, percent-clipped=0.0 2023-10-10 09:57:00,422 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=321370.0, ans=0.1 2023-10-10 09:57:05,354 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=321370.0, ans=0.125 2023-10-10 09:57:20,619 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=321463.3333333333, ans=0.025 2023-10-10 09:57:24,182 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.36 vs. limit=22.5 2023-10-10 09:57:26,946 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=321463.3333333333, ans=0.0 2023-10-10 09:57:28,716 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=321463.3333333333, ans=0.2 2023-10-10 09:57:56,870 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=321603.3333333333, ans=0.125 2023-10-10 09:57:56,969 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=321603.3333333333, ans=0.5 2023-10-10 09:58:02,795 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=321650.0, ans=0.0 2023-10-10 09:58:16,195 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=321696.6666666667, ans=0.125 2023-10-10 09:58:17,861 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=321696.6666666667, ans=0.0 2023-10-10 09:58:36,096 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.67 vs. limit=15.0 2023-10-10 09:58:36,469 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.409e+02 1.785e+02 2.073e+02 2.409e+02 3.539e+02, threshold=4.145e+02, percent-clipped=0.0 2023-10-10 09:58:41,929 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.14 vs. limit=15.0 2023-10-10 09:58:49,418 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.15 vs. limit=22.5 2023-10-10 09:58:57,280 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=6.24 vs. 
limit=15.0 2023-10-10 09:59:30,351 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=322023.3333333333, ans=0.1 2023-10-10 09:59:31,710 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=2.98 vs. limit=12.0 2023-10-10 09:59:54,071 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=322116.6666666667, ans=0.1 2023-10-10 09:59:59,282 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=322116.6666666667, ans=0.0 2023-10-10 10:00:00,266 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=322116.6666666667, ans=0.0 2023-10-10 10:00:10,092 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=322163.3333333333, ans=0.0 2023-10-10 10:00:25,970 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.395e+02 1.766e+02 2.044e+02 2.347e+02 3.652e+02, threshold=4.089e+02, percent-clipped=0.0 2023-10-10 10:00:32,527 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=322256.6666666667, ans=0.125 2023-10-10 10:00:51,091 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=322350.0, ans=0.07 2023-10-10 10:01:04,631 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=322396.6666666667, ans=0.125 2023-10-10 10:01:04,707 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=322396.6666666667, ans=0.04949747468305833 2023-10-10 10:01:22,926 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=322490.0, ans=0.125 2023-10-10 10:01:24,183 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 10:01:34,608 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=322490.0, ans=0.125 2023-10-10 10:01:50,126 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.68 vs. limit=15.0 2023-10-10 10:02:01,428 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.34 vs. 
limit=22.5 2023-10-10 10:02:23,437 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.386e+02 1.734e+02 1.942e+02 2.225e+02 3.170e+02, threshold=3.885e+02, percent-clipped=0.0 2023-10-10 10:02:40,052 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=322770.0, ans=0.125 2023-10-10 10:02:43,012 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=322816.6666666667, ans=0.025 2023-10-10 10:03:16,055 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=322956.6666666667, ans=0.1 2023-10-10 10:03:20,943 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=322956.6666666667, ans=0.2 2023-10-10 10:03:25,535 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=322956.6666666667, ans=0.125 2023-10-10 10:03:39,602 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=323003.3333333333, ans=0.1 2023-10-10 10:03:46,752 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.07 vs. limit=12.0 2023-10-10 10:04:07,404 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.94 vs. limit=15.0 2023-10-10 10:04:13,028 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.63 vs. limit=10.0 2023-10-10 10:04:13,620 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=323143.3333333333, ans=0.035 2023-10-10 10:04:16,927 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.403e+02 1.682e+02 1.886e+02 2.092e+02 3.152e+02, threshold=3.772e+02, percent-clipped=0.0 2023-10-10 10:04:37,511 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=323236.6666666667, ans=0.05 2023-10-10 10:04:37,605 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=323236.6666666667, ans=0.05 2023-10-10 10:04:39,072 INFO [train.py:1031] (3/4) Epoch 6, batch 1000, loss[loss=0.2037, simple_loss=0.2976, pruned_loss=0.05485, over 16820.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3123, pruned_loss=0.07286, over 12954658.80 frames. ], batch size: 98, lr: 6.54e-03, grad_scale: 32.0 2023-10-10 10:04:39,320 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=323283.3333333333, ans=0.1 2023-10-10 10:04:40,331 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_na.min_abs, batch_count=323283.3333333333, ans=0.02 2023-10-10 10:04:45,352 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.43 vs. 
limit=22.5 2023-10-10 10:04:48,230 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=323283.3333333333, ans=0.125 2023-10-10 10:04:49,046 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=323330.0, ans=0.125 2023-10-10 10:04:50,526 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=323330.0, ans=0.125 2023-10-10 10:05:04,519 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=323376.6666666667, ans=0.0 2023-10-10 10:05:05,628 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.35 vs. limit=15.0 2023-10-10 10:05:06,311 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=323376.6666666667, ans=0.125 2023-10-10 10:05:11,666 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 10:05:30,797 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=323516.6666666667, ans=0.1 2023-10-10 10:05:38,993 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=323516.6666666667, ans=0.1 2023-10-10 10:05:49,076 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=323563.3333333333, ans=0.2 2023-10-10 10:05:49,910 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=323563.3333333333, ans=0.125 2023-10-10 10:05:52,243 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=323610.0, ans=0.95 2023-10-10 10:06:01,113 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=323610.0, ans=0.1 2023-10-10 10:06:04,588 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.445e+02 1.757e+02 2.017e+02 2.290e+02 3.041e+02, threshold=4.034e+02, percent-clipped=0.0 2023-10-10 10:06:15,423 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=323703.3333333333, ans=0.125 2023-10-10 10:06:31,234 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=323750.0, ans=0.125 2023-10-10 10:06:38,715 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.82 vs. limit=6.0 2023-10-10 10:06:53,849 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=323843.3333333333, ans=0.2 2023-10-10 10:07:01,161 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=323843.3333333333, ans=0.0 2023-10-10 10:07:03,029 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.24 vs. 
limit=15.0 2023-10-10 10:07:15,559 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=12.36 vs. limit=22.5 2023-10-10 10:07:21,159 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=323936.6666666667, ans=0.0 2023-10-10 10:07:27,522 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=323983.3333333333, ans=0.125 2023-10-10 10:07:27,609 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=323983.3333333333, ans=0.1 2023-10-10 10:07:49,720 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=324030.0, ans=0.125 2023-10-10 10:07:49,737 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=324030.0, ans=0.125 2023-10-10 10:08:03,905 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.69 vs. limit=12.0 2023-10-10 10:08:05,051 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.453e+02 1.739e+02 1.896e+02 2.112e+02 3.425e+02, threshold=3.791e+02, percent-clipped=0.0 2023-10-10 10:08:08,606 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.86 vs. limit=15.0 2023-10-10 10:08:10,998 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=324123.3333333333, ans=0.125 2023-10-10 10:08:39,131 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=324263.3333333333, ans=0.0 2023-10-10 10:08:41,170 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=324263.3333333333, ans=0.0 2023-10-10 10:08:42,130 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 10:08:53,372 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 10:08:59,889 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.79 vs. 
limit=6.0 2023-10-10 10:09:02,081 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=324356.6666666667, ans=0.125 2023-10-10 10:09:04,595 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=324356.6666666667, ans=0.1 2023-10-10 10:09:09,302 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=324403.3333333333, ans=0.0 2023-10-10 10:09:12,157 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=324403.3333333333, ans=0.125 2023-10-10 10:09:29,610 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=324496.6666666667, ans=0.125 2023-10-10 10:09:39,565 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=324543.3333333333, ans=0.0 2023-10-10 10:09:44,833 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.70 vs. limit=15.0 2023-10-10 10:09:51,128 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.345e+02 1.673e+02 1.943e+02 2.176e+02 2.989e+02, threshold=3.885e+02, percent-clipped=0.0 2023-10-10 10:09:54,199 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.68 vs. limit=15.0 2023-10-10 10:09:58,693 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=324636.6666666667, ans=0.0 2023-10-10 10:10:08,193 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=324636.6666666667, ans=0.0 2023-10-10 10:10:11,243 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.49 vs. 
limit=10.0 2023-10-10 10:10:37,592 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=324776.6666666667, ans=0.0 2023-10-10 10:10:48,659 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=324823.3333333333, ans=0.125 2023-10-10 10:10:55,467 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=324870.0, ans=0.0 2023-10-10 10:10:56,518 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=324870.0, ans=0.125 2023-10-10 10:11:03,776 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=324870.0, ans=0.125 2023-10-10 10:11:11,249 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=324916.6666666667, ans=0.07 2023-10-10 10:11:25,183 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=324963.3333333333, ans=0.125 2023-10-10 10:11:33,354 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=325010.0, ans=0.0 2023-10-10 10:11:41,076 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.343e+02 1.687e+02 1.892e+02 2.269e+02 3.526e+02, threshold=3.783e+02, percent-clipped=0.0 2023-10-10 10:11:44,811 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=325056.6666666667, ans=0.125 2023-10-10 10:11:49,248 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=325103.3333333333, ans=0.0 2023-10-10 10:11:50,171 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=325103.3333333333, ans=0.125 2023-10-10 10:11:56,668 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=325103.3333333333, ans=0.0 2023-10-10 10:12:08,647 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=325150.0, ans=0.0 2023-10-10 10:12:21,876 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=325243.3333333333, ans=0.1 2023-10-10 10:12:32,299 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=325243.3333333333, ans=0.125 2023-10-10 10:12:45,676 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=325336.6666666667, ans=0.0 2023-10-10 10:12:54,335 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=325336.6666666667, ans=0.125 2023-10-10 10:13:08,667 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=325430.0, ans=0.1 2023-10-10 10:13:14,633 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=325430.0, ans=0.04949747468305833
2023-10-10 10:13:19,821 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=325476.6666666667, ans=0.0 2023-10-10 10:13:32,781 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.425e+02 1.812e+02 2.015e+02 2.234e+02 3.774e+02, threshold=4.030e+02, percent-clipped=0.0 2023-10-10 10:13:33,118 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=325523.3333333333, ans=0.0 2023-10-10 10:13:45,981 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=325570.0, ans=0.2 2023-10-10 10:13:56,563 INFO [train.py:1031] (3/4) Epoch 6, batch 1500, loss[loss=0.2033, simple_loss=0.2915, pruned_loss=0.05759, over 16909.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3099, pruned_loss=0.0717, over 17325390.23 frames. ], batch size: 130, lr: 6.51e-03, grad_scale: 32.0 2023-10-10 10:13:58,178 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 10:14:00,925 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=325616.6666666667, ans=0.07 2023-10-10 10:14:04,989 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=325616.6666666667, ans=0.0 2023-10-10 10:14:20,751 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.38 vs. limit=15.0 2023-10-10 10:14:23,417 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=325710.0, ans=0.0 2023-10-10 10:14:25,683 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=325710.0, ans=0.125 2023-10-10 10:14:37,164 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=325756.6666666667, ans=0.125 2023-10-10 10:14:57,232 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=325850.0, ans=0.125 2023-10-10 10:15:02,900 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=325896.6666666667, ans=0.0 2023-10-10 10:15:05,013 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=325896.6666666667, ans=0.0 2023-10-10 10:15:20,816 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.95 vs. limit=22.5 2023-10-10 10:15:26,390 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.253e+02 1.722e+02 1.938e+02 2.277e+02 3.511e+02, threshold=3.877e+02, percent-clipped=0.0 2023-10-10 10:15:38,829 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.99 vs. limit=15.0 2023-10-10 10:15:41,085 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=326036.6666666667, ans=0.125 2023-10-10 10:16:07,144 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.38 vs.
limit=15.0 2023-10-10 10:16:14,044 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten.whitening_limit, batch_count=326176.6666666667, ans=15.0 2023-10-10 10:16:20,038 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_ff3.min_abs, batch_count=326223.3333333333, ans=0.2 2023-10-10 10:16:28,891 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=326223.3333333333, ans=0.2 2023-10-10 10:16:50,578 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=326316.6666666667, ans=10.0 2023-10-10 10:17:03,101 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=326363.3333333333, ans=0.125 2023-10-10 10:17:20,499 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.449e+02 1.691e+02 1.880e+02 2.078e+02 2.981e+02, threshold=3.760e+02, percent-clipped=0.0 2023-10-10 10:17:22,106 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.41 vs. limit=15.0 2023-10-10 10:17:34,596 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=326503.3333333333, ans=0.0 2023-10-10 10:17:53,087 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=326596.6666666667, ans=0.125 2023-10-10 10:17:57,242 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=326596.6666666667, ans=0.125 2023-10-10 10:18:29,705 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.83 vs. limit=15.0 2023-10-10 10:18:31,520 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.55 vs. 
limit=15.0 2023-10-10 10:18:44,036 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=326830.0, ans=0.125 2023-10-10 10:18:47,226 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=326830.0, ans=0.0 2023-10-10 10:19:01,419 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=326876.6666666667, ans=0.5 2023-10-10 10:19:06,875 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.391e+02 1.736e+02 1.903e+02 2.095e+02 2.601e+02, threshold=3.807e+02, percent-clipped=0.0 2023-10-10 10:20:03,858 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=327156.6666666667, ans=0.5 2023-10-10 10:20:11,492 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=327156.6666666667, ans=0.125 2023-10-10 10:20:39,024 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=327296.6666666667, ans=0.125 2023-10-10 10:20:51,703 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=327343.3333333333, ans=0.125 2023-10-10 10:20:58,453 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.359e+02 1.676e+02 1.844e+02 2.137e+02 2.875e+02, threshold=3.689e+02, percent-clipped=0.0 2023-10-10 10:21:18,761 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=327483.3333333333, ans=0.125 2023-10-10 10:21:31,233 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=327530.0, ans=0.125 2023-10-10 10:21:45,162 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=327576.6666666667, ans=0.0 2023-10-10 10:21:46,102 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=327576.6666666667, ans=0.2 2023-10-10 10:22:01,141 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=327623.3333333333, ans=0.125 2023-10-10 10:22:04,623 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=327670.0, ans=0.125 2023-10-10 10:22:22,694 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.68 vs. limit=6.0 2023-10-10 10:22:23,709 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=327716.6666666667, ans=0.07 2023-10-10 10:22:58,609 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.278e+02 1.663e+02 1.822e+02 2.048e+02 2.846e+02, threshold=3.645e+02, percent-clipped=0.0 2023-10-10 10:23:01,525 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.97 vs. limit=22.5 2023-10-10 10:23:04,980 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.58 vs. 
limit=15.0 2023-10-10 10:23:06,624 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=327856.6666666667, ans=0.0 2023-10-10 10:23:12,652 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=327903.3333333333, ans=0.1 2023-10-10 10:23:16,523 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=327903.3333333333, ans=0.125 2023-10-10 10:23:19,019 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=327903.3333333333, ans=0.125 2023-10-10 10:23:20,909 INFO [train.py:1031] (3/4) Epoch 6, batch 2000, loss[loss=0.244, simple_loss=0.3158, pruned_loss=0.08611, over 15674.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3107, pruned_loss=0.07174, over 20772557.09 frames. ], batch size: 350, lr: 6.49e-03, grad_scale: 32.0 2023-10-10 10:23:24,398 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.91 vs. limit=15.0 2023-10-10 10:23:55,135 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=328043.3333333333, ans=0.125 2023-10-10 10:24:11,363 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.61 vs. limit=15.0 2023-10-10 10:24:22,113 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=328136.6666666667, ans=0.0 2023-10-10 10:24:24,573 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=328136.6666666667, ans=0.125 2023-10-10 10:24:30,679 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=328183.3333333333, ans=0.125 2023-10-10 10:24:54,398 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.83 vs. limit=15.0 2023-10-10 10:25:01,557 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=328323.3333333333, ans=0.125 2023-10-10 10:25:02,272 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.326e+02 1.692e+02 1.876e+02 2.056e+02 3.072e+02, threshold=3.753e+02, percent-clipped=0.0 2023-10-10 10:25:23,035 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=328416.6666666667, ans=0.125 2023-10-10 10:25:25,216 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=328416.6666666667, ans=0.125 2023-10-10 10:25:34,134 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.12 vs. 
limit=15.0 2023-10-10 10:26:42,035 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=328650.0, ans=0.125 2023-10-10 10:26:44,098 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=328650.0, ans=0.0 2023-10-10 10:26:49,979 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=328650.0, ans=0.125 2023-10-10 10:27:01,699 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=328743.3333333333, ans=0.04949747468305833 2023-10-10 10:27:13,234 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=328790.0, ans=0.1 2023-10-10 10:27:16,222 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.464e+02 1.761e+02 1.994e+02 2.373e+02 3.351e+02, threshold=3.987e+02, percent-clipped=0.0 2023-10-10 10:27:20,360 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.55 vs. limit=15.0 2023-10-10 10:27:40,370 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.32 vs. limit=15.0 2023-10-10 10:27:54,100 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.81 vs. limit=6.0 2023-10-10 10:27:57,414 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=328976.6666666667, ans=0.125 2023-10-10 10:28:12,930 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.43 vs. limit=10.0 2023-10-10 10:28:18,626 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=329023.3333333333, ans=0.125 2023-10-10 10:28:25,380 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff2.min_abs, batch_count=329070.0, ans=0.1 2023-10-10 10:29:06,159 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=329256.6666666667, ans=0.125 2023-10-10 10:29:06,808 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.424e+02 1.774e+02 1.956e+02 2.277e+02 3.393e+02, threshold=3.912e+02, percent-clipped=0.0 2023-10-10 10:29:35,479 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.51 vs. 
limit=10.0 2023-10-10 10:30:32,340 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=329630.0, ans=0.125 2023-10-10 10:30:42,626 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=329676.6666666667, ans=0.0 2023-10-10 10:30:55,333 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.375e+02 1.775e+02 1.973e+02 2.192e+02 3.529e+02, threshold=3.946e+02, percent-clipped=0.0 2023-10-10 10:30:58,682 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=329723.3333333333, ans=0.125 2023-10-10 10:31:09,296 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=329770.0, ans=0.2 2023-10-10 10:31:12,858 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.08 vs. limit=15.0 2023-10-10 10:31:15,985 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=329816.6666666667, ans=0.125 2023-10-10 10:31:49,963 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.30 vs. limit=22.5 2023-10-10 10:32:06,324 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=330050.0, ans=0.125 2023-10-10 10:32:15,682 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=330096.6666666667, ans=0.1 2023-10-10 10:32:17,633 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.83 vs. limit=22.5 2023-10-10 10:32:36,300 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=330190.0, ans=0.0 2023-10-10 10:32:39,659 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.499e+02 1.789e+02 1.929e+02 2.145e+02 3.593e+02, threshold=3.858e+02, percent-clipped=0.0 2023-10-10 10:32:49,373 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.95 vs. limit=10.0 2023-10-10 10:32:49,944 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=330236.6666666667, ans=0.0 2023-10-10 10:32:57,395 INFO [train.py:1031] (3/4) Epoch 6, batch 2500, loss[loss=0.2361, simple_loss=0.3234, pruned_loss=0.07445, over 16891.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3104, pruned_loss=0.07153, over 23440489.04 frames. ], batch size: 77, lr: 6.47e-03, grad_scale: 32.0 2023-10-10 10:33:01,647 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.33 vs. 
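The [scaling.py:199] ScheduledFloat records that dominate this log are periodic dumps of scheduled hyperparameters: each named quantity (dropout probabilities, skip rates, balancer bounds, whitening limits) is a float whose current value ans is computed from the global batch_count rather than held constant, which is why the various skip_rate entries report ans=0.0 this late in training. The exact schedules live in icefall's scaling.py; the class below is only an illustrative piecewise-linear stand-in with made-up breakpoints, meant to show the shape of the mechanism.

```python
class ScheduledFloatSketch:
    """Illustrative piecewise-linear schedule over batch_count.

    NOT icefall's ScheduledFloat; the breakpoints here are invented.
    Defined by (batch_count, value) pairs; values are interpolated
    linearly between breakpoints and held constant outside them."""

    def __init__(self, *points):
        self.points = sorted(points)  # e.g. (0, 0.2), (20000, 0.0)

    def value(self, batch_count):
        pts = self.points
        if batch_count <= pts[0][0]:
            return pts[0][1]
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if batch_count <= x1:
                t = (batch_count - x0) / (x1 - x0)
                return y0 + t * (y1 - y0)
        return pts[-1][1]

# A skip-rate that anneals from 0.2 to 0.0 over the first 20k batches
# would report ans=0.0 at the batch_counts seen in this section:
skip_rate = ScheduledFloatSketch((0, 0.2), (20000, 0.0))
print(skip_rate.value(330000))  # -> 0.0
```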
limit=22.5 2023-10-10 10:33:23,225 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=330376.6666666667, ans=0.125 2023-10-10 10:33:38,989 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=330470.0, ans=0.1 2023-10-10 10:33:44,483 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=330470.0, ans=0.0 2023-10-10 10:33:45,716 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.05 vs. limit=10.0 2023-10-10 10:33:48,284 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=330516.6666666667, ans=0.125 2023-10-10 10:33:55,221 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=330516.6666666667, ans=0.04949747468305833 2023-10-10 10:34:01,105 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.57 vs. limit=15.0 2023-10-10 10:34:02,922 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.87 vs. limit=15.0 2023-10-10 10:34:17,672 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=330610.0, ans=0.0 2023-10-10 10:34:20,962 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=330656.6666666667, ans=0.025 2023-10-10 10:34:23,143 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.409e+02 1.814e+02 1.991e+02 2.240e+02 3.669e+02, threshold=3.983e+02, percent-clipped=0.0 2023-10-10 10:34:23,413 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=330656.6666666667, ans=0.125 2023-10-10 10:34:31,862 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 10:34:33,485 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=330703.3333333333, ans=0.05 2023-10-10 10:34:52,894 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.60 vs. 
limit=22.5 2023-10-10 10:34:55,525 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=330796.6666666667, ans=0.125 2023-10-10 10:35:08,894 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=330843.3333333333, ans=0.1 2023-10-10 10:35:28,428 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=330936.6666666667, ans=0.125 2023-10-10 10:35:40,424 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=330983.3333333333, ans=0.1 2023-10-10 10:35:41,468 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=330983.3333333333, ans=0.125 2023-10-10 10:35:54,764 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=331076.6666666667, ans=0.0 2023-10-10 10:36:09,230 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.382e+02 1.738e+02 2.130e+02 2.615e+02 4.705e+02, threshold=4.259e+02, percent-clipped=1.0 2023-10-10 10:36:10,572 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=331123.3333333333, ans=0.0 2023-10-10 10:36:30,447 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=331170.0, ans=0.125 2023-10-10 10:36:39,624 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=331216.6666666667, ans=0.0 2023-10-10 10:36:53,303 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.28 vs. 
limit=10.0 2023-10-10 10:37:03,639 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=331310.0, ans=0.2 2023-10-10 10:37:18,966 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=331356.6666666667, ans=0.02 2023-10-10 10:37:37,915 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=331450.0, ans=0.025 2023-10-10 10:37:52,587 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=331496.6666666667, ans=0.125 2023-10-10 10:37:58,410 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=331543.3333333333, ans=0.5 2023-10-10 10:38:14,122 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.339e+02 1.723e+02 1.916e+02 2.197e+02 3.578e+02, threshold=3.831e+02, percent-clipped=0.0 2023-10-10 10:38:19,478 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=331590.0, ans=0.125 2023-10-10 10:38:32,542 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 10:38:32,611 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=331636.6666666667, ans=0.125 2023-10-10 10:38:37,866 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.88 vs. limit=15.0 2023-10-10 10:38:58,914 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=331776.6666666667, ans=0.125 2023-10-10 10:39:04,333 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=331776.6666666667, ans=0.125 2023-10-10 10:39:16,951 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.30 vs. limit=15.0 2023-10-10 10:39:19,202 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 10:39:37,425 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=331916.6666666667, ans=0.1 2023-10-10 10:40:05,159 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=332010.0, ans=0.125 2023-10-10 10:40:16,839 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.405e+02 1.671e+02 1.891e+02 2.249e+02 3.716e+02, threshold=3.782e+02, percent-clipped=0.0 2023-10-10 10:40:54,201 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=332196.6666666667, ans=0.0 2023-10-10 10:40:56,835 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_na.min_abs, batch_count=332196.6666666667, ans=0.02 2023-10-10 10:40:59,758 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.47 vs. 
limit=15.0 2023-10-10 10:41:04,160 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.29 vs. limit=15.0 2023-10-10 10:41:08,408 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=332243.3333333333, ans=0.125 2023-10-10 10:41:22,279 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=332290.0, ans=0.0 2023-10-10 10:41:38,347 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.54 vs. limit=15.0 2023-10-10 10:41:39,688 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=332383.3333333333, ans=0.1 2023-10-10 10:41:41,718 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=332383.3333333333, ans=0.125 2023-10-10 10:41:43,118 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.90 vs. limit=22.5 2023-10-10 10:41:50,814 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=332430.0, ans=0.1 2023-10-10 10:41:58,604 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=332476.6666666667, ans=0.1 2023-10-10 10:42:03,322 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=332476.6666666667, ans=0.1 2023-10-10 10:42:04,581 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.44 vs. limit=10.0 2023-10-10 10:42:11,200 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.96 vs. limit=12.0 2023-10-10 10:42:11,669 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=332523.3333333333, ans=10.0 2023-10-10 10:42:12,303 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.369e+02 1.686e+02 1.977e+02 2.244e+02 2.741e+02, threshold=3.953e+02, percent-clipped=0.0 2023-10-10 10:42:13,892 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.45 vs. limit=22.5 2023-10-10 10:42:22,384 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 10:42:30,905 INFO [train.py:1031] (3/4) Epoch 6, batch 3000, loss[loss=0.2043, simple_loss=0.285, pruned_loss=0.06181, over 15943.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3096, pruned_loss=0.0716, over 25488201.87 frames. 
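In the [train.py:1031] batch summaries, loss[...] describes the current batch (averaged over its frames) while tot_loss[...] is a frame-weighted aggregate over many past batches. The logged frame totals grow by progressively smaller amounts across this section (+2.7M, +2.0M, +1.6M frames per 500 batches), which points to an exponentially forgetting accumulator approaching a plateau rather than a plain cumulative sum. A sketch under that assumption, with a guessed decay constant that is not the recipe's actual value:

```python
class DecayedFrameAverage:
    """Sketch of tot_loss[...]: a frame-weighted loss sum with exponential
    forgetting, so recent batches dominate and the effective frame count
    saturates. The decay constant below is a guess for illustration."""

    def __init__(self, decay=0.9999):
        self.decay = decay
        self.weighted_sum = 0.0
        self.frames = 0.0

    def update(self, batch_loss, batch_frames):
        self.weighted_sum = self.decay * self.weighted_sum + batch_loss * batch_frames
        self.frames = self.decay * self.frames + batch_frames

    @property
    def value(self):
        return self.weighted_sum / max(self.frames, 1.0)

tot = DecayedFrameAverage()
for _ in range(5000):
    tot.update(batch_loss=0.23, batch_frames=16000)  # numbers in the style of the log
print(f"tot_loss[loss={tot.value:.4g}, over {tot.frames:.2f} frames. ]")
```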
], batch size: 43, lr: 6.45e-03, grad_scale: 32.0 2023-10-10 10:43:00,686 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=332756.6666666667, ans=0.2 2023-10-10 10:43:03,517 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=332756.6666666667, ans=0.125 2023-10-10 10:43:07,525 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=332756.6666666667, ans=0.2 2023-10-10 10:43:12,503 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=332803.3333333333, ans=0.125 2023-10-10 10:43:28,010 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=332850.0, ans=0.04949747468305833 2023-10-10 10:43:56,726 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.384e+02 1.708e+02 1.886e+02 2.047e+02 3.045e+02, threshold=3.771e+02, percent-clipped=0.0 2023-10-10 10:44:17,022 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=333083.3333333333, ans=0.1 2023-10-10 10:44:27,362 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.67 vs. limit=15.0 2023-10-10 10:44:31,159 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.66 vs. limit=15.0 2023-10-10 10:44:39,739 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=333130.0, ans=0.1 2023-10-10 10:44:42,993 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=333176.6666666667, ans=0.125 2023-10-10 10:44:43,133 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=333176.6666666667, ans=0.0 2023-10-10 10:45:05,853 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=333270.0, ans=0.125 2023-10-10 10:45:19,438 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=333316.6666666667, ans=0.0 2023-10-10 10:45:22,351 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=333316.6666666667, ans=0.0 2023-10-10 10:45:48,389 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.384e+02 1.670e+02 1.874e+02 2.211e+02 3.569e+02, threshold=3.748e+02, percent-clipped=0.0 2023-10-10 10:45:56,401 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=333503.3333333333, ans=0.125 2023-10-10 10:45:57,270 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.054e-02 2023-10-10 10:46:01,516 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=333503.3333333333, ans=0.2 2023-10-10 10:46:02,757 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.37 vs. 
limit=6.0 2023-10-10 10:46:17,589 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=333596.6666666667, ans=0.1 2023-10-10 10:46:22,264 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=333596.6666666667, ans=0.1 2023-10-10 10:46:25,270 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=333596.6666666667, ans=0.1 2023-10-10 10:46:37,001 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=333643.3333333333, ans=0.125 2023-10-10 10:46:37,059 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=333643.3333333333, ans=0.0 2023-10-10 10:46:53,695 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.21 vs. limit=22.5 2023-10-10 10:46:57,351 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=333736.6666666667, ans=0.125 2023-10-10 10:47:00,756 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.38 vs. limit=6.0 2023-10-10 10:47:01,293 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=333736.6666666667, ans=0.125 2023-10-10 10:47:08,288 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.51 vs. limit=22.5 2023-10-10 10:47:18,986 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=333783.3333333333, ans=0.2 2023-10-10 10:47:20,428 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=333783.3333333333, ans=0.125 2023-10-10 10:47:30,892 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=333830.0, ans=0.0 2023-10-10 10:47:36,088 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=333876.6666666667, ans=0.1 2023-10-10 10:47:41,601 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=333876.6666666667, ans=0.125 2023-10-10 10:47:50,314 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.502e+02 1.728e+02 1.860e+02 2.157e+02 3.260e+02, threshold=3.721e+02, percent-clipped=0.0 2023-10-10 10:48:01,306 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=333970.0, ans=0.2 2023-10-10 10:48:35,348 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=334110.0, ans=0.125 2023-10-10 10:48:57,137 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=334203.3333333333, ans=0.125 2023-10-10 10:49:18,604 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.79 vs. 
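The [scaling.py:979] Whitening records report a per-module diagnostic of how far a group of activations is from having a white (isotropic) channel covariance, printed when the metric is checked against its scheduled limit; values near 1 indicate well-conditioned activations, while large values (the self_attn whiten entries in this section run up to about 21 against a limit of 22.5) flag a few dominating directions. The metric below is a plausible stand-in only, the largest-to-mean eigenvalue ratio of each group's covariance, and not the actual scaling.py formula:

```python
import torch

def whitening_metric_sketch(x, num_groups):
    """Hypothetical whitening diagnostic: ratio of the largest eigenvalue
    of each group's channel covariance to the mean eigenvalue. Equals 1.0
    for perfectly white activations; grows when a few directions dominate.
    A stand-in for illustration, not scaling.py's metric.

    x: (num_frames, num_channels) activations."""
    n, c = x.shape
    per_group = c // num_groups
    metrics = []
    for g in range(num_groups):
        xg = x[:, g * per_group:(g + 1) * per_group]
        xg = xg - xg.mean(dim=0, keepdim=True)
        cov = (xg.T @ xg) / n                      # channel covariance of this group
        eigs = torch.linalg.eigvalsh(cov)
        metrics.append((eigs.max() / eigs.mean().clamp(min=1e-20)).item())
    return max(metrics)

x = torch.randn(2000, 256)
print(f"metric={whitening_metric_sketch(x, num_groups=1):.2f} vs. limit=15.0")
```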
limit=10.0 2023-10-10 10:49:31,587 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=334343.3333333333, ans=0.0 2023-10-10 10:49:41,148 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.486e+02 1.695e+02 1.895e+02 2.177e+02 3.583e+02, threshold=3.790e+02, percent-clipped=0.0 2023-10-10 10:49:54,812 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=334436.6666666667, ans=0.125 2023-10-10 10:50:16,680 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=334530.0, ans=0.0 2023-10-10 10:50:36,416 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=334623.3333333333, ans=0.0 2023-10-10 10:50:40,653 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.05 vs. limit=22.5 2023-10-10 10:50:48,052 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=334670.0, ans=0.1 2023-10-10 10:50:54,185 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=334670.0, ans=0.0 2023-10-10 10:51:16,267 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=334763.3333333333, ans=0.125 2023-10-10 10:51:16,522 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.38 vs. limit=22.5 2023-10-10 10:51:24,665 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.42 vs. limit=15.0 2023-10-10 10:51:31,148 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=334856.6666666667, ans=0.0 2023-10-10 10:51:36,854 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.381e+02 1.720e+02 1.948e+02 2.139e+02 3.255e+02, threshold=3.895e+02, percent-clipped=0.0 2023-10-10 10:51:40,902 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=334856.6666666667, ans=0.125 2023-10-10 10:51:44,637 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=334903.3333333333, ans=0.125 2023-10-10 10:51:47,999 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=334903.3333333333, ans=0.125 2023-10-10 10:51:54,827 INFO [train.py:1031] (3/4) Epoch 6, batch 3500, loss[loss=0.2171, simple_loss=0.3088, pruned_loss=0.06269, over 16867.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3095, pruned_loss=0.07163, over 27098458.73 frames. 
], batch size: 155, lr: 6.42e-03, grad_scale: 16.0 2023-10-10 10:51:58,628 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=334950.0, ans=0.125 2023-10-10 10:52:00,849 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=334950.0, ans=0.125 2023-10-10 10:52:01,626 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=334950.0, ans=0.125 2023-10-10 10:52:01,757 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=334950.0, ans=0.0 2023-10-10 10:52:21,611 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=335043.3333333333, ans=0.125 2023-10-10 10:52:26,323 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=335043.3333333333, ans=0.125 2023-10-10 10:52:26,343 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=335043.3333333333, ans=0.2 2023-10-10 10:52:28,144 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.45 vs. limit=15.0 2023-10-10 10:52:32,698 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=335090.0, ans=0.125 2023-10-10 10:52:41,089 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=335136.6666666667, ans=0.0 2023-10-10 10:53:24,185 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=335276.6666666667, ans=0.1 2023-10-10 10:53:34,796 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.425e+02 1.712e+02 1.901e+02 2.198e+02 3.386e+02, threshold=3.802e+02, percent-clipped=0.0 2023-10-10 10:53:37,668 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff3.min_abs, batch_count=335323.3333333333, ans=0.2 2023-10-10 10:54:01,819 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=335416.6666666667, ans=0.0 2023-10-10 10:54:28,039 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=335556.6666666667, ans=0.1 2023-10-10 10:54:36,998 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=335556.6666666667, ans=0.125 2023-10-10 10:54:46,352 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=335603.3333333333, ans=0.0 2023-10-10 10:55:01,735 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=335696.6666666667, ans=0.09899494936611666 2023-10-10 10:55:19,172 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.74 vs. 
limit=6.0 2023-10-10 10:55:23,031 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 10:55:26,904 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=335790.0, ans=0.1 2023-10-10 10:55:29,442 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.412e+02 1.723e+02 1.919e+02 2.327e+02 3.270e+02, threshold=3.837e+02, percent-clipped=0.0 2023-10-10 10:55:34,023 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.49 vs. limit=22.5 2023-10-10 10:55:42,731 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=335836.6666666667, ans=0.125 2023-10-10 10:56:05,426 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=335930.0, ans=10.0 2023-10-10 10:56:10,910 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=335976.6666666667, ans=0.125 2023-10-10 10:56:35,264 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.40 vs. limit=15.0 2023-10-10 10:57:11,840 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.91 vs. limit=6.0 2023-10-10 10:57:20,708 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 10:57:27,112 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=336256.6666666667, ans=0.125 2023-10-10 10:57:29,753 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.334e+02 1.659e+02 1.832e+02 2.147e+02 3.516e+02, threshold=3.664e+02, percent-clipped=0.0 2023-10-10 10:57:44,965 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 10:57:55,679 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=336350.0, ans=0.2 2023-10-10 10:58:15,616 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 10:58:59,423 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=336630.0, ans=0.125 2023-10-10 10:59:07,850 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=336676.6666666667, ans=0.0 2023-10-10 10:59:12,182 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.66 vs. 
limit=15.0 2023-10-10 10:59:19,881 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=336723.3333333333, ans=0.0 2023-10-10 10:59:21,368 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.420e+02 1.635e+02 1.838e+02 1.989e+02 3.389e+02, threshold=3.677e+02, percent-clipped=0.0 2023-10-10 10:59:23,720 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.39 vs. limit=15.0 2023-10-10 10:59:39,248 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=336816.6666666667, ans=0.125 2023-10-10 10:59:45,335 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten.whitening_limit, batch_count=336816.6666666667, ans=15.0 2023-10-10 10:59:47,802 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=336816.6666666667, ans=0.1 2023-10-10 10:59:47,994 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.67 vs. limit=22.5 2023-10-10 10:59:54,732 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.84 vs. limit=15.0 2023-10-10 11:00:03,184 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=336910.0, ans=0.125 2023-10-10 11:00:04,898 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=336910.0, ans=0.5 2023-10-10 11:00:12,201 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 11:00:23,329 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=337003.3333333333, ans=0.125 2023-10-10 11:00:24,473 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.77 vs. limit=12.0 2023-10-10 11:00:29,377 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=337003.3333333333, ans=0.125 2023-10-10 11:00:32,661 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=337050.0, ans=0.0 2023-10-10 11:00:35,882 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=337050.0, ans=0.1 2023-10-10 11:00:37,817 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=337050.0, ans=0.035 2023-10-10 11:00:40,858 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=337096.6666666667, ans=0.0 2023-10-10 11:00:48,859 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.05 vs. 
limit=15.0 2023-10-10 11:00:49,752 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=337096.6666666667, ans=0.1 2023-10-10 11:00:53,516 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=337143.3333333333, ans=0.125 2023-10-10 11:01:09,861 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.429e+02 1.699e+02 1.853e+02 2.048e+02 2.920e+02, threshold=3.705e+02, percent-clipped=0.0 2023-10-10 11:01:26,850 INFO [train.py:1031] (3/4) Epoch 6, batch 4000, loss[loss=0.2103, simple_loss=0.2998, pruned_loss=0.06038, over 16934.00 frames. ], tot_loss[loss=0.226, simple_loss=0.309, pruned_loss=0.07152, over 28346508.23 frames. ], batch size: 93, lr: 6.40e-03, grad_scale: 32.0 2023-10-10 11:01:59,809 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=337376.6666666667, ans=10.0 2023-10-10 11:02:02,869 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=337423.3333333333, ans=0.2 2023-10-10 11:02:14,384 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=337470.0, ans=0.125 2023-10-10 11:02:19,398 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=337470.0, ans=0.125 2023-10-10 11:02:21,425 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=337516.6666666667, ans=0.0 2023-10-10 11:03:00,704 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.363e+02 1.785e+02 1.962e+02 2.190e+02 3.824e+02, threshold=3.923e+02, percent-clipped=1.0 2023-10-10 11:03:15,070 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=337703.3333333333, ans=0.125 2023-10-10 11:03:25,046 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.83 vs. limit=15.0 2023-10-10 11:03:27,043 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=337796.6666666667, ans=0.2 2023-10-10 11:03:27,050 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=337796.6666666667, ans=0.0 2023-10-10 11:03:36,943 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=337796.6666666667, ans=0.0 2023-10-10 11:03:47,564 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=337843.3333333333, ans=0.125 2023-10-10 11:03:49,765 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 11:03:50,826 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=337890.0, ans=0.0 2023-10-10 11:03:52,964 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.04 vs. 
limit=10.0 2023-10-10 11:04:10,337 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.65 vs. limit=15.0 2023-10-10 11:04:34,984 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=338030.0, ans=0.125 2023-10-10 11:05:03,595 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.457e+02 1.786e+02 2.007e+02 2.280e+02 3.706e+02, threshold=4.013e+02, percent-clipped=0.0 2023-10-10 11:05:26,348 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=338216.6666666667, ans=0.1 2023-10-10 11:05:26,382 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=338216.6666666667, ans=0.125 2023-10-10 11:05:33,323 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=338263.3333333333, ans=0.025 2023-10-10 11:05:40,948 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=338310.0, ans=0.1 2023-10-10 11:06:00,432 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.40 vs. limit=15.0 2023-10-10 11:06:04,334 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=338403.3333333333, ans=0.0 2023-10-10 11:06:04,393 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=338403.3333333333, ans=0.125 2023-10-10 11:06:08,737 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=338403.3333333333, ans=0.125 2023-10-10 11:06:14,077 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-10 11:06:25,393 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=338496.6666666667, ans=0.1 2023-10-10 11:06:31,740 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=338496.6666666667, ans=0.125 2023-10-10 11:06:48,845 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.57 vs. limit=22.5 2023-10-10 11:06:51,202 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.349e+02 1.711e+02 1.838e+02 2.041e+02 2.638e+02, threshold=3.676e+02, percent-clipped=0.0 2023-10-10 11:06:52,355 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=338590.0, ans=0.0 2023-10-10 11:06:54,838 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=338590.0, ans=0.0 2023-10-10 11:06:55,196 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.04 vs. limit=6.0 2023-10-10 11:07:06,695 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.79 vs. 
limit=22.5 2023-10-10 11:07:21,498 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=338730.0, ans=0.0 2023-10-10 11:07:41,305 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=338823.3333333333, ans=0.125 2023-10-10 11:07:54,192 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=338870.0, ans=0.125 2023-10-10 11:08:03,201 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=338916.6666666667, ans=0.125 2023-10-10 11:08:26,518 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=339010.0, ans=0.1 2023-10-10 11:08:34,800 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=339010.0, ans=0.0 2023-10-10 11:08:42,566 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.511e+02 1.859e+02 2.141e+02 2.428e+02 3.490e+02, threshold=4.281e+02, percent-clipped=0.0 2023-10-10 11:08:59,953 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=339150.0, ans=0.07 2023-10-10 11:09:04,096 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=339150.0, ans=0.125 2023-10-10 11:09:15,274 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.01 vs. limit=8.0 2023-10-10 11:10:02,594 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer_ff2.min_abs, batch_count=339336.6666666667, ans=0.1 2023-10-10 11:10:09,840 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=339383.3333333333, ans=0.07 2023-10-10 11:10:15,241 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.91 vs. limit=15.0 2023-10-10 11:10:16,935 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=339430.0, ans=0.125 2023-10-10 11:10:28,384 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=339476.6666666667, ans=0.0 2023-10-10 11:10:43,351 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.408e+02 1.701e+02 1.873e+02 2.154e+02 3.012e+02, threshold=3.746e+02, percent-clipped=0.0 2023-10-10 11:10:59,737 INFO [train.py:1031] (3/4) Epoch 6, batch 4500, loss[loss=0.2219, simple_loss=0.3115, pruned_loss=0.06616, over 16857.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.3092, pruned_loss=0.07117, over 29324679.75 frames. ], batch size: 130, lr: 6.38e-03, grad_scale: 32.0 2023-10-10 11:11:24,688 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=339710.0, ans=0.0 2023-10-10 11:11:26,964 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.25 vs. 
limit=15.0 2023-10-10 11:11:38,915 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=339756.6666666667, ans=0.125 2023-10-10 11:11:46,350 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=339803.3333333333, ans=0.0 2023-10-10 11:12:05,431 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=339896.6666666667, ans=0.2 2023-10-10 11:12:12,539 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=339896.6666666667, ans=0.125 2023-10-10 11:12:22,616 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=339943.3333333333, ans=0.0 2023-10-10 11:12:29,655 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=339990.0, ans=0.05 2023-10-10 11:12:30,824 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.416e+02 1.786e+02 1.964e+02 2.243e+02 3.413e+02, threshold=3.929e+02, percent-clipped=0.0 2023-10-10 11:12:36,195 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=340036.6666666667, ans=0.125 2023-10-10 11:12:41,900 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=340036.6666666667, ans=0.125 2023-10-10 11:12:41,902 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=340036.6666666667, ans=0.125 2023-10-10 11:12:51,229 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=340083.3333333333, ans=0.0 2023-10-10 11:12:57,462 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=340130.0, ans=0.125 2023-10-10 11:12:58,750 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=340130.0, ans=0.125 2023-10-10 11:13:06,384 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=340176.6666666667, ans=0.125 2023-10-10 11:13:22,491 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=340223.3333333333, ans=0.0 2023-10-10 11:13:28,488 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=340270.0, ans=0.125 2023-10-10 11:13:43,004 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.18 vs. limit=15.0 2023-10-10 11:13:45,359 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=340316.6666666667, ans=0.0 2023-10-10 11:13:48,349 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.01 vs. limit=15.0 2023-10-10 11:13:50,104 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.59 vs. 
limit=15.0 2023-10-10 11:14:16,401 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.500e+02 1.817e+02 2.055e+02 2.256e+02 3.453e+02, threshold=4.110e+02, percent-clipped=0.0 2023-10-10 11:14:35,311 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=12.48 vs. limit=22.5 2023-10-10 11:14:53,469 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 11:15:06,878 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=340690.0, ans=0.1 2023-10-10 11:15:18,251 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=340736.6666666667, ans=0.0 2023-10-10 11:15:30,479 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=340783.3333333333, ans=0.125 2023-10-10 11:15:31,374 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=340783.3333333333, ans=0.125 2023-10-10 11:15:51,927 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=340876.6666666667, ans=0.125 2023-10-10 11:15:57,869 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=340923.3333333333, ans=0.125 2023-10-10 11:15:58,490 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.339e+02 1.662e+02 1.830e+02 2.207e+02 3.074e+02, threshold=3.660e+02, percent-clipped=0.0 2023-10-10 11:15:58,688 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=340923.3333333333, ans=0.125 2023-10-10 11:15:58,788 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=340923.3333333333, ans=0.125 2023-10-10 11:16:11,797 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=340970.0, ans=0.0 2023-10-10 11:16:12,795 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=340970.0, ans=0.95 2023-10-10 11:16:28,616 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=341063.3333333333, ans=0.0 2023-10-10 11:16:34,521 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=341063.3333333333, ans=0.125 2023-10-10 11:16:36,495 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=341063.3333333333, ans=0.1 2023-10-10 11:16:38,432 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=341063.3333333333, ans=0.125 2023-10-10 11:16:39,619 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=341063.3333333333, ans=0.2 2023-10-10 11:17:19,572 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.53 vs. 
limit=15.0 2023-10-10 11:17:25,154 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=341250.0, ans=0.1 2023-10-10 11:17:30,267 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=341296.6666666667, ans=0.125 2023-10-10 11:17:45,090 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=341343.3333333333, ans=0.1 2023-10-10 11:17:51,875 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=341390.0, ans=0.125 2023-10-10 11:17:54,493 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.328e+02 1.730e+02 1.914e+02 2.194e+02 3.312e+02, threshold=3.827e+02, percent-clipped=0.0 2023-10-10 11:18:03,585 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.82 vs. limit=10.0 2023-10-10 11:18:29,912 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=341530.0, ans=0.2 2023-10-10 11:18:43,295 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 11:18:53,356 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=341623.3333333333, ans=0.0 2023-10-10 11:19:06,808 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=341670.0, ans=0.2 2023-10-10 11:19:20,272 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.06 vs. limit=15.0 2023-10-10 11:19:33,103 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=341810.0, ans=0.125 2023-10-10 11:19:42,075 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=341856.6666666667, ans=0.125 2023-10-10 11:19:47,548 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.409e+02 1.771e+02 1.959e+02 2.251e+02 3.276e+02, threshold=3.917e+02, percent-clipped=0.0 2023-10-10 11:19:56,497 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=341903.3333333333, ans=15.0 2023-10-10 11:20:00,483 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=341903.3333333333, ans=0.125 2023-10-10 11:20:04,295 INFO [train.py:1031] (3/4) Epoch 6, batch 5000, loss[loss=0.2073, simple_loss=0.2942, pruned_loss=0.06023, over 15337.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3089, pruned_loss=0.0711, over 30103098.00 frames. 
], batch size: 35, lr: 6.36e-03, grad_scale: 32.0 2023-10-10 11:20:33,142 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=342043.3333333333, ans=0.1 2023-10-10 11:20:44,009 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=342090.0, ans=0.07 2023-10-10 11:20:54,886 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=342136.6666666667, ans=0.1 2023-10-10 11:21:16,631 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.17 vs. limit=6.0 2023-10-10 11:21:36,702 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.381e+02 1.848e+02 2.020e+02 2.423e+02 3.330e+02, threshold=4.040e+02, percent-clipped=0.0 2023-10-10 11:21:38,842 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=342323.3333333333, ans=0.0 2023-10-10 11:22:03,885 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.35 vs. limit=15.0 2023-10-10 11:22:22,838 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=342510.0, ans=0.2 2023-10-10 11:22:26,272 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=342556.6666666667, ans=0.0 2023-10-10 11:22:34,571 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=342556.6666666667, ans=0.125 2023-10-10 11:22:43,308 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=342603.3333333333, ans=0.04949747468305833 2023-10-10 11:23:01,554 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=342696.6666666667, ans=0.2 2023-10-10 11:23:13,209 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=342743.3333333333, ans=0.0 2023-10-10 11:23:14,220 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=342743.3333333333, ans=0.2 2023-10-10 11:23:15,639 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.30 vs. limit=15.0 2023-10-10 11:23:16,302 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=342743.3333333333, ans=0.0 2023-10-10 11:23:24,206 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=342790.0, ans=0.0 2023-10-10 11:23:27,374 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.435e+02 1.897e+02 2.113e+02 2.407e+02 3.339e+02, threshold=4.227e+02, percent-clipped=0.0 2023-10-10 11:23:29,156 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=342790.0, ans=0.07 2023-10-10 11:23:32,044 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.61 vs. 
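
In the [train.py:1031] record above, loss[...] is the current batch (its losses averaged over that batch's 15337 frames), while tot_loss[...] is a running frame-weighted average across recent batches; the fractional running frame counts (e.g. "over 30103098.00 frames") suggest an exponentially decayed sum rather than a plain cumulative one, and grad_scale is the automatic mixed-precision loss scale. A sketch of such a tracker, with the decay constant an assumption rather than a value taken from train.py:

    class RunningLoss:
        def __init__(self, decay: float = 0.999):
            self.decay = decay   # assumed decay constant; train.py may differ
            self.frames = 0.0
            self.sums = {}       # loss name -> decayed sum of loss * frames

        def update(self, losses: dict, num_frames: float) -> None:
            self.frames = self.decay * self.frames + num_frames
            for name, value in losses.items():
                self.sums[name] = self.decay * self.sums.get(name, 0.0) + value * num_frames

        def averages(self) -> dict:
            # frame-weighted averages, as printed in the tot_loss[...] fields
            return {name: s / self.frames for name, s in self.sums.items()}
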
limit=10.0 2023-10-10 11:23:41,748 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=342883.3333333333, ans=0.125 2023-10-10 11:23:58,595 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=7.06 vs. limit=15.0 2023-10-10 11:24:06,010 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.02 vs. limit=15.0 2023-10-10 11:24:31,745 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.58 vs. limit=22.5 2023-10-10 11:24:45,921 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.63 vs. limit=15.0 2023-10-10 11:24:56,400 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=343163.3333333333, ans=0.0 2023-10-10 11:25:17,909 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.380e+02 1.688e+02 1.884e+02 2.177e+02 2.925e+02, threshold=3.768e+02, percent-clipped=0.0 2023-10-10 11:25:27,196 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.19 vs. limit=22.5 2023-10-10 11:25:35,883 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.82 vs. limit=6.0 2023-10-10 11:25:46,667 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.80 vs. limit=12.0 2023-10-10 11:25:47,349 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=343396.6666666667, ans=0.125 2023-10-10 11:25:55,227 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.78 vs. limit=12.0 2023-10-10 11:26:03,410 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=343443.3333333333, ans=0.125 2023-10-10 11:26:18,140 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=343490.0, ans=0.125 2023-10-10 11:26:18,981 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-10 11:26:35,367 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.29 vs. 
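
The [scaling.py:979] lines fire when a Whiten module measures its activations as less "white" than allowed: the metric gauges how uneven the eigenvalue spectrum of the channel covariance is (1.0 for a perfectly isotropic covariance), computed per group of channels (num_groups), and a corrective penalty is applied whenever metric exceeds the scheduled limit. The related [scaling.py:1069] WithLoss lines report an auxiliary penalty attached to the attention weights; loss-sum=0.000e+00 means that penalty is currently contributing nothing. One plausible formulation of the whiteness metric, illustrative rather than the exact scaling.py formula:

    import torch

    def whitening_metric(x: torch.Tensor) -> torch.Tensor:
        # x: (num_frames, num_channels). Returns d * tr(C^2) / tr(C)^2 for the
        # channel covariance C; this equals 1.0 when C is a multiple of the
        # identity and grows as the spectrum becomes more lopsided.
        x = x - x.mean(dim=0, keepdim=True)
        cov = (x.t() @ x) / x.shape[0]
        d = cov.shape[0]
        return d * torch.trace(cov @ cov) / (torch.trace(cov) ** 2 + 1e-20)
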
limit=15.0 2023-10-10 11:26:41,418 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=343630.0, ans=0.0 2023-10-10 11:26:46,992 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=343630.0, ans=0.0 2023-10-10 11:27:08,333 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.395e+02 1.860e+02 2.178e+02 2.540e+02 4.355e+02, threshold=4.355e+02, percent-clipped=1.0 2023-10-10 11:27:14,165 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=343770.0, ans=0.125 2023-10-10 11:27:15,934 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=343770.0, ans=0.1 2023-10-10 11:27:19,427 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.74 vs. limit=15.0 2023-10-10 11:27:25,728 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=343816.6666666667, ans=0.0 2023-10-10 11:27:34,043 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer_ff3.min_abs, batch_count=343863.3333333333, ans=0.2 2023-10-10 11:27:37,861 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.63 vs. limit=22.5 2023-10-10 11:27:45,610 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=343910.0, ans=0.125 2023-10-10 11:28:05,811 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff2.min_abs, batch_count=344003.3333333333, ans=0.1 2023-10-10 11:28:13,646 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=344003.3333333333, ans=0.125 2023-10-10 11:28:17,371 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=344050.0, ans=0.125 2023-10-10 11:28:17,879 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.45 vs. limit=6.0 2023-10-10 11:28:21,854 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=344050.0, ans=0.125 2023-10-10 11:28:22,842 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.48 vs. limit=12.0 2023-10-10 11:28:24,736 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=344050.0, ans=0.125 2023-10-10 11:28:26,533 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=344096.6666666667, ans=0.0 2023-10-10 11:28:53,467 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.385e+02 1.709e+02 1.830e+02 1.991e+02 2.724e+02, threshold=3.659e+02, percent-clipped=0.0 2023-10-10 11:29:08,327 INFO [train.py:1031] (3/4) Epoch 6, batch 5500, loss[loss=0.2213, simple_loss=0.3113, pruned_loss=0.06568, over 16897.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.3087, pruned_loss=0.07076, over 30724052.37 frames. 
], batch size: 116, lr: 6.34e-03, grad_scale: 32.0 2023-10-10 11:29:54,336 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=344470.0, ans=0.0 2023-10-10 11:29:57,330 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 11:30:03,956 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=344516.6666666667, ans=0.125 2023-10-10 11:30:05,761 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=344516.6666666667, ans=0.0 2023-10-10 11:30:40,286 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.375e+02 1.759e+02 1.903e+02 2.067e+02 2.738e+02, threshold=3.807e+02, percent-clipped=0.0 2023-10-10 11:30:52,986 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=344703.3333333333, ans=0.0 2023-10-10 11:31:00,030 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=344750.0, ans=0.1 2023-10-10 11:31:00,383 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.13 vs. limit=22.5 2023-10-10 11:31:02,078 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=344750.0, ans=0.2 2023-10-10 11:31:09,032 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=344796.6666666667, ans=0.0 2023-10-10 11:31:09,054 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=344796.6666666667, ans=0.125 2023-10-10 11:31:13,995 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=344796.6666666667, ans=0.0 2023-10-10 11:31:20,853 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.18 vs. limit=15.0 2023-10-10 11:31:29,288 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=344890.0, ans=0.2 2023-10-10 11:31:34,641 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=344890.0, ans=0.125 2023-10-10 11:32:09,392 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=345030.0, ans=10.0 2023-10-10 11:32:11,885 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.09 vs. limit=15.0 2023-10-10 11:32:13,797 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=345076.6666666667, ans=0.07 2023-10-10 11:32:13,954 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.86 vs. 
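
The ubiquitous [scaling.py:199] ScheduledFloat lines report module hyperparameters (dropout probabilities, balancer probs, bypass scale/skip rates, whitening limits) whose values follow a schedule driven by batch_count; each line prints the parameter name, the batch count at which it was queried, and the resolved value (ans=...). A sketch of a piecewise-linear schedule of this kind, with the breakpoints purely illustrative:

    def scheduled_float(batch_count: float, points: list) -> float:
        # points: sorted (batch_count, value) pairs; linear between breakpoints,
        # clamped to the end values outside them.
        b_prev, v_prev = points[0]
        if batch_count <= b_prev:
            return v_prev
        for b, v in points[1:]:
            if batch_count <= b:
                t = (batch_count - b_prev) / (b - b_prev)
                return v_prev + t * (v - v_prev)
            b_prev, v_prev = b, v
        return v_prev

    # e.g. a dropout annealed from 0.3 to 0.1 over the first 20k batches is
    # long since clamped at 0.1 by this point in training:
    scheduled_float(340690.0, [(0.0, 0.3), (20000.0, 0.1)])  # -> 0.1
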
limit=6.0 2023-10-10 11:32:25,216 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=345123.3333333333, ans=0.0 2023-10-10 11:32:32,806 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.417e+02 1.705e+02 1.912e+02 2.140e+02 3.023e+02, threshold=3.824e+02, percent-clipped=0.0 2023-10-10 11:32:36,624 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=345170.0, ans=0.0 2023-10-10 11:33:10,192 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=345263.3333333333, ans=0.1 2023-10-10 11:33:25,746 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=345356.6666666667, ans=0.125 2023-10-10 11:33:27,570 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=345356.6666666667, ans=0.0 2023-10-10 11:33:27,739 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.79 vs. limit=22.5 2023-10-10 11:33:29,361 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=345356.6666666667, ans=0.125 2023-10-10 11:33:43,413 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=345403.3333333333, ans=0.125 2023-10-10 11:33:45,595 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.99 vs. limit=22.5 2023-10-10 11:33:49,812 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=345450.0, ans=0.2 2023-10-10 11:33:52,486 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=345450.0, ans=0.2 2023-10-10 11:34:00,994 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=345496.6666666667, ans=0.04949747468305833 2023-10-10 11:34:05,128 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=345496.6666666667, ans=0.125 2023-10-10 11:34:06,289 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=345496.6666666667, ans=0.125 2023-10-10 11:34:12,324 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=345543.3333333333, ans=0.0 2023-10-10 11:34:28,298 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.385e+02 1.681e+02 1.859e+02 2.178e+02 3.343e+02, threshold=3.717e+02, percent-clipped=0.0 2023-10-10 11:34:31,486 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_ff3.min_abs, batch_count=345636.6666666667, ans=0.2 2023-10-10 11:34:40,758 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=345636.6666666667, ans=0.2 2023-10-10 11:34:43,694 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=345683.3333333333, ans=0.125 2023-10-10 11:34:49,804 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, 
batch_count=345683.3333333333, ans=0.125 2023-10-10 11:34:57,272 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=345730.0, ans=0.2 2023-10-10 11:35:23,552 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=13.15 vs. limit=15.0 2023-10-10 11:35:34,947 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.18 vs. limit=22.5 2023-10-10 11:35:44,837 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.85 vs. limit=15.0 2023-10-10 11:35:45,456 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=345916.6666666667, ans=0.125 2023-10-10 11:35:50,863 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.63 vs. limit=15.0 2023-10-10 11:36:00,961 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=346010.0, ans=0.0 2023-10-10 11:36:11,657 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=346056.6666666667, ans=0.125 2023-10-10 11:36:13,515 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=346056.6666666667, ans=0.125 2023-10-10 11:36:19,006 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.346e+02 1.688e+02 1.854e+02 2.078e+02 3.197e+02, threshold=3.708e+02, percent-clipped=0.0 2023-10-10 11:36:25,739 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=16.06 vs. limit=15.0 2023-10-10 11:36:57,322 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.93 vs. limit=15.0 2023-10-10 11:37:04,289 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=346243.3333333333, ans=0.1 2023-10-10 11:37:16,144 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=346290.0, ans=0.125 2023-10-10 11:37:19,894 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=346290.0, ans=0.09899494936611666 2023-10-10 11:37:27,971 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.60 vs. 
limit=10.0 2023-10-10 11:37:58,016 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=346476.6666666667, ans=0.1 2023-10-10 11:38:09,435 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.273e+02 1.681e+02 1.803e+02 2.113e+02 3.282e+02, threshold=3.606e+02, percent-clipped=0.0 2023-10-10 11:38:17,839 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=346570.0, ans=0.125 2023-10-10 11:38:25,157 INFO [train.py:1031] (3/4) Epoch 6, batch 6000, loss[loss=0.2395, simple_loss=0.323, pruned_loss=0.07805, over 16882.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.3088, pruned_loss=0.07094, over 31178605.60 frames. ], batch size: 67, lr: 6.32e-03, grad_scale: 32.0 2023-10-10 11:38:42,565 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=346663.3333333333, ans=0.0 2023-10-10 11:38:45,343 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=346663.3333333333, ans=0.1 2023-10-10 11:38:54,256 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=346710.0, ans=0.125 2023-10-10 11:39:27,128 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=346850.0, ans=0.125 2023-10-10 11:39:28,498 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=346850.0, ans=0.2 2023-10-10 11:39:34,740 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=346896.6666666667, ans=0.125 2023-10-10 11:39:59,315 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.356e+02 1.748e+02 1.895e+02 2.344e+02 4.029e+02, threshold=3.790e+02, percent-clipped=1.0 2023-10-10 11:40:18,182 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=347083.3333333333, ans=0.0 2023-10-10 11:41:20,284 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=347363.3333333333, ans=0.5 2023-10-10 11:41:32,986 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=347410.0, ans=0.025 2023-10-10 11:41:34,065 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=347410.0, ans=0.1 2023-10-10 11:41:50,741 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.470e+02 1.749e+02 1.951e+02 2.157e+02 3.026e+02, threshold=3.902e+02, percent-clipped=0.0 2023-10-10 11:41:55,899 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=347503.3333333333, ans=0.2 2023-10-10 11:42:07,307 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=347550.0, ans=0.125 2023-10-10 11:42:09,901 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=347550.0, ans=0.1 2023-10-10 11:42:11,073 INFO [scaling.py:979] (3/4) Whitening: 
name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.38 vs. limit=15.0 2023-10-10 11:42:11,947 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=347550.0, ans=0.1 2023-10-10 11:42:49,352 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=347736.6666666667, ans=0.1 2023-10-10 11:43:15,260 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=347830.0, ans=0.0 2023-10-10 11:43:16,178 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=347830.0, ans=0.125 2023-10-10 11:43:41,640 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.439e+02 1.819e+02 1.996e+02 2.270e+02 2.742e+02, threshold=3.992e+02, percent-clipped=0.0 2023-10-10 11:43:44,251 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=347923.3333333333, ans=0.125 2023-10-10 11:43:53,552 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=347970.0, ans=0.0 2023-10-10 11:44:10,475 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=348063.3333333333, ans=0.125 2023-10-10 11:44:13,951 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=348063.3333333333, ans=0.2 2023-10-10 11:44:17,532 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=348110.0, ans=0.0 2023-10-10 11:44:30,625 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.68 vs. limit=22.5 2023-10-10 11:44:47,571 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 11:44:54,342 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.29 vs. limit=15.0 2023-10-10 11:44:55,248 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.61 vs. limit=15.0 2023-10-10 11:45:00,273 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=348250.0, ans=0.1 2023-10-10 11:45:07,866 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.88 vs. limit=12.0 2023-10-10 11:45:22,543 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=348343.3333333333, ans=0.125 2023-10-10 11:45:24,861 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=348343.3333333333, ans=0.1 2023-10-10 11:45:39,728 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.00 vs. 
limit=15.0 2023-10-10 11:45:43,231 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.349e+02 1.781e+02 1.963e+02 2.216e+02 3.647e+02, threshold=3.927e+02, percent-clipped=0.0 2023-10-10 11:45:55,324 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=348436.6666666667, ans=0.2 2023-10-10 11:46:01,048 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=348483.3333333333, ans=0.0 2023-10-10 11:46:02,966 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=348483.3333333333, ans=0.125 2023-10-10 11:46:20,793 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=348576.6666666667, ans=0.1 2023-10-10 11:46:29,886 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=348623.3333333333, ans=0.125 2023-10-10 11:46:30,916 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=348623.3333333333, ans=0.0 2023-10-10 11:46:30,985 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=348623.3333333333, ans=0.125 2023-10-10 11:46:52,959 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.34 vs. limit=12.0 2023-10-10 11:47:03,656 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=348763.3333333333, ans=0.125 2023-10-10 11:47:08,166 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=348763.3333333333, ans=0.125 2023-10-10 11:47:33,788 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.350e+02 1.754e+02 1.994e+02 2.288e+02 3.082e+02, threshold=3.988e+02, percent-clipped=0.0 2023-10-10 11:47:39,425 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=348903.3333333333, ans=0.125 2023-10-10 11:47:46,356 INFO [train.py:1031] (3/4) Epoch 6, batch 6500, loss[loss=0.236, simple_loss=0.3219, pruned_loss=0.07506, over 16826.00 frames. ], tot_loss[loss=0.226, simple_loss=0.3095, pruned_loss=0.07129, over 31545158.41 frames. ], batch size: 146, lr: 6.30e-03, grad_scale: 32.0 2023-10-10 11:48:08,012 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=348996.6666666667, ans=0.1 2023-10-10 11:48:12,695 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=349043.3333333333, ans=0.1 2023-10-10 11:48:22,516 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.53 vs. 
limit=15.0 2023-10-10 11:48:27,693 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=349090.0, ans=0.125 2023-10-10 11:48:44,036 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=349136.6666666667, ans=0.1 2023-10-10 11:48:46,354 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=349136.6666666667, ans=0.125 2023-10-10 11:49:01,735 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=349183.3333333333, ans=0.125 2023-10-10 11:49:05,599 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=349230.0, ans=0.1 2023-10-10 11:49:21,144 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=349276.6666666667, ans=0.125 2023-10-10 11:49:26,888 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.41 vs. limit=12.0 2023-10-10 11:49:29,983 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=349323.3333333333, ans=0.07 2023-10-10 11:49:36,153 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.393e+02 1.685e+02 1.952e+02 2.193e+02 3.527e+02, threshold=3.903e+02, percent-clipped=0.0 2023-10-10 11:50:09,653 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=349463.3333333333, ans=0.0 2023-10-10 11:50:21,863 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=349510.0, ans=0.125 2023-10-10 11:50:23,236 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=349556.6666666667, ans=0.2 2023-10-10 11:50:29,027 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 11:50:33,644 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=349603.3333333333, ans=0.05 2023-10-10 11:50:33,981 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.68 vs. limit=22.5 2023-10-10 11:50:45,280 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=349650.0, ans=0.1 2023-10-10 11:50:53,792 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=9.96 vs. 
limit=22.5 2023-10-10 11:51:01,103 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=349696.6666666667, ans=0.125 2023-10-10 11:51:05,079 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=349696.6666666667, ans=0.0 2023-10-10 11:51:24,659 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.387e+02 1.782e+02 2.037e+02 2.284e+02 3.707e+02, threshold=4.074e+02, percent-clipped=0.0 2023-10-10 11:51:33,813 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 11:52:00,946 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=349976.6666666667, ans=0.1 2023-10-10 11:52:09,933 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=349976.6666666667, ans=0.125 2023-10-10 11:52:16,862 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=350023.3333333333, ans=0.125 2023-10-10 11:53:17,744 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.320e+02 1.609e+02 1.756e+02 1.958e+02 2.766e+02, threshold=3.512e+02, percent-clipped=0.0 2023-10-10 11:53:22,522 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=350303.3333333333, ans=0.1 2023-10-10 11:53:27,704 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.96 vs. limit=15.0 2023-10-10 11:53:30,100 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=350303.3333333333, ans=0.0 2023-10-10 11:53:41,479 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=350350.0, ans=0.1 2023-10-10 11:53:53,271 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=350396.6666666667, ans=0.125 2023-10-10 11:54:09,859 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=350443.3333333333, ans=0.125 2023-10-10 11:54:37,876 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=350536.6666666667, ans=0.1 2023-10-10 11:54:38,784 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=350536.6666666667, ans=0.0 2023-10-10 11:54:57,163 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=350630.0, ans=0.1 2023-10-10 11:55:17,179 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=350676.6666666667, ans=0.0 2023-10-10 11:55:22,292 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.61 vs. 
limit=15.0 2023-10-10 11:55:22,977 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=350723.3333333333, ans=0.125 2023-10-10 11:55:23,280 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=23.27 vs. limit=22.5 2023-10-10 11:55:29,791 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.384e+02 1.720e+02 1.935e+02 2.255e+02 3.923e+02, threshold=3.871e+02, percent-clipped=2.0 2023-10-10 11:55:37,014 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.05 vs. limit=22.5 2023-10-10 11:55:45,440 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=350816.6666666667, ans=0.2 2023-10-10 11:55:49,800 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=350816.6666666667, ans=0.125 2023-10-10 11:55:50,631 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=350816.6666666667, ans=0.04949747468305833 2023-10-10 11:55:50,952 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=6.98 vs. limit=15.0 2023-10-10 11:56:03,567 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=350863.3333333333, ans=0.2 2023-10-10 11:56:20,799 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=350956.6666666667, ans=0.035 2023-10-10 11:56:23,281 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=350956.6666666667, ans=0.125 2023-10-10 11:56:28,658 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=351003.3333333333, ans=0.125 2023-10-10 11:56:42,343 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=351050.0, ans=0.0 2023-10-10 11:56:45,213 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=351050.0, ans=0.0 2023-10-10 11:56:46,396 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=351050.0, ans=0.125 2023-10-10 11:56:52,580 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=351096.6666666667, ans=0.1 2023-10-10 11:57:14,301 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.404e+02 1.731e+02 2.030e+02 2.454e+02 3.429e+02, threshold=4.059e+02, percent-clipped=0.0 2023-10-10 11:57:20,121 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=351236.6666666667, ans=0.0 2023-10-10 11:57:22,086 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=351236.6666666667, ans=0.95 2023-10-10 11:57:26,897 INFO [train.py:1031] (3/4) Epoch 6, batch 7000, loss[loss=0.233, simple_loss=0.3181, pruned_loss=0.07398, over 16606.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3097, pruned_loss=0.07104, over 31853845.11 frames. 
], batch size: 61, lr: 6.27e-03, grad_scale: 32.0 2023-10-10 11:57:51,263 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=351376.6666666667, ans=0.125 2023-10-10 11:57:54,409 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=351376.6666666667, ans=0.0 2023-10-10 11:57:59,064 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=351376.6666666667, ans=0.125 2023-10-10 11:58:01,921 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=351376.6666666667, ans=0.1 2023-10-10 11:58:08,978 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.91 vs. limit=15.0 2023-10-10 11:58:28,352 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=351516.6666666667, ans=0.0 2023-10-10 11:58:33,898 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 11:58:36,907 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.27 vs. limit=15.0 2023-10-10 11:58:46,627 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.77 vs. limit=15.0 2023-10-10 11:59:01,656 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.455e+02 1.813e+02 1.991e+02 2.272e+02 3.301e+02, threshold=3.982e+02, percent-clipped=0.0 2023-10-10 11:59:08,619 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=351703.3333333333, ans=0.125 2023-10-10 11:59:11,887 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.40 vs. limit=22.5 2023-10-10 11:59:12,745 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.27 vs. limit=15.0 2023-10-10 11:59:14,375 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 11:59:23,437 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=351750.0, ans=0.0 2023-10-10 12:00:04,271 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=351936.6666666667, ans=0.0 2023-10-10 12:00:09,536 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=351936.6666666667, ans=0.0 2023-10-10 12:00:18,230 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.85 vs. limit=6.0 2023-10-10 12:00:52,902 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.482e+02 1.774e+02 1.945e+02 2.181e+02 3.277e+02, threshold=3.890e+02, percent-clipped=0.0 2023-10-10 12:01:29,279 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.12 vs. 
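
Each of these epoch/batch records also logs the current learning rate, which is decaying very slowly at this stage (6.36e-03 at batch 5000 of epoch 6 down to 6.27e-03 at batch 7000). That behaviour is consistent with icefall's Eden schedule, which discounts the base learning rate by smooth batch- and epoch-dependent factors; a sketch with illustrative parameter values, not the ones from this run's configuration:

    def eden_lr(base_lr: float, batch: float, epoch: float,
                lr_batches: float = 7500.0, lr_epochs: float = 10.0) -> float:
        # Eden: inverse-quartic-root decay in both batch and epoch count.
        batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
        epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
        return base_lr * batch_factor * epoch_factor
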
limit=15.0 2023-10-10 12:02:13,055 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=352403.3333333333, ans=0.125 2023-10-10 12:02:15,168 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=352403.3333333333, ans=0.035 2023-10-10 12:02:34,838 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=352496.6666666667, ans=0.125 2023-10-10 12:02:50,450 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=352543.3333333333, ans=0.125 2023-10-10 12:02:54,102 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=352590.0, ans=0.1 2023-10-10 12:03:02,375 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.345e+02 1.675e+02 1.848e+02 2.068e+02 2.833e+02, threshold=3.695e+02, percent-clipped=0.0 2023-10-10 12:03:06,300 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=352636.6666666667, ans=0.2 2023-10-10 12:03:11,573 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.60 vs. limit=6.0 2023-10-10 12:03:18,123 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=352683.3333333333, ans=10.0 2023-10-10 12:03:18,436 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.15 vs. limit=15.0 2023-10-10 12:03:47,172 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.11 vs. limit=15.0 2023-10-10 12:03:52,542 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=352823.3333333333, ans=0.125 2023-10-10 12:04:00,250 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=352823.3333333333, ans=0.2 2023-10-10 12:04:13,142 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.28 vs. limit=22.5 2023-10-10 12:04:19,399 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=352916.6666666667, ans=0.125 2023-10-10 12:04:25,823 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=352963.3333333333, ans=0.2 2023-10-10 12:04:28,078 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=352963.3333333333, ans=0.1 2023-10-10 12:04:40,597 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=353010.0, ans=0.2 2023-10-10 12:04:46,263 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=25.05 vs. 
limit=22.5 2023-10-10 12:04:49,966 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=353056.6666666667, ans=0.125 2023-10-10 12:04:55,744 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.274e+02 1.677e+02 1.893e+02 2.216e+02 2.791e+02, threshold=3.786e+02, percent-clipped=0.0 2023-10-10 12:05:12,624 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=353150.0, ans=0.0 2023-10-10 12:05:28,496 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=353196.6666666667, ans=0.125 2023-10-10 12:05:28,581 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=353196.6666666667, ans=0.2 2023-10-10 12:05:29,423 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=353196.6666666667, ans=0.125 2023-10-10 12:05:44,520 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=353290.0, ans=0.1 2023-10-10 12:05:48,840 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=353290.0, ans=0.125 2023-10-10 12:05:50,663 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.52 vs. limit=12.0 2023-10-10 12:05:51,649 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.85 vs. limit=6.0 2023-10-10 12:05:54,092 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=353336.6666666667, ans=0.125 2023-10-10 12:06:07,385 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=353383.3333333333, ans=0.1 2023-10-10 12:06:22,395 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=353476.6666666667, ans=0.0 2023-10-10 12:06:34,563 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=353523.3333333333, ans=0.125 2023-10-10 12:06:35,464 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=353523.3333333333, ans=0.125 2023-10-10 12:06:36,407 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=353523.3333333333, ans=0.125 2023-10-10 12:06:37,706 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.42 vs. limit=8.0 2023-10-10 12:06:43,202 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.514e+02 1.755e+02 1.923e+02 2.122e+02 3.396e+02, threshold=3.846e+02, percent-clipped=0.0 2023-10-10 12:06:58,845 INFO [train.py:1031] (3/4) Epoch 6, batch 7500, loss[loss=0.2161, simple_loss=0.3028, pruned_loss=0.0647, over 16898.00 frames. ], tot_loss[loss=0.226, simple_loss=0.3095, pruned_loss=0.07121, over 32035331.83 frames. 
], batch size: 77, lr: 6.25e-03, grad_scale: 16.0 2023-10-10 12:07:15,154 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=353663.3333333333, ans=0.09899494936611666 2023-10-10 12:07:27,307 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=353710.0, ans=0.5 2023-10-10 12:07:29,359 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=353710.0, ans=0.0 2023-10-10 12:07:31,977 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=353756.6666666667, ans=0.0 2023-10-10 12:07:34,086 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.55 vs. limit=6.0 2023-10-10 12:07:45,020 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=353803.3333333333, ans=0.125 2023-10-10 12:07:46,988 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.whiten.whitening_limit, batch_count=353803.3333333333, ans=12.0 2023-10-10 12:07:53,690 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.52 vs. limit=15.0 2023-10-10 12:07:55,627 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=353850.0, ans=0.125 2023-10-10 12:08:17,624 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=353943.3333333333, ans=0.2 2023-10-10 12:08:23,830 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=353943.3333333333, ans=0.0 2023-10-10 12:08:35,574 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=353990.0, ans=0.2 2023-10-10 12:08:36,945 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.398e+02 1.704e+02 1.940e+02 2.320e+02 4.347e+02, threshold=3.881e+02, percent-clipped=3.0 2023-10-10 12:08:42,620 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=354036.6666666667, ans=0.0 2023-10-10 12:09:01,634 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=354130.0, ans=0.125 2023-10-10 12:09:29,833 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=354223.3333333333, ans=0.0 2023-10-10 12:09:39,262 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=354223.3333333333, ans=0.0 2023-10-10 12:10:01,572 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=354316.6666666667, ans=0.125 2023-10-10 12:10:22,075 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=354410.0, ans=0.0 2023-10-10 12:10:28,892 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=354456.6666666667, ans=0.125 2023-10-10 12:10:35,874 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=354456.6666666667, ans=0.125
2023-10-10 12:10:36,725 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=354456.6666666667, ans=0.0
2023-10-10 12:10:38,694 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.394e+02 1.701e+02 1.962e+02 2.334e+02 3.750e+02, threshold=3.924e+02, percent-clipped=0.0
2023-10-10 12:10:57,334 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=354550.0, ans=0.0
2023-10-10 12:10:59,210 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=354550.0, ans=0.125
2023-10-10 12:11:01,980 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.92 vs. limit=15.0
2023-10-10 12:11:04,597 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.04 vs. limit=12.0
2023-10-10 12:11:10,109 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=354596.6666666667, ans=0.125
2023-10-10 12:11:30,455 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=354690.0, ans=0.0
2023-10-10 12:11:42,063 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.50 vs. limit=15.0
2023-10-10 12:11:44,387 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=354736.6666666667, ans=0.0
2023-10-10 12:11:46,405 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.56 vs. limit=15.0
2023-10-10 12:11:50,791 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.65 vs. limit=22.5
2023-10-10 12:12:14,607 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.60 vs. limit=15.0
2023-10-10 12:12:21,114 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=354923.3333333333, ans=0.125
2023-10-10 12:12:25,755 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=354923.3333333333, ans=0.125
2023-10-10 12:12:27,349 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.367e+02 1.754e+02 1.899e+02 2.102e+02 2.940e+02, threshold=3.798e+02, percent-clipped=0.0
2023-10-10 12:12:31,147 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.21 vs. limit=15.0
2023-10-10 12:12:44,593 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=355016.6666666667, ans=0.125
2023-10-10 12:13:40,797 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.36 vs. limit=15.0
2023-10-10 12:14:14,586 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=355390.0, ans=0.1
2023-10-10 12:14:22,280 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.346e+02 1.791e+02 2.064e+02 2.304e+02 3.376e+02, threshold=4.127e+02, percent-clipped=0.0
2023-10-10 12:14:34,244 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=355436.6666666667, ans=0.125
2023-10-10 12:14:37,145 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.24 vs. limit=15.0
2023-10-10 12:14:41,052 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=355483.3333333333, ans=0.1
2023-10-10 12:14:58,761 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=355530.0, ans=0.125
2023-10-10 12:15:22,296 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.44 vs. limit=15.0
2023-10-10 12:15:36,155 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=355716.6666666667, ans=0.0
2023-10-10 12:16:08,314 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=355810.0, ans=0.125
2023-10-10 12:16:19,175 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.326e+02 1.655e+02 1.902e+02 2.362e+02 3.607e+02, threshold=3.803e+02, percent-clipped=0.0
2023-10-10 12:16:28,425 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.04 vs. limit=6.0
2023-10-10 12:16:32,297 INFO [train.py:1031] (3/4) Epoch 6, batch 8000, loss[loss=0.2222, simple_loss=0.31, pruned_loss=0.06724, over 16957.00 frames. ], tot_loss[loss=0.225, simple_loss=0.3087, pruned_loss=0.07067, over 32189732.02 frames. ], batch size: 110, lr: 6.23e-03, grad_scale: 32.0
2023-10-10 12:16:37,005 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=355950.0, ans=0.125
2023-10-10 12:16:56,835 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.51 vs. limit=22.5
2023-10-10 12:17:05,940 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=356090.0, ans=0.0
2023-10-10 12:17:11,876 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=356090.0, ans=0.0
2023-10-10 12:17:18,304 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.37 vs. limit=10.0
2023-10-10 12:17:19,765 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=356136.6666666667, ans=0.125
2023-10-10 12:17:24,042 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=356136.6666666667, ans=0.0
2023-10-10 12:17:29,145 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.71 vs. limit=15.0
2023-10-10 12:17:30,336 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=356183.3333333333, ans=0.0
2023-10-10 12:17:36,447 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=356230.0, ans=0.125
2023-10-10 12:17:40,833 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=356230.0, ans=0.125
2023-10-10 12:17:41,793 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=356230.0, ans=0.125
2023-10-10 12:17:42,105 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.45 vs. limit=15.0
2023-10-10 12:17:42,756 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=356230.0, ans=0.0
2023-10-10 12:18:06,243 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.341e+02 1.694e+02 1.912e+02 2.221e+02 3.501e+02, threshold=3.825e+02, percent-clipped=0.0
2023-10-10 12:18:10,661 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=356370.0, ans=0.0
2023-10-10 12:18:17,943 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=356416.6666666667, ans=0.125
2023-10-10 12:18:18,862 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=356416.6666666667, ans=0.5
2023-10-10 12:18:19,944 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=356416.6666666667, ans=0.125
2023-10-10 12:18:52,282 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=356556.6666666667, ans=0.125
2023-10-10 12:18:56,813 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=356556.6666666667, ans=0.1
2023-10-10 12:19:00,895 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.36 vs. limit=12.0
2023-10-10 12:19:25,152 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=356650.0, ans=0.125
2023-10-10 12:19:25,178 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=356650.0, ans=0.2
2023-10-10 12:19:42,671 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=356696.6666666667, ans=0.125
2023-10-10 12:19:50,823 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=356743.3333333333, ans=0.09899494936611666
2023-10-10 12:19:52,644 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=356743.3333333333, ans=0.0
2023-10-10 12:20:00,292 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.03 vs. limit=22.5
2023-10-10 12:20:10,606 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.359e+02 1.662e+02 1.858e+02 2.088e+02 3.872e+02, threshold=3.716e+02, percent-clipped=1.0
2023-10-10 12:20:20,163 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=356836.6666666667, ans=0.125
2023-10-10 12:20:25,881 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=356883.3333333333, ans=0.0
2023-10-10 12:20:26,949 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=356883.3333333333, ans=0.04949747468305833
2023-10-10 12:20:47,103 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=356976.6666666667, ans=0.2
2023-10-10 12:20:51,402 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=356976.6666666667, ans=0.125
2023-10-10 12:20:51,411 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=356976.6666666667, ans=0.1
2023-10-10 12:20:53,594 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=356976.6666666667, ans=0.125
2023-10-10 12:21:19,187 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=357116.6666666667, ans=0.125
2023-10-10 12:21:19,270 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=357116.6666666667, ans=0.0
2023-10-10 12:21:28,890 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=357163.3333333333, ans=0.04949747468305833
2023-10-10 12:21:35,504 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=357163.3333333333, ans=0.125
2023-10-10 12:21:39,378 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=357210.0, ans=0.1
2023-10-10 12:21:55,302 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=357256.6666666667, ans=0.1
2023-10-10 12:21:58,376 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.271e+02 1.775e+02 2.039e+02 2.521e+02 4.705e+02, threshold=4.079e+02, percent-clipped=3.0
2023-10-10 12:22:30,002 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=357396.6666666667, ans=0.125
2023-10-10 12:22:34,492 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=357396.6666666667, ans=0.0
2023-10-10 12:22:36,041 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.17 vs. limit=22.5
2023-10-10 12:22:49,431 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.34 vs. limit=12.0
2023-10-10 12:23:11,696 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=357583.3333333333, ans=0.0
2023-10-10 12:23:18,113 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.51 vs. limit=22.5
2023-10-10 12:23:29,896 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=357630.0, ans=0.125
2023-10-10 12:23:31,050 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.13 vs. limit=12.0
2023-10-10 12:23:40,728 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=357676.6666666667, ans=0.0
2023-10-10 12:23:41,593 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=357676.6666666667, ans=0.125
2023-10-10 12:23:48,190 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=357723.3333333333, ans=0.125
2023-10-10 12:23:53,932 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.430e+02 1.805e+02 2.040e+02 2.368e+02 3.381e+02, threshold=4.080e+02, percent-clipped=0.0
2023-10-10 12:23:54,186 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=357723.3333333333, ans=0.125
2023-10-10 12:24:03,864 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=357770.0, ans=0.05
2023-10-10 12:24:41,995 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=357910.0, ans=0.125
2023-10-10 12:24:50,293 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=357956.6666666667, ans=0.0
2023-10-10 12:24:59,530 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=358003.3333333333, ans=0.0
2023-10-10 12:25:20,474 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=358096.6666666667, ans=0.0
2023-10-10 12:25:33,387 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=358143.3333333333, ans=0.125
2023-10-10 12:25:35,711 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=358143.3333333333, ans=0.07
2023-10-10 12:25:42,844 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.25 vs. limit=12.0
2023-10-10 12:25:51,023 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=358190.0, ans=0.125
2023-10-10 12:25:51,685 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.314e+02 1.670e+02 1.880e+02 2.066e+02 3.321e+02, threshold=3.759e+02, percent-clipped=0.0
2023-10-10 12:26:05,252 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=358236.6666666667, ans=0.0
2023-10-10 12:26:05,530 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.84 vs. limit=15.0
2023-10-10 12:26:07,617 INFO [train.py:1031] (3/4) Epoch 6, batch 8500, loss[loss=0.2356, simple_loss=0.3214, pruned_loss=0.07485, over 16959.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.3087, pruned_loss=0.07043, over 32304004.22 frames. ], batch size: 156, lr: 6.21e-03, grad_scale: 32.0
2023-10-10 12:26:31,503 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.19 vs. limit=15.0
2023-10-10 12:27:05,992 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.83 vs. limit=6.0
2023-10-10 12:27:10,101 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=358516.6666666667, ans=0.1
2023-10-10 12:27:44,934 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.13 vs. limit=10.0
2023-10-10 12:27:45,243 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.480e+02 1.857e+02 2.075e+02 2.345e+02 3.260e+02, threshold=4.150e+02, percent-clipped=0.0
2023-10-10 12:28:15,921 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=358750.0, ans=0.125
2023-10-10 12:28:17,752 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=358796.6666666667, ans=0.2
2023-10-10 12:28:21,014 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=358796.6666666667, ans=0.125
2023-10-10 12:28:38,426 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=358890.0, ans=0.125
2023-10-10 12:29:02,619 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=358983.3333333333, ans=0.0
2023-10-10 12:29:03,436 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=358983.3333333333, ans=0.125
2023-10-10 12:29:03,763 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.20 vs. limit=15.0
2023-10-10 12:29:12,111 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.46 vs. limit=15.0
2023-10-10 12:29:33,622 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.81 vs. limit=10.0
2023-10-10 12:29:48,449 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.338e+02 1.662e+02 1.861e+02 2.157e+02 3.147e+02, threshold=3.723e+02, percent-clipped=0.0
2023-10-10 12:29:50,243 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=359170.0, ans=0.125
2023-10-10 12:30:19,120 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=359263.3333333333, ans=0.125
2023-10-10 12:30:21,326 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.86 vs. limit=15.0
2023-10-10 12:30:43,446 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=359356.6666666667, ans=0.04949747468305833
2023-10-10 12:30:54,066 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=359403.3333333333, ans=0.0
2023-10-10 12:31:00,031 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=359403.3333333333, ans=0.125
2023-10-10 12:31:31,487 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=359543.3333333333, ans=0.5
2023-10-10 12:31:38,535 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=359543.3333333333, ans=0.125
2023-10-10 12:31:41,884 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.05 vs. limit=15.0
2023-10-10 12:31:42,368 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=359590.0, ans=0.0
2023-10-10 12:31:49,814 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.357e+02 1.686e+02 1.989e+02 2.251e+02 3.842e+02, threshold=3.977e+02, percent-clipped=2.0
2023-10-10 12:31:55,690 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=359636.6666666667, ans=0.1
2023-10-10 12:32:03,266 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.87 vs. limit=15.0
2023-10-10 12:32:28,685 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=359776.6666666667, ans=0.125
2023-10-10 12:32:38,504 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.01 vs. limit=15.0
2023-10-10 12:33:02,730 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=359916.6666666667, ans=0.0
2023-10-10 12:33:08,462 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.30 vs. limit=22.5
2023-10-10 12:33:25,198 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.89 vs. limit=22.5
2023-10-10 12:33:30,349 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.85 vs. limit=15.0
2023-10-10 12:33:34,481 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.402e+02 1.713e+02 1.890e+02 2.176e+02 3.589e+02, threshold=3.780e+02, percent-clipped=0.0
2023-10-10 12:34:10,724 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.91 vs. limit=6.0
2023-10-10 12:34:12,943 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-10 12:34:43,309 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=360383.3333333333, ans=0.1
2023-10-10 12:34:54,084 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=360430.0, ans=0.2
2023-10-10 12:34:56,033 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=360430.0, ans=0.0
2023-10-10 12:35:22,285 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.408e+02 1.729e+02 1.901e+02 2.175e+02 3.364e+02, threshold=3.802e+02, percent-clipped=0.0
2023-10-10 12:35:30,387 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=360570.0, ans=0.0
2023-10-10 12:35:35,232 INFO [train.py:1031] (3/4) Epoch 6, batch 9000, loss[loss=0.2146, simple_loss=0.3005, pruned_loss=0.0644, over 16624.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3082, pruned_loss=0.07031, over 32408965.03 frames. ], batch size: 61, lr: 6.19e-03, grad_scale: 32.0
2023-10-10 12:35:35,449 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=360616.6666666667, ans=10.0
2023-10-10 12:36:01,297 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-10 12:36:12,979 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=360756.6666666667, ans=0.125
2023-10-10 12:36:36,468 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.26 vs. limit=15.0
2023-10-10 12:37:09,362 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.351e+02 1.717e+02 1.886e+02 2.114e+02 3.208e+02, threshold=3.772e+02, percent-clipped=0.0
2023-10-10 12:37:10,080 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.30 vs. limit=12.0
2023-10-10 12:37:13,254 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.08 vs. limit=15.0
2023-10-10 12:37:20,991 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=361083.3333333333, ans=0.125
2023-10-10 12:37:20,993 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=361083.3333333333, ans=0.07
2023-10-10 12:37:29,836 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=361130.0, ans=0.015
2023-10-10 12:37:48,837 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=361176.6666666667, ans=0.125
2023-10-10 12:37:56,911 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.64 vs. limit=22.5
2023-10-10 12:38:02,790 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2023-10-10 12:38:11,088 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=361270.0, ans=0.125
2023-10-10 12:38:13,665 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=361316.6666666667, ans=0.04949747468305833
2023-10-10 12:38:14,063 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.97 vs. limit=12.0
2023-10-10 12:38:14,691 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=361316.6666666667, ans=0.125
2023-10-10 12:38:40,351 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=361410.0, ans=0.0
2023-10-10 12:38:40,353 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=361410.0, ans=0.0
2023-10-10 12:38:48,491 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.09 vs. limit=22.5
2023-10-10 12:38:52,883 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.445e+02 1.778e+02 1.956e+02 2.221e+02 2.884e+02, threshold=3.912e+02, percent-clipped=0.0
2023-10-10 12:39:00,472 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=361503.3333333333, ans=0.125
2023-10-10 12:39:11,947 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=13.56 vs. limit=15.0
2023-10-10 12:39:12,019 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.91 vs. limit=15.0
2023-10-10 12:39:16,024 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=361596.6666666667, ans=0.0
2023-10-10 12:39:31,943 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2023-10-10 12:39:32,145 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.65 vs. limit=15.0
2023-10-10 12:39:36,182 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=361690.0, ans=0.125
2023-10-10 12:39:37,575 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=361690.0, ans=0.0
2023-10-10 12:39:53,472 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=361736.6666666667, ans=0.2
2023-10-10 12:40:04,410 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=361783.3333333333, ans=0.125
2023-10-10 12:40:10,960 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=361830.0, ans=0.1
2023-10-10 12:40:32,267 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=361923.3333333333, ans=0.025
2023-10-10 12:40:36,549 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.378e+02 1.701e+02 1.935e+02 2.237e+02 4.476e+02, threshold=3.870e+02, percent-clipped=1.0
2023-10-10 12:40:51,404 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=13.05 vs. limit=22.5
2023-10-10 12:41:01,492 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=362063.3333333333, ans=0.2
2023-10-10 12:41:11,728 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=362110.0, ans=0.2
2023-10-10 12:41:16,809 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.69 vs. limit=15.0
2023-10-10 12:41:41,285 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=362203.3333333333, ans=0.5
2023-10-10 12:41:41,410 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=362203.3333333333, ans=0.1
2023-10-10 12:41:47,268 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=362250.0, ans=0.125
2023-10-10 12:42:14,810 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.01 vs. limit=22.5
2023-10-10 12:42:26,191 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.whiten.whitening_limit, batch_count=362390.0, ans=12.0
2023-10-10 12:42:34,109 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.455e+02 1.818e+02 2.005e+02 2.455e+02 5.432e+02, threshold=4.010e+02, percent-clipped=3.0
2023-10-10 12:42:37,376 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=362436.6666666667, ans=0.0
2023-10-10 12:42:40,498 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.16 vs. limit=15.0
2023-10-10 12:43:18,057 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=362623.3333333333, ans=0.125
2023-10-10 12:43:46,972 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=362716.6666666667, ans=0.05
2023-10-10 12:43:50,784 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=362716.6666666667, ans=0.125
2023-10-10 12:43:54,594 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=362763.3333333333, ans=0.0
2023-10-10 12:44:26,765 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=362856.6666666667, ans=0.2
2023-10-10 12:44:28,735 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.374e+02 1.703e+02 1.874e+02 2.099e+02 3.390e+02, threshold=3.748e+02, percent-clipped=0.0
2023-10-10 12:44:38,190 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=362903.3333333333, ans=0.125
2023-10-10 12:44:39,752 INFO [train.py:1031] (3/4) Epoch 6, batch 9500, loss[loss=0.2204, simple_loss=0.3058, pruned_loss=0.0675, over 16849.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.3086, pruned_loss=0.07028, over 32500538.39 frames. ], batch size: 110, lr: 6.17e-03, grad_scale: 32.0
2023-10-10 12:44:49,419 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=362950.0, ans=0.05
2023-10-10 12:44:55,939 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=362996.6666666667, ans=0.125
2023-10-10 12:44:56,985 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=362996.6666666667, ans=0.2
2023-10-10 12:45:09,071 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=363043.3333333333, ans=0.125
2023-10-10 12:45:15,123 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=363090.0, ans=0.1
2023-10-10 12:45:18,605 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.86 vs. limit=6.0
2023-10-10 12:45:55,404 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=363276.6666666667, ans=0.125
2023-10-10 12:45:59,992 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=363276.6666666667, ans=0.125
2023-10-10 12:46:15,795 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.394e+02 1.642e+02 1.802e+02 2.142e+02 3.389e+02, threshold=3.604e+02, percent-clipped=0.0
2023-10-10 12:46:18,117 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=363370.0, ans=0.125
2023-10-10 12:46:29,905 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=363416.6666666667, ans=0.0
2023-10-10 12:46:58,225 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.whiten.whitening_limit, batch_count=363510.0, ans=12.0
2023-10-10 12:47:25,238 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=363650.0, ans=0.125
2023-10-10 12:47:26,197 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=363650.0, ans=0.05
2023-10-10 12:47:36,380 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=363696.6666666667, ans=0.2
2023-10-10 12:47:46,043 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=363696.6666666667, ans=0.04949747468305833
2023-10-10 12:47:58,637 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.38 vs. limit=15.0
2023-10-10 12:48:08,989 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.457e+02 1.726e+02 1.940e+02 2.105e+02 3.124e+02, threshold=3.879e+02, percent-clipped=0.0
2023-10-10 12:48:20,459 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=363883.3333333333, ans=0.0
2023-10-10 12:48:21,354 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=363883.3333333333, ans=0.0
2023-10-10 12:48:22,472 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=363883.3333333333, ans=0.05
2023-10-10 12:48:42,932 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.72 vs. limit=12.0
2023-10-10 12:48:44,590 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=363976.6666666667, ans=0.125
2023-10-10 12:48:50,894 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=363976.6666666667, ans=0.1
2023-10-10 12:48:51,040 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.33 vs. limit=22.5
2023-10-10 12:48:52,186 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=16.38 vs. limit=15.0
2023-10-10 12:48:59,733 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=364023.3333333333, ans=0.125
2023-10-10 12:49:16,703 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=364116.6666666667, ans=0.1
2023-10-10 12:49:28,303 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.75 vs. limit=22.5
2023-10-10 12:49:28,351 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.10 vs. limit=15.0
2023-10-10 12:49:36,591 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=364210.0, ans=0.125
2023-10-10 12:49:55,081 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=364256.6666666667, ans=0.0
2023-10-10 12:49:57,641 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.415e+02 1.687e+02 1.826e+02 2.048e+02 3.041e+02, threshold=3.652e+02, percent-clipped=0.0
2023-10-10 12:49:58,245 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.86 vs. limit=10.0
2023-10-10 12:50:12,357 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=364350.0, ans=0.035
2023-10-10 12:50:15,641 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_na.min_abs, batch_count=364350.0, ans=0.02
2023-10-10 12:50:21,325 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-10 12:50:23,415 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.82 vs. limit=15.0
2023-10-10 12:50:33,975 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=364443.3333333333, ans=0.125
2023-10-10 12:50:36,394 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=364443.3333333333, ans=0.05
2023-10-10 12:50:46,592 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=364490.0, ans=0.0
2023-10-10 12:50:52,456 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.53 vs. limit=5.0
2023-10-10 12:50:53,803 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.84 vs. limit=6.0
2023-10-10 12:51:01,307 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=364536.6666666667, ans=0.125
2023-10-10 12:51:12,116 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=364583.3333333333, ans=0.0
2023-10-10 12:51:20,939 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.01 vs. limit=10.0
2023-10-10 12:51:23,861 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=364630.0, ans=0.0
2023-10-10 12:51:39,491 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=364723.3333333333, ans=0.125
2023-10-10 12:51:47,353 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.405e+02 1.689e+02 1.900e+02 2.177e+02 3.254e+02, threshold=3.800e+02, percent-clipped=0.0
2023-10-10 12:52:12,875 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=364863.3333333333, ans=0.2
2023-10-10 12:52:13,721 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=364863.3333333333, ans=0.125
2023-10-10 12:52:14,470 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=364863.3333333333, ans=0.0
2023-10-10 12:52:20,331 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=364910.0, ans=0.125
2023-10-10 12:52:44,311 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=365003.3333333333, ans=0.125
2023-10-10 12:52:44,399 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=365003.3333333333, ans=0.0
2023-10-10 12:52:57,346 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=365050.0, ans=0.125
2023-10-10 12:53:02,825 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=365096.6666666667, ans=0.0
2023-10-10 12:53:18,090 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=365143.3333333333, ans=0.2
2023-10-10 12:53:31,113 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.291e+02 1.734e+02 2.013e+02 2.347e+02 3.330e+02, threshold=4.025e+02, percent-clipped=0.0
2023-10-10 12:53:35,078 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.07 vs. limit=22.5
2023-10-10 12:53:37,727 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=365236.6666666667, ans=0.0
2023-10-10 12:53:42,257 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.76 vs. limit=15.0
2023-10-10 12:53:42,701 INFO [train.py:1031] (3/4) Epoch 6, batch 10000, loss[loss=0.2291, simple_loss=0.3187, pruned_loss=0.06974, over 16936.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.3074, pruned_loss=0.0696, over 32566299.38 frames. ], batch size: 104, lr: 6.15e-03, grad_scale: 32.0
2023-10-10 12:53:59,937 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=365330.0, ans=0.0
2023-10-10 12:54:18,687 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=365423.3333333333, ans=15.0
2023-10-10 12:54:46,927 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=365563.3333333333, ans=0.0
2023-10-10 12:54:52,048 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=365563.3333333333, ans=0.0
2023-10-10 12:54:56,491 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.99 vs. limit=6.0
2023-10-10 12:55:02,859 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=365610.0, ans=0.1
2023-10-10 12:55:02,894 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=365610.0, ans=0.2
2023-10-10 12:55:10,232 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=365610.0, ans=0.125
2023-10-10 12:55:11,505 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=365656.6666666667, ans=0.0
2023-10-10 12:55:20,775 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.418e+02 1.742e+02 1.931e+02 2.160e+02 3.075e+02, threshold=3.862e+02, percent-clipped=0.0
2023-10-10 12:55:21,110 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=365703.3333333333, ans=0.2
2023-10-10 12:55:54,483 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=365843.3333333333, ans=0.125
2023-10-10 12:56:07,041 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=365890.0, ans=0.0
2023-10-10 12:56:14,054 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=365890.0, ans=0.125
2023-10-10 12:56:14,910 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=365936.6666666667, ans=0.0
2023-10-10 12:56:28,308 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=365983.3333333333, ans=0.125
2023-10-10 12:56:43,738 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=366030.0, ans=0.125
2023-10-10 12:56:48,471 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.68 vs. limit=22.5
2023-10-10 12:56:57,839 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=366123.3333333333, ans=0.125
2023-10-10 12:57:06,735 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.427e+02 1.762e+02 2.037e+02 2.354e+02 3.380e+02, threshold=4.074e+02, percent-clipped=0.0
2023-10-10 12:57:22,647 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.06 vs. limit=22.5
2023-10-10 12:57:26,765 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer_ff3.min_abs, batch_count=366216.6666666667, ans=0.2
2023-10-10 12:57:56,532 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.54 vs. limit=15.0
2023-10-10 12:57:57,089 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=366356.6666666667, ans=0.125
2023-10-10 12:58:09,086 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=366403.3333333333, ans=0.1
2023-10-10 12:58:14,137 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.52 vs. limit=15.0
2023-10-10 12:58:18,235 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=366450.0, ans=0.125
2023-10-10 12:58:26,780 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=366450.0, ans=0.04949747468305833
2023-10-10 12:58:31,280 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=366496.6666666667, ans=0.125
2023-10-10 12:58:39,084 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=366543.3333333333, ans=0.125
2023-10-10 12:58:41,795 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=366543.3333333333, ans=0.04949747468305833
2023-10-10 12:58:48,435 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.29 vs. limit=15.0
2023-10-10 12:58:55,249 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=366590.0, ans=0.2
2023-10-10 12:59:05,157 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.356e+02 1.657e+02 1.828e+02 2.102e+02 3.735e+02, threshold=3.656e+02, percent-clipped=0.0
2023-10-10 12:59:14,671 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=366636.6666666667, ans=0.0
2023-10-10 12:59:18,467 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=366683.3333333333, ans=0.2
2023-10-10 12:59:24,462 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=366683.3333333333, ans=0.125
2023-10-10 12:59:26,673 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.65 vs. limit=15.0
2023-10-10 12:59:28,152 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=366730.0, ans=0.1
2023-10-10 12:59:49,751 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=366823.3333333333, ans=0.125
2023-10-10 13:00:20,214 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=366916.6666666667, ans=0.2
2023-10-10 13:00:24,719 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=366963.3333333333, ans=0.125
2023-10-10 13:00:29,976 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=366963.3333333333, ans=0.125
2023-10-10 13:00:33,126 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.21 vs. limit=15.0
2023-10-10 13:00:45,062 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=367010.0, ans=0.125
2023-10-10 13:00:56,568 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.305e+02 1.637e+02 1.786e+02 1.945e+02 3.199e+02, threshold=3.572e+02, percent-clipped=0.0
2023-10-10 13:01:56,114 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=367290.0, ans=0.1
2023-10-10 13:02:05,646 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=367336.6666666667, ans=0.04949747468305833
2023-10-10 13:02:13,200 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=367383.3333333333, ans=0.125
2023-10-10 13:02:15,283 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=367383.3333333333, ans=0.2
2023-10-10 13:02:15,294 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=367383.3333333333, ans=0.125
2023-10-10 13:02:37,179 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=367476.6666666667, ans=0.2
2023-10-10 13:02:46,316 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=367523.3333333333, ans=0.0
2023-10-10 13:02:51,636 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.398e+02 1.737e+02 1.918e+02 2.197e+02 3.372e+02, threshold=3.836e+02, percent-clipped=0.0
2023-10-10 13:02:55,895 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.67 vs. limit=15.0
2023-10-10 13:02:59,838 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.75 vs. limit=22.5
2023-10-10 13:03:00,658 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=367616.6666666667, ans=0.125
2023-10-10 13:03:01,391 INFO [train.py:1031] (3/4) Epoch 6, batch 10500, loss[loss=0.2186, simple_loss=0.3055, pruned_loss=0.06585, over 16629.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.3078, pruned_loss=0.06978, over 32590288.59 frames. ], batch size: 219, lr: 6.14e-03, grad_scale: 32.0
2023-10-10 13:03:09,577 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.82 vs. limit=22.5
2023-10-10 13:03:17,384 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-10 13:03:34,900 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=367756.6666666667, ans=0.0
2023-10-10 13:03:38,917 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=367756.6666666667, ans=0.0
2023-10-10 13:03:46,239 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=367803.3333333333, ans=0.0
2023-10-10 13:04:01,681 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=367850.0, ans=0.0
2023-10-10 13:04:42,879 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.420e+02 1.728e+02 1.994e+02 2.242e+02 4.006e+02, threshold=3.989e+02, percent-clipped=1.0
2023-10-10 13:04:49,300 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.99 vs. limit=10.0
2023-10-10 13:05:06,159 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.97 vs. limit=22.5
2023-10-10 13:05:10,007 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=368130.0, ans=0.125
2023-10-10 13:05:31,513 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.00 vs. limit=6.0
2023-10-10 13:05:44,002 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=368270.0, ans=0.125
2023-10-10 13:05:59,085 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=368316.6666666667, ans=0.025
2023-10-10 13:06:32,474 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=368456.6666666667, ans=0.125
2023-10-10 13:06:36,021 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.414e+02 1.731e+02 1.969e+02 2.237e+02 3.458e+02, threshold=3.939e+02, percent-clipped=0.0
2023-10-10 13:06:39,427 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=368503.3333333333, ans=0.125
2023-10-10 13:06:39,544 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.50 vs. limit=22.5
2023-10-10 13:06:51,445 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=4.33 vs. limit=12.0
2023-10-10 13:07:04,047 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=368596.6666666667, ans=0.125
2023-10-10 13:07:29,441 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=368736.6666666667, ans=0.0
2023-10-10 13:07:37,548 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.41 vs. limit=15.0
2023-10-10 13:07:40,916 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.99 vs. limit=15.0
2023-10-10 13:08:04,498 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=368876.6666666667, ans=0.0
2023-10-10 13:08:24,384 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.570e+02 1.886e+02 2.129e+02 2.476e+02 3.785e+02, threshold=4.257e+02, percent-clipped=0.0
2023-10-10 13:08:26,102 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.63 vs. limit=15.0
2023-10-10 13:08:51,511 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=369063.3333333333, ans=0.1
2023-10-10 13:08:51,553 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=369063.3333333333, ans=0.0
2023-10-10 13:09:11,161 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=369156.6666666667, ans=0.125
2023-10-10 13:09:35,075 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=369250.0, ans=0.125
2023-10-10 13:09:37,013 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=369250.0, ans=0.2
2023-10-10 13:09:37,868 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=369296.6666666667, ans=0.2
2023-10-10 13:09:41,880 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.58 vs. limit=12.0
2023-10-10 13:09:44,027 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=369296.6666666667, ans=0.1
2023-10-10 13:10:15,361 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.217e+02 1.749e+02 1.914e+02 2.119e+02 3.101e+02, threshold=3.827e+02, percent-clipped=0.0
2023-10-10 13:10:16,798 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=369436.6666666667, ans=0.125
2023-10-10 13:10:32,994 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=369483.3333333333, ans=0.05
2023-10-10 13:10:33,731 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=369530.0, ans=0.125
2023-10-10 13:10:41,812 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.57 vs. limit=15.0
2023-10-10 13:10:53,734 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.98 vs. limit=15.0
2023-10-10 13:11:20,445 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=369716.6666666667, ans=0.125
2023-10-10 13:11:27,960 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=369763.3333333333, ans=0.0
2023-10-10 13:11:29,302 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.51 vs. limit=15.0
2023-10-10 13:11:44,145 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.59 vs. limit=22.5
2023-10-10 13:11:45,709 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-10 13:12:02,765 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.250e+02 1.716e+02 2.021e+02 2.413e+02 3.624e+02, threshold=4.042e+02, percent-clipped=0.0
2023-10-10 13:12:11,169 INFO [train.py:1031] (3/4) Epoch 6, batch 11000, loss[loss=0.2368, simple_loss=0.3248, pruned_loss=0.07445, over 16892.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.3078, pruned_loss=0.0698, over 32625016.11 frames. ], batch size: 87, lr: 6.12e-03, grad_scale: 16.0
2023-10-10 13:12:25,164 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=369996.6666666667, ans=0.1
2023-10-10 13:12:28,111 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.07 vs. limit=15.0
2023-10-10 13:12:30,603 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=369996.6666666667, ans=0.125
2023-10-10 13:12:31,764 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=370043.3333333333, ans=10.0
2023-10-10 13:12:40,957 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=370043.3333333333, ans=0.1
2023-10-10 13:12:53,527 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=370136.6666666667, ans=0.125
2023-10-10 13:13:12,324 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=370183.3333333333, ans=0.1
2023-10-10 13:13:21,181 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=370230.0, ans=0.125
2023-10-10 13:13:27,913 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=370230.0, ans=0.0
2023-10-10 13:13:29,335 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=370276.6666666667, ans=0.0
2023-10-10 13:13:45,342 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=370323.3333333333, ans=0.125
2023-10-10 13:13:47,310 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=370323.3333333333, ans=0.0
2023-10-10 13:13:52,483 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.465e+02 1.716e+02 1.912e+02 2.182e+02 2.801e+02, threshold=3.825e+02, percent-clipped=0.0
2023-10-10 13:14:13,861 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=370416.6666666667, ans=0.0
2023-10-10 13:14:29,258 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=370510.0, ans=0.125
2023-10-10 13:14:46,154 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=370556.6666666667, ans=0.125
2023-10-10 13:15:02,482 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff2.min_abs, batch_count=370603.3333333333, ans=0.1
2023-10-10 13:15:06,855 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=370650.0, ans=0.2
2023-10-10 13:15:24,945 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=370743.3333333333, ans=0.125
2023-10-10 13:15:47,619 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.312e+02 1.640e+02 1.802e+02 2.034e+02 2.739e+02, threshold=3.604e+02, percent-clipped=0.0
2023-10-10 13:15:58,216 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=370883.3333333333, ans=0.0
2023-10-10 13:16:04,798 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=370883.3333333333, ans=15.0
2023-10-10 13:16:36,490 INFO [scaling.py:199]
(3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=371023.3333333333, ans=0.125 2023-10-10 13:16:37,998 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=371023.3333333333, ans=0.125 2023-10-10 13:16:38,901 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=371023.3333333333, ans=0.1 2023-10-10 13:16:46,393 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=371070.0, ans=0.0 2023-10-10 13:16:48,625 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=371070.0, ans=0.2 2023-10-10 13:16:51,498 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=371116.6666666667, ans=0.05 2023-10-10 13:16:51,523 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=371116.6666666667, ans=0.0 2023-10-10 13:17:12,954 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.21 vs. limit=15.0 2023-10-10 13:17:13,804 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 13:17:20,175 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=371210.0, ans=0.2 2023-10-10 13:17:33,868 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=371256.6666666667, ans=0.125 2023-10-10 13:17:41,192 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.247e+02 1.807e+02 1.999e+02 2.343e+02 3.683e+02, threshold=3.997e+02, percent-clipped=1.0 2023-10-10 13:17:48,355 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=371303.3333333333, ans=0.1 2023-10-10 13:18:11,647 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=371396.6666666667, ans=0.0 2023-10-10 13:18:40,738 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.41 vs. limit=12.0 2023-10-10 13:19:31,968 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.386e+02 1.653e+02 1.869e+02 2.128e+02 3.404e+02, threshold=3.739e+02, percent-clipped=0.0 2023-10-10 13:19:37,090 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=371770.0, ans=0.2 2023-10-10 13:19:41,429 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.98 vs. limit=15.0 2023-10-10 13:19:46,018 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=371816.6666666667, ans=0.125 2023-10-10 13:19:51,031 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.43 vs. 
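[Editor's note: the very frequent "ScheduledFloat: name=..., batch_count=..., ans=..." entries record a hyperparameter whose current value (ans) is looked up from a schedule keyed on batch_count: dropout probabilities, skip rates, balancer probs, bypass scale_min values and so on. A minimal piecewise-linear lookup of that shape is sketched below; it is illustrative, not the ScheduledFloat class in scaling.py, and the breakpoints in the usage line are invented.]

from bisect import bisect_right

def scheduled_float(points, batch_count):
    """points: sorted (batch_count, value) pairs; linear interpolation between them."""
    xs = [x for x, _ in points]
    i = bisect_right(xs, batch_count)
    if i == 0:
        return points[0][1]            # before the first breakpoint
    if i == len(points):
        return points[-1][1]           # past the last breakpoint
    (x0, y0), (x1, y1) = points[i - 1], points[i]
    return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

# hypothetical schedule: decay a skip rate from 0.2 to 0.0 over the first 20k batch counts
print(scheduled_float([(0.0, 0.2), (20000.0, 0.0)], 369063.3))  # -> 0.0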
limit=15.0 2023-10-10 13:21:08,471 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.35 vs. limit=15.0 2023-10-10 13:21:09,211 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=372190.0, ans=0.125 2023-10-10 13:21:22,344 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.410e+02 1.854e+02 2.045e+02 2.458e+02 3.775e+02, threshold=4.090e+02, percent-clipped=1.0 2023-10-10 13:21:32,043 INFO [train.py:1031] (3/4) Epoch 6, batch 11500, loss[loss=0.2194, simple_loss=0.3212, pruned_loss=0.05886, over 16849.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.3075, pruned_loss=0.06965, over 32648341.84 frames. ], batch size: 104, lr: 6.10e-03, grad_scale: 32.0 2023-10-10 13:21:32,369 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=372283.3333333333, ans=0.0 2023-10-10 13:21:33,864 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=372283.3333333333, ans=0.125 2023-10-10 13:21:41,100 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.51 vs. limit=22.5 2023-10-10 13:21:41,821 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=372330.0, ans=0.1 2023-10-10 13:22:03,544 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=372423.3333333333, ans=0.125 2023-10-10 13:22:17,452 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=372470.0, ans=0.0 2023-10-10 13:22:24,646 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=372470.0, ans=0.1 2023-10-10 13:22:31,446 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=372516.6666666667, ans=0.125 2023-10-10 13:22:52,261 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=372610.0, ans=0.2 2023-10-10 13:23:06,980 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.94 vs. 
limit=22.5 2023-10-10 13:23:16,222 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.480e+02 1.663e+02 1.941e+02 2.216e+02 3.108e+02, threshold=3.883e+02, percent-clipped=0.0 2023-10-10 13:24:09,977 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=372890.0, ans=0.125 2023-10-10 13:24:15,259 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=372936.6666666667, ans=0.125 2023-10-10 13:24:40,167 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=373030.0, ans=0.0 2023-10-10 13:24:50,221 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 13:25:05,684 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.264e+02 1.595e+02 1.777e+02 2.084e+02 3.083e+02, threshold=3.553e+02, percent-clipped=0.0 2023-10-10 13:25:12,083 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=373170.0, ans=0.5 2023-10-10 13:25:27,689 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=373263.3333333333, ans=0.0 2023-10-10 13:26:39,825 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=373496.6666666667, ans=0.125 2023-10-10 13:27:01,707 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 13:27:08,278 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.381e+02 1.664e+02 1.787e+02 1.993e+02 3.081e+02, threshold=3.575e+02, percent-clipped=0.0 2023-10-10 13:27:08,654 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=373636.6666666667, ans=0.125 2023-10-10 13:27:16,708 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.25 vs. limit=12.0 2023-10-10 13:27:34,604 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=373730.0, ans=0.125 2023-10-10 13:27:58,228 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.96 vs. limit=22.5 2023-10-10 13:28:04,969 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.55 vs. limit=22.5 2023-10-10 13:28:09,438 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=373870.0, ans=0.125 2023-10-10 13:28:12,125 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=373870.0, ans=0.125 2023-10-10 13:28:15,416 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.37 vs. limit=12.0 2023-10-10 13:28:15,422 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.47 vs. 
limit=15.0 2023-10-10 13:28:25,551 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.97 vs. limit=15.0 2023-10-10 13:28:39,226 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=374010.0, ans=0.0 2023-10-10 13:28:52,203 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 13:29:05,601 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=374103.3333333333, ans=0.125 2023-10-10 13:29:06,557 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.451e+02 1.777e+02 2.003e+02 2.440e+02 3.027e+02, threshold=4.006e+02, percent-clipped=0.0 2023-10-10 13:29:08,396 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=374103.3333333333, ans=0.0 2023-10-10 13:29:12,447 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.26 vs. limit=15.0 2023-10-10 13:29:15,000 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.52 vs. limit=15.0 2023-10-10 13:29:29,004 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=374196.6666666667, ans=0.0 2023-10-10 13:29:39,831 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=3.93 vs. limit=12.0 2023-10-10 13:29:45,728 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=374243.3333333333, ans=0.125 2023-10-10 13:29:48,346 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=374290.0, ans=0.015 2023-10-10 13:29:58,905 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=374336.6666666667, ans=0.125 2023-10-10 13:30:17,616 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.41 vs. limit=15.0 2023-10-10 13:30:22,440 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=374430.0, ans=0.0 2023-10-10 13:30:35,041 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.68 vs. 
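[Editor's note: the "Whitening: name=..., num_groups=..., num_channels=..., metric=X vs. limit=Y" entries compare a measured non-whiteness statistic of a module's activations against a scheduled limit; large values (e.g. metric=19.59 vs. limit=22.5 earlier in this excerpt) mean the feature covariance within each channel group is far from isotropic. The exact statistic lives in scaling.py; the sketch below shows one plausible metric of this kind, which equals 1.0 for a perfectly white covariance and grows as the eigenvalue spectrum becomes uneven. It is an assumption for illustration, not necessarily the logged formula.]

import torch

def whiteness_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    """x: (frames, channels). Returns d * tr(C @ C) / tr(C)^2 averaged over groups."""
    vals = []
    for g in x.chunk(num_groups, dim=1):
        g = g - g.mean(dim=0, keepdim=True)
        c = (g.T @ g) / g.shape[0]               # group covariance, (d, d)
        d = c.shape[0]
        vals.append(d * torch.trace(c @ c) / torch.trace(c) ** 2)
    return torch.stack(vals).mean().item()

print(whiteness_metric(torch.randn(10000, 384)))  # ~1.0 for white features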
limit=15.0 2023-10-10 13:30:38,455 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=374476.6666666667, ans=0.0 2023-10-10 13:30:49,363 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=374523.3333333333, ans=0.0 2023-10-10 13:30:53,218 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=374570.0, ans=0.1 2023-10-10 13:30:54,828 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.299e+02 1.656e+02 1.858e+02 2.088e+02 2.828e+02, threshold=3.716e+02, percent-clipped=0.0 2023-10-10 13:30:55,512 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.42 vs. limit=15.0 2023-10-10 13:30:59,681 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=16.11 vs. limit=22.5 2023-10-10 13:31:04,971 INFO [train.py:1031] (3/4) Epoch 6, batch 12000, loss[loss=0.2358, simple_loss=0.3206, pruned_loss=0.07554, over 16825.00 frames. ], tot_loss[loss=0.223, simple_loss=0.3075, pruned_loss=0.06928, over 32677130.49 frames. ], batch size: 72, lr: 6.08e-03, grad_scale: 32.0 2023-10-10 13:31:16,692 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.82 vs. limit=22.5 2023-10-10 13:31:19,923 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=374663.3333333333, ans=0.2 2023-10-10 13:31:20,970 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=374663.3333333333, ans=0.125 2023-10-10 13:31:47,400 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=374803.3333333333, ans=0.125 2023-10-10 13:31:49,794 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.12 vs. limit=15.0 2023-10-10 13:31:49,835 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.88 vs. 
limit=15.0 2023-10-10 13:31:59,470 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=374850.0, ans=0.0 2023-10-10 13:32:07,296 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=374850.0, ans=0.05 2023-10-10 13:32:16,346 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=374896.6666666667, ans=0.04949747468305833 2023-10-10 13:32:16,459 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=374896.6666666667, ans=0.125 2023-10-10 13:32:22,238 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=374943.3333333333, ans=0.015 2023-10-10 13:32:22,331 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=374943.3333333333, ans=0.2 2023-10-10 13:32:38,538 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=374990.0, ans=0.1 2023-10-10 13:32:45,052 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.507e+02 1.758e+02 2.001e+02 2.500e+02 3.638e+02, threshold=4.002e+02, percent-clipped=0.0 2023-10-10 13:32:47,393 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=375036.6666666667, ans=0.0 2023-10-10 13:32:57,758 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=375083.3333333333, ans=0.125 2023-10-10 13:33:03,639 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=375130.0, ans=0.04949747468305833 2023-10-10 13:33:12,055 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=375130.0, ans=0.125 2023-10-10 13:33:34,652 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=375270.0, ans=0.0 2023-10-10 13:33:39,208 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=375270.0, ans=0.1 2023-10-10 13:33:42,578 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=375270.0, ans=0.1 2023-10-10 13:33:47,604 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=375316.6666666667, ans=0.125 2023-10-10 13:34:00,961 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=375363.3333333333, ans=0.125 2023-10-10 13:34:17,815 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=375456.6666666667, ans=0.125 2023-10-10 13:34:21,766 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.10 vs. 
limit=15.0 2023-10-10 13:34:22,577 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=375456.6666666667, ans=0.2 2023-10-10 13:34:24,505 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=375456.6666666667, ans=0.2 2023-10-10 13:34:28,832 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.392e+02 1.736e+02 2.041e+02 2.376e+02 4.195e+02, threshold=4.081e+02, percent-clipped=1.0 2023-10-10 13:34:40,265 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=375550.0, ans=0.1 2023-10-10 13:34:48,141 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=375596.6666666667, ans=0.1 2023-10-10 13:34:53,383 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.max_abs, batch_count=375596.6666666667, ans=10.0 2023-10-10 13:35:06,587 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.54 vs. limit=22.5 2023-10-10 13:35:10,351 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.89 vs. limit=15.0 2023-10-10 13:35:31,170 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 13:35:32,315 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=375783.3333333333, ans=0.125 2023-10-10 13:35:58,574 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.73 vs. limit=15.0 2023-10-10 13:36:03,130 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.86 vs. 
limit=22.5 2023-10-10 13:36:06,238 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=375923.3333333333, ans=0.125 2023-10-10 13:36:18,355 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.375e+02 1.869e+02 2.159e+02 2.474e+02 4.078e+02, threshold=4.318e+02, percent-clipped=0.0 2023-10-10 13:36:30,256 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=376016.6666666667, ans=0.125 2023-10-10 13:36:36,672 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=376063.3333333333, ans=0.1 2023-10-10 13:36:47,095 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=376110.0, ans=0.07 2023-10-10 13:36:56,735 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=376156.6666666667, ans=0.125 2023-10-10 13:37:07,577 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=376203.3333333333, ans=0.125 2023-10-10 13:37:17,603 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=376203.3333333333, ans=0.125 2023-10-10 13:37:20,125 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=376250.0, ans=0.0 2023-10-10 13:37:25,719 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=376250.0, ans=0.2 2023-10-10 13:37:42,744 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.01 vs. limit=10.0 2023-10-10 13:37:53,429 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=376343.3333333333, ans=0.125 2023-10-10 13:37:53,589 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=376343.3333333333, ans=0.0 2023-10-10 13:38:08,077 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.54 vs. limit=15.0 2023-10-10 13:38:09,996 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.404e+02 1.718e+02 2.011e+02 2.395e+02 3.700e+02, threshold=4.023e+02, percent-clipped=0.0 2023-10-10 13:38:12,378 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.35 vs. limit=15.0 2023-10-10 13:38:34,820 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=376530.0, ans=0.125 2023-10-10 13:38:48,132 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.23 vs. 
limit=15.0 2023-10-10 13:38:59,715 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=376623.3333333333, ans=0.125 2023-10-10 13:39:03,121 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=376670.0, ans=0.125 2023-10-10 13:39:18,796 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=376716.6666666667, ans=0.0 2023-10-10 13:39:20,881 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=376716.6666666667, ans=0.125 2023-10-10 13:40:00,486 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.372e+02 1.772e+02 1.997e+02 2.390e+02 3.375e+02, threshold=3.993e+02, percent-clipped=0.0 2023-10-10 13:40:02,708 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=376903.3333333333, ans=0.0 2023-10-10 13:40:06,896 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=376903.3333333333, ans=0.1 2023-10-10 13:40:07,932 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=376950.0, ans=0.0 2023-10-10 13:40:08,622 INFO [train.py:1031] (3/4) Epoch 6, batch 12500, loss[loss=0.1996, simple_loss=0.2954, pruned_loss=0.05191, over 16892.00 frames. ], tot_loss[loss=0.223, simple_loss=0.3073, pruned_loss=0.06933, over 32711846.17 frames. ], batch size: 82, lr: 6.06e-03, grad_scale: 32.0 2023-10-10 13:40:13,096 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 13:40:16,271 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.37 vs. limit=15.0 2023-10-10 13:40:18,898 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=376996.6666666667, ans=0.2 2023-10-10 13:41:02,361 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=377183.3333333333, ans=0.125 2023-10-10 13:41:21,231 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=377276.6666666667, ans=0.125 2023-10-10 13:41:25,190 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.76 vs. 
limit=15.0 2023-10-10 13:41:44,030 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.399e+02 1.663e+02 1.795e+02 2.126e+02 3.091e+02, threshold=3.590e+02, percent-clipped=0.0 2023-10-10 13:41:47,175 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=377370.0, ans=0.2 2023-10-10 13:41:49,756 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=377370.0, ans=0.0 2023-10-10 13:41:54,329 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=377416.6666666667, ans=0.0 2023-10-10 13:41:55,098 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=377416.6666666667, ans=0.125 2023-10-10 13:42:13,011 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=377510.0, ans=0.0 2023-10-10 13:42:23,757 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=377510.0, ans=0.0 2023-10-10 13:43:00,407 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=377696.6666666667, ans=0.1 2023-10-10 13:43:02,569 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.13 vs. limit=15.0 2023-10-10 13:43:04,787 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=377696.6666666667, ans=0.125 2023-10-10 13:43:34,325 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.222e+02 1.681e+02 2.004e+02 2.331e+02 3.544e+02, threshold=4.009e+02, percent-clipped=0.0 2023-10-10 13:43:42,561 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=377883.3333333333, ans=0.125 2023-10-10 13:43:48,870 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=377883.3333333333, ans=0.0 2023-10-10 13:44:10,952 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=377976.6666666667, ans=0.125 2023-10-10 13:44:13,968 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=378023.3333333333, ans=0.125 2023-10-10 13:44:14,901 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=378023.3333333333, ans=0.0 2023-10-10 13:44:29,107 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=378070.0, ans=0.125 2023-10-10 13:44:47,581 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.31 vs. 
limit=15.0 2023-10-10 13:44:54,028 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=378163.3333333333, ans=0.0 2023-10-10 13:45:15,635 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=378256.6666666667, ans=0.1 2023-10-10 13:45:19,338 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=378303.3333333333, ans=0.0 2023-10-10 13:45:19,899 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.349e+02 1.759e+02 1.949e+02 2.206e+02 3.201e+02, threshold=3.899e+02, percent-clipped=0.0 2023-10-10 13:45:40,131 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=378396.6666666667, ans=10.0 2023-10-10 13:45:41,291 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.80 vs. limit=10.0 2023-10-10 13:45:52,629 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.29 vs. limit=5.0 2023-10-10 13:46:07,762 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=378490.0, ans=0.125 2023-10-10 13:46:17,081 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=378536.6666666667, ans=15.0 2023-10-10 13:46:43,606 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=378630.0, ans=0.125 2023-10-10 13:46:45,908 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=378676.6666666667, ans=0.2 2023-10-10 13:46:48,489 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=378676.6666666667, ans=0.0 2023-10-10 13:46:49,677 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.83 vs. limit=6.0 2023-10-10 13:46:54,602 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=6.95 vs. limit=15.0 2023-10-10 13:47:02,559 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=378723.3333333333, ans=0.1 2023-10-10 13:47:13,160 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.386e+02 1.735e+02 1.991e+02 2.302e+02 3.344e+02, threshold=3.982e+02, percent-clipped=0.0 2023-10-10 13:47:13,647 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=378770.0, ans=0.125 2023-10-10 13:47:22,964 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=378816.6666666667, ans=0.1 2023-10-10 13:47:27,206 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.03 vs. 
limit=22.5 2023-10-10 13:47:53,079 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=378956.6666666667, ans=0.125 2023-10-10 13:48:20,086 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=379050.0, ans=0.1 2023-10-10 13:48:31,318 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=379096.6666666667, ans=0.0 2023-10-10 13:48:39,403 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=379143.3333333333, ans=0.125 2023-10-10 13:48:45,414 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.12 vs. limit=12.0 2023-10-10 13:48:50,791 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=379190.0, ans=0.125 2023-10-10 13:48:50,911 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=379190.0, ans=0.2 2023-10-10 13:48:53,820 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.79 vs. limit=15.0 2023-10-10 13:48:58,820 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.452e+02 1.715e+02 1.925e+02 2.276e+02 3.482e+02, threshold=3.851e+02, percent-clipped=0.0 2023-10-10 13:49:06,786 INFO [train.py:1031] (3/4) Epoch 6, batch 13000, loss[loss=0.2222, simple_loss=0.2766, pruned_loss=0.08392, over 12225.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.3078, pruned_loss=0.06941, over 32731140.89 frames. ], batch size: 440, lr: 6.04e-03, grad_scale: 16.0 2023-10-10 13:49:11,645 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=379283.3333333333, ans=0.0 2023-10-10 13:49:28,749 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=379330.0, ans=0.1 2023-10-10 13:50:08,342 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=379516.6666666667, ans=0.125 2023-10-10 13:50:18,236 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=379563.3333333333, ans=0.1 2023-10-10 13:50:21,146 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=379563.3333333333, ans=0.125 2023-10-10 13:50:38,603 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.12 vs. 
limit=15.0 2023-10-10 13:50:57,652 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.339e+02 1.668e+02 1.952e+02 2.206e+02 2.975e+02, threshold=3.904e+02, percent-clipped=0.0 2023-10-10 13:51:05,046 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=379750.0, ans=0.0 2023-10-10 13:51:31,037 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=379843.3333333333, ans=0.2 2023-10-10 13:51:35,242 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.41 vs. limit=15.0 2023-10-10 13:51:50,672 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=379936.6666666667, ans=0.125 2023-10-10 13:51:59,786 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.14 vs. limit=22.5 2023-10-10 13:52:03,391 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 13:52:04,350 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=379983.3333333333, ans=0.0 2023-10-10 13:52:13,387 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=380030.0, ans=0.0 2023-10-10 13:52:13,484 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=380030.0, ans=0.1 2023-10-10 13:52:44,970 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=380123.3333333333, ans=0.07 2023-10-10 13:52:45,043 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 13:52:52,001 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.320e+02 1.899e+02 2.307e+02 2.684e+02 3.932e+02, threshold=4.614e+02, percent-clipped=1.0 2023-10-10 13:53:11,136 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=380263.3333333333, ans=0.125 2023-10-10 13:53:14,152 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.76 vs. limit=15.0 2023-10-10 13:53:33,015 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=380356.6666666667, ans=0.125 2023-10-10 13:53:40,781 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=380403.3333333333, ans=0.125 2023-10-10 13:53:45,146 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.77 vs. limit=22.5 2023-10-10 13:53:45,177 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.13 vs. 
limit=15.0 2023-10-10 13:53:51,929 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=380450.0, ans=0.0 2023-10-10 13:53:56,033 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=380450.0, ans=0.125 2023-10-10 13:54:04,282 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff2.min_abs, batch_count=380496.6666666667, ans=0.1 2023-10-10 13:54:05,648 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=380496.6666666667, ans=0.025 2023-10-10 13:54:17,223 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.68 vs. limit=22.5 2023-10-10 13:54:17,810 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=380543.3333333333, ans=0.125 2023-10-10 13:54:39,092 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.278e+02 1.693e+02 1.907e+02 2.184e+02 2.766e+02, threshold=3.815e+02, percent-clipped=0.0 2023-10-10 13:54:49,815 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.63 vs. limit=10.0 2023-10-10 13:54:53,607 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=380683.3333333333, ans=0.0 2023-10-10 13:55:03,703 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=380730.0, ans=0.2 2023-10-10 13:55:08,043 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.11 vs. limit=15.0 2023-10-10 13:55:10,577 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=380776.6666666667, ans=0.125 2023-10-10 13:55:13,600 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=380776.6666666667, ans=0.09899494936611666 2023-10-10 13:55:14,825 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.19 vs. limit=22.5 2023-10-10 13:55:42,122 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=380916.6666666667, ans=0.0 2023-10-10 13:55:48,578 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=380963.3333333333, ans=0.2 2023-10-10 13:55:58,205 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.87 vs. 
limit=15.0 2023-10-10 13:55:58,725 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=381010.0, ans=0.125 2023-10-10 13:55:59,780 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=381010.0, ans=0.1 2023-10-10 13:56:06,399 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=381010.0, ans=0.0 2023-10-10 13:56:07,720 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=381010.0, ans=0.125 2023-10-10 13:56:24,572 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.429e+02 1.813e+02 2.019e+02 2.311e+02 3.291e+02, threshold=4.039e+02, percent-clipped=0.0 2023-10-10 13:56:24,893 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=381103.3333333333, ans=0.2 2023-10-10 13:56:26,849 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=381103.3333333333, ans=0.1 2023-10-10 13:56:38,575 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=381150.0, ans=0.2 2023-10-10 13:56:42,124 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=381150.0, ans=0.1 2023-10-10 13:56:48,075 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.84 vs. limit=15.0 2023-10-10 13:57:17,383 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=381336.6666666667, ans=0.125 2023-10-10 13:57:19,681 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=381336.6666666667, ans=0.0 2023-10-10 13:57:22,058 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=381336.6666666667, ans=0.125 2023-10-10 13:57:32,114 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=381383.3333333333, ans=0.0 2023-10-10 13:57:41,234 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.21 vs. limit=22.5 2023-10-10 13:57:49,462 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 13:58:07,049 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=381523.3333333333, ans=0.125 2023-10-10 13:58:07,260 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.66 vs. limit=15.0 2023-10-10 13:58:14,970 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.379e+02 1.676e+02 1.841e+02 2.050e+02 2.878e+02, threshold=3.683e+02, percent-clipped=0.0 2023-10-10 13:58:20,599 INFO [train.py:1031] (3/4) Epoch 6, batch 13500, loss[loss=0.2236, simple_loss=0.3066, pruned_loss=0.07027, over 16567.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.3069, pruned_loss=0.06883, over 32764084.46 frames. 
], batch size: 241, lr: 6.02e-03, grad_scale: 16.0 2023-10-10 13:58:26,434 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=381616.6666666667, ans=0.015 2023-10-10 13:58:26,588 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=381616.6666666667, ans=0.125 2023-10-10 13:58:33,827 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=381663.3333333333, ans=0.125 2023-10-10 13:58:35,146 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=381663.3333333333, ans=0.125 2023-10-10 13:58:37,914 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=381663.3333333333, ans=0.125 2023-10-10 13:58:38,057 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=381663.3333333333, ans=0.125 2023-10-10 13:59:03,746 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=381803.3333333333, ans=0.125 2023-10-10 13:59:06,926 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.30 vs. limit=15.0 2023-10-10 13:59:07,549 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=381803.3333333333, ans=0.0 2023-10-10 13:59:09,518 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=381803.3333333333, ans=0.125 2023-10-10 13:59:11,431 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=381803.3333333333, ans=0.125 2023-10-10 13:59:12,392 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=381803.3333333333, ans=0.2 2023-10-10 13:59:16,701 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=381850.0, ans=0.125 2023-10-10 13:59:26,276 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=381896.6666666667, ans=0.125 2023-10-10 13:59:59,360 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.361e+02 1.797e+02 1.967e+02 2.458e+02 3.442e+02, threshold=3.935e+02, percent-clipped=0.0 2023-10-10 14:00:25,033 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.73 vs. limit=22.5 2023-10-10 14:00:36,504 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.40 vs. limit=15.0 2023-10-10 14:00:37,780 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=382223.3333333333, ans=0.125 2023-10-10 14:01:27,297 INFO [train.py:1031] (3/4) Epoch 7, batch 0, loss[loss=0.207, simple_loss=0.2916, pruned_loss=0.06122, over 16108.00 frames. ], tot_loss[loss=0.207, simple_loss=0.2916, pruned_loss=0.06122, over 16108.00 frames. 
], batch size: 296, lr: 5.51e-03, grad_scale: 32.0
2023-10-10 14:01:27,298 INFO [train.py:1054] (3/4) Computing validation loss
2023-10-10 14:01:35,196 INFO [train.py:1063] (3/4) Epoch 7, validation: loss=0.2282, simple_loss=0.3154, pruned_loss=0.07055, over 1020973.00 frames.
2023-10-10 14:01:35,197 INFO [train.py:1064] (3/4) Maximum memory allocated so far is 16891MB
2023-10-10 14:01:39,580 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.43 vs. limit=15.0
2023-10-10 14:01:41,491 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.58 vs. limit=22.5
2023-10-10 14:01:45,102 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.97 vs. limit=10.0
2023-10-10 14:02:03,541 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=382433.3333333333, ans=0.0
2023-10-10 14:02:12,827 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.79 vs. limit=10.0
2023-10-10 14:02:18,497 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.341e+02 1.759e+02 1.982e+02 2.350e+02 4.264e+02, threshold=3.963e+02, percent-clipped=2.0
2023-10-10 14:02:27,714 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=382526.6666666667, ans=0.0
2023-10-10 14:02:44,672 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.05 vs. limit=6.0
2023-10-10 14:03:07,381 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=382713.3333333333, ans=0.0
2023-10-10 14:03:47,145 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=382853.3333333333, ans=0.0
2023-10-10 14:04:12,224 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.390e+02 1.683e+02 1.817e+02 2.092e+02 3.494e+02, threshold=3.634e+02, percent-clipped=0.0
2023-10-10 14:04:12,567 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=382993.3333333333, ans=0.125
2023-10-10 14:04:22,300 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=383040.0, ans=0.125
2023-10-10 14:04:37,260 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=383086.6666666667, ans=0.0
2023-10-10 14:04:44,803 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=383133.3333333333, ans=0.125
2023-10-10 14:04:59,735 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff2.min_abs, batch_count=383180.0, ans=0.1
2023-10-10 14:05:04,798 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=383226.6666666667, ans=0.0
2023-10-10 14:05:07,901 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.10 vs. limit=12.0
2023-10-10 14:05:53,522 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=383413.3333333333, ans=0.5
2023-10-10 14:05:53,685 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.79 vs. limit=15.0
2023-10-10 14:05:59,988 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=383413.3333333333, ans=0.1
2023-10-10 14:06:05,256 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.338e+02 1.685e+02 1.925e+02 2.168e+02 2.777e+02, threshold=3.850e+02, percent-clipped=0.0
2023-10-10 14:06:08,652 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=383460.0, ans=0.125
2023-10-10 14:06:11,516 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=383460.0, ans=0.1
2023-10-10 14:06:30,759 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.46 vs. limit=15.0
2023-10-10 14:06:35,790 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=383553.3333333333, ans=0.125
2023-10-10 14:06:56,768 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=383646.6666666667, ans=0.2
2023-10-10 14:06:59,824 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=383693.3333333333, ans=0.125
2023-10-10 14:07:07,026 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=383693.3333333333, ans=0.5
2023-10-10 14:07:26,610 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=383786.6666666667, ans=0.125
2023-10-10 14:07:31,221 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=383786.6666666667, ans=0.0
2023-10-10 14:07:42,753 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=383833.3333333333, ans=0.04949747468305833
2023-10-10 14:07:47,775 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=383880.0, ans=0.2
2023-10-10 14:07:57,651 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.407e+02 1.657e+02 1.844e+02 2.059e+02 3.367e+02, threshold=3.689e+02, percent-clipped=0.0
2023-10-10 14:07:59,834 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.00 vs. limit=15.0
2023-10-10 14:08:04,399 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=383926.6666666667, ans=0.0
2023-10-10 14:09:23,339 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=384300.0, ans=0.1
2023-10-10 14:09:31,262 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=384300.0, ans=0.0
2023-10-10 14:09:36,042 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.min_positive, batch_count=384346.6666666667, ans=0.05
2023-10-10 14:09:45,690 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.533e+02 1.740e+02 1.924e+02 2.228e+02 3.092e+02, threshold=3.847e+02, percent-clipped=0.0
2023-10-10 14:09:46,183 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=384393.3333333333, ans=0.0
2023-10-10 14:09:56,133 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=384393.3333333333, ans=0.09899494936611666
2023-10-10 14:10:10,785 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=384486.6666666667, ans=0.2
2023-10-10 14:10:14,371 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=384486.6666666667, ans=0.0
2023-10-10 14:10:24,702 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=384533.3333333333, ans=0.0
2023-10-10 14:10:30,502 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=384533.3333333333, ans=0.2
2023-10-10 14:10:32,702 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=384533.3333333333, ans=0.0
2023-10-10 14:10:37,536 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=384580.0, ans=0.05
2023-10-10 14:10:38,490 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=384580.0, ans=0.035
2023-10-10 14:10:39,512 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=384580.0, ans=0.2
2023-10-10 14:10:39,525 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=384580.0, ans=0.125
2023-10-10 14:10:49,273 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.77 vs. limit=22.5
2023-10-10 14:10:55,393 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=384626.6666666667, ans=0.125
2023-10-10 14:10:57,797 INFO [train.py:1031] (3/4) Epoch 7, batch 500, loss[loss=0.1952, simple_loss=0.2599, pruned_loss=0.06521, over 12955.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.3058, pruned_loss=0.06836, over 7279626.58 frames. ], batch size: 440, lr: 5.49e-03, grad_scale: 16.0
2023-10-10 14:11:13,565 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=384720.0, ans=0.125
2023-10-10 14:11:18,641 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.05 vs. limit=22.5
2023-10-10 14:11:44,573 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.466e+02 1.693e+02 1.917e+02 2.195e+02 3.145e+02, threshold=3.834e+02, percent-clipped=0.0
2023-10-10 14:11:45,897 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=384860.0, ans=0.1
2023-10-10 14:12:04,533 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=384906.6666666667, ans=0.5
2023-10-10 14:12:06,339 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-10 14:12:12,414 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=384953.3333333333, ans=0.0
2023-10-10 14:12:14,847 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=384953.3333333333, ans=0.1
2023-10-10 14:12:21,839 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=385000.0, ans=0.125
2023-10-10 14:12:29,882 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=385046.6666666667, ans=0.0
2023-10-10 14:12:37,042 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=385046.6666666667, ans=0.125
2023-10-10 14:12:59,559 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=385140.0, ans=0.0
2023-10-10 14:13:01,231 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=385140.0, ans=0.0
2023-10-10 14:13:10,698 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=385186.6666666667, ans=0.125
2023-10-10 14:13:11,735 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=385186.6666666667, ans=0.0
2023-10-10 14:13:13,355 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=385233.3333333333, ans=0.0
2023-10-10 14:13:17,624 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.84 vs. limit=6.0
2023-10-10 14:13:40,541 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.392e+02 1.808e+02 2.070e+02 2.377e+02 3.394e+02, threshold=4.139e+02, percent-clipped=0.0
2023-10-10 14:13:42,758 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.43 vs. limit=22.5
2023-10-10 14:13:50,645 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=385373.3333333333, ans=0.125
2023-10-10 14:13:51,479 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=385373.3333333333, ans=0.2
2023-10-10 14:13:51,984 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.25 vs. limit=15.0
2023-10-10 14:13:53,743 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=385373.3333333333, ans=0.0
2023-10-10 14:14:04,713 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.01 vs. limit=12.0
2023-10-10 14:14:16,004 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=385466.6666666667, ans=0.07
2023-10-10 14:14:38,561 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=385560.0, ans=0.0
2023-10-10 14:14:52,847 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=385606.6666666667, ans=0.2
2023-10-10 14:15:00,373 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.21 vs. limit=6.0
2023-10-10 14:15:03,360 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.40 vs. limit=10.0
2023-10-10 14:15:17,075 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=385700.0, ans=0.2
2023-10-10 14:15:21,820 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.79 vs. limit=6.0
2023-10-10 14:15:30,122 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=385746.6666666667, ans=0.125
2023-10-10 14:15:34,658 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.489e+02 1.768e+02 1.936e+02 2.120e+02 2.899e+02, threshold=3.872e+02, percent-clipped=0.0
2023-10-10 14:15:44,509 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.36 vs. limit=15.0
2023-10-10 14:15:53,666 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=385840.0, ans=0.2
2023-10-10 14:16:06,158 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=385933.3333333333, ans=10.0
2023-10-10 14:16:16,370 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=385933.3333333333, ans=0.0
2023-10-10 14:16:22,720 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.58 vs. limit=15.0
2023-10-10 14:16:49,830 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.38 vs. limit=15.0
2023-10-10 14:17:03,467 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=386166.6666666667, ans=0.125
2023-10-10 14:17:11,303 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=386166.6666666667, ans=0.025
2023-10-10 14:17:37,104 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.394e+02 1.718e+02 1.869e+02 2.020e+02 3.098e+02, threshold=3.739e+02, percent-clipped=0.0
2023-10-10 14:17:43,098 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=386260.0, ans=0.125
2023-10-10 14:17:56,140 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=386306.6666666667, ans=0.0
2023-10-10 14:18:03,008 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.07 vs. limit=15.0
2023-10-10 14:18:11,427 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=386400.0, ans=0.2
2023-10-10 14:18:21,860 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=386446.6666666667, ans=0.0
2023-10-10 14:18:27,952 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=386446.6666666667, ans=0.035
2023-10-10 14:18:30,676 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=386493.3333333333, ans=0.95
2023-10-10 14:18:40,742 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=386493.3333333333, ans=0.025
2023-10-10 14:18:47,150 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=386540.0, ans=0.125
2023-10-10 14:18:59,433 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=386586.6666666667, ans=0.1
2023-10-10 14:19:05,974 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=386633.3333333333, ans=0.0
2023-10-10 14:19:15,331 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.30 vs. limit=15.0
2023-10-10 14:19:30,264 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.260e+02 1.688e+02 1.839e+02 1.997e+02 3.239e+02, threshold=3.679e+02, percent-clipped=0.0
2023-10-10 14:19:39,778 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.37 vs. limit=15.0
2023-10-10 14:20:21,047 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=386913.3333333333, ans=0.125
2023-10-10 14:20:37,914 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.75 vs. limit=6.0
2023-10-10 14:20:39,978 INFO [train.py:1031] (3/4) Epoch 7, batch 1000, loss[loss=0.2246, simple_loss=0.3007, pruned_loss=0.07426, over 15864.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.3066, pruned_loss=0.06858, over 12907982.64 frames. ], batch size: 43, lr: 5.48e-03, grad_scale: 32.0
2023-10-10 14:20:55,464 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=387053.3333333333, ans=0.2
2023-10-10 14:21:14,719 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.23 vs. limit=15.0
2023-10-10 14:21:24,135 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.427e+02 1.648e+02 1.829e+02 2.078e+02 2.874e+02, threshold=3.658e+02, percent-clipped=0.0
2023-10-10 14:21:24,310 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=387193.3333333333, ans=0.1
2023-10-10 14:21:32,289 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=387193.3333333333, ans=0.025
2023-10-10 14:21:36,511 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=387240.0, ans=0.125
2023-10-10 14:21:49,350 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-10 14:21:51,366 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=387286.6666666667, ans=0.1
2023-10-10 14:21:57,273 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=387333.3333333333, ans=0.125
2023-10-10 14:22:14,320 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=387380.0, ans=0.0
2023-10-10 14:22:18,092 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.14 vs. limit=15.0
2023-10-10 14:22:21,221 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=387426.6666666667, ans=0.1
2023-10-10 14:22:36,544 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=387473.3333333333, ans=0.125
2023-10-10 14:22:42,561 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=387520.0, ans=0.0
2023-10-10 14:22:44,162 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=387520.0, ans=0.125
2023-10-10 14:22:46,673 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=387520.0, ans=0.0
2023-10-10 14:22:56,668 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=387566.6666666667, ans=0.0
2023-10-10 14:23:18,906 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=387660.0, ans=0.2
2023-10-10 14:23:20,530 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.461e+02 1.778e+02 1.984e+02 2.177e+02 2.907e+02, threshold=3.968e+02, percent-clipped=0.0
2023-10-10 14:23:20,890 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=387660.0, ans=0.2
2023-10-10 14:23:26,779 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=387660.0, ans=0.0
2023-10-10 14:23:53,040 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=387753.3333333333, ans=0.125
2023-10-10 14:24:04,305 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=387800.0, ans=0.125
2023-10-10 14:24:28,524 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=387893.3333333333, ans=0.125
2023-10-10 14:24:34,581 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=387940.0, ans=0.125
2023-10-10 14:24:45,398 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=387986.6666666667, ans=0.1
2023-10-10 14:24:48,804 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.00 vs. limit=10.0
2023-10-10 14:24:51,979 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=387986.6666666667, ans=0.125
2023-10-10 14:24:54,283 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.76 vs. limit=15.0
2023-10-10 14:25:06,640 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=388080.0, ans=0.2
2023-10-10 14:25:19,438 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.277e+02 1.667e+02 1.871e+02 2.075e+02 3.092e+02, threshold=3.742e+02, percent-clipped=0.0
2023-10-10 14:25:27,451 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=388126.6666666667, ans=0.05
2023-10-10 14:26:00,550 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=388266.6666666667, ans=0.2
2023-10-10 14:26:27,896 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=388406.6666666667, ans=0.125
2023-10-10 14:26:40,277 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.62 vs. limit=12.0
2023-10-10 14:26:44,143 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=388453.3333333333, ans=0.035
2023-10-10 14:26:45,894 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.46 vs. limit=15.0
2023-10-10 14:26:47,926 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=2.365e-02
2023-10-10 14:27:01,854 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=388546.6666666667, ans=0.125
2023-10-10 14:27:13,599 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.354e+02 1.757e+02 1.959e+02 2.242e+02 3.249e+02, threshold=3.917e+02, percent-clipped=0.0
2023-10-10 14:27:19,268 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=388593.3333333333, ans=0.1
2023-10-10 14:27:22,161 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=388640.0, ans=0.0
2023-10-10 14:27:26,440 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.13 vs. limit=12.0
2023-10-10 14:28:11,849 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=388826.6666666667, ans=0.0
2023-10-10 14:28:12,609 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=388826.6666666667, ans=0.2
2023-10-10 14:28:30,960 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=388920.0, ans=0.125
2023-10-10 14:28:31,973 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-10 14:28:45,236 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.33 vs. limit=15.0
2023-10-10 14:28:55,320 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=389013.3333333333, ans=0.0
2023-10-10 14:29:03,338 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=389013.3333333333, ans=0.125
2023-10-10 14:29:10,551 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.353e+02 1.680e+02 1.864e+02 2.185e+02 3.818e+02, threshold=3.728e+02, percent-clipped=0.0
2023-10-10 14:29:15,114 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=389060.0, ans=0.125
2023-10-10 14:29:19,834 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=389106.6666666667, ans=0.04949747468305833
2023-10-10 14:29:21,454 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-10 14:29:33,485 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=389153.3333333333, ans=0.125
2023-10-10 14:29:43,886 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=389153.3333333333, ans=0.0
2023-10-10 14:30:02,088 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=389246.6666666667, ans=0.0
2023-10-10 14:30:04,281 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=9.10 vs. limit=15.0
2023-10-10 14:30:15,538 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.28 vs. limit=15.0
2023-10-10 14:30:21,063 INFO [train.py:1031] (3/4) Epoch 7, batch 1500, loss[loss=0.2297, simple_loss=0.3158, pruned_loss=0.07185, over 16830.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.3044, pruned_loss=0.06734, over 17307365.39 frames. ], batch size: 146, lr: 5.46e-03, grad_scale: 16.0
2023-10-10 14:30:25,933 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=389340.0, ans=0.0
2023-10-10 14:30:28,497 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=389340.0, ans=0.125
2023-10-10 14:30:35,292 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=389386.6666666667, ans=0.1
2023-10-10 14:31:06,928 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.70 vs. limit=10.0
2023-10-10 14:31:12,082 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.454e+02 1.713e+02 1.914e+02 2.273e+02 3.600e+02, threshold=3.827e+02, percent-clipped=0.0
2023-10-10 14:31:36,353 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.09 vs. limit=15.0
2023-10-10 14:31:48,886 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=14.42 vs. limit=15.0
2023-10-10 14:32:03,967 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=389713.3333333333, ans=0.0
2023-10-10 14:32:04,831 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=389713.3333333333, ans=0.125
2023-10-10 14:32:07,300 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=389713.3333333333, ans=0.125
2023-10-10 14:32:40,248 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=389853.3333333333, ans=0.125
2023-10-10 14:33:00,739 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.46 vs. limit=15.0
2023-10-10 14:33:10,983 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=7.64 vs. limit=15.0
2023-10-10 14:33:11,032 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.61 vs. limit=10.0
2023-10-10 14:33:11,385 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.366e+02 1.690e+02 1.920e+02 2.313e+02 3.796e+02, threshold=3.840e+02, percent-clipped=0.0
2023-10-10 14:33:12,762 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=389993.3333333333, ans=0.125
2023-10-10 14:33:40,610 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-10 14:34:13,363 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=390273.3333333333, ans=0.125
2023-10-10 14:34:25,206 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=390320.0, ans=0.125
2023-10-10 14:34:29,300 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.93 vs. limit=15.0
2023-10-10 14:34:46,968 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=390413.3333333333, ans=0.1
2023-10-10 14:34:57,840 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.400e+02 1.656e+02 1.811e+02 1.971e+02 2.533e+02, threshold=3.623e+02, percent-clipped=0.0
2023-10-10 14:35:02,771 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=390460.0, ans=0.1
2023-10-10 14:35:18,329 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=390506.6666666667, ans=0.125
2023-10-10 14:35:25,601 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=390553.3333333333, ans=0.125
2023-10-10 14:35:38,323 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.57 vs. limit=15.0
2023-10-10 14:35:40,145 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=390600.0, ans=0.125
2023-10-10 14:35:42,147 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-10 14:35:58,730 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=390693.3333333333, ans=0.125
2023-10-10 14:36:07,077 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=390740.0, ans=0.0
2023-10-10 14:36:08,382 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.77 vs. limit=6.0
2023-10-10 14:36:18,709 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=9.68 vs. limit=15.0
2023-10-10 14:36:21,923 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=390786.6666666667, ans=0.125
2023-10-10 14:36:47,737 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=390880.0, ans=0.125
2023-10-10 14:36:48,524 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=390880.0, ans=0.125
2023-10-10 14:36:53,779 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.299e+02 1.596e+02 1.760e+02 2.051e+02 3.571e+02, threshold=3.520e+02, percent-clipped=0.0
2023-10-10 14:37:15,103 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.86 vs. limit=15.0
2023-10-10 14:37:19,536 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=391020.0, ans=0.125
2023-10-10 14:37:20,602 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=391020.0, ans=0.0
2023-10-10 14:37:22,167 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=391020.0, ans=0.125
2023-10-10 14:37:23,259 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=391020.0, ans=0.125
2023-10-10 14:37:25,399 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=391066.6666666667, ans=0.0
2023-10-10 14:37:25,654 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.06 vs. limit=22.5
2023-10-10 14:38:25,008 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=391300.0, ans=0.125
2023-10-10 14:38:28,883 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.73 vs. limit=15.0
2023-10-10 14:38:47,027 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.389e+02 1.764e+02 1.959e+02 2.185e+02 3.012e+02, threshold=3.919e+02, percent-clipped=0.0
2023-10-10 14:38:49,091 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=391393.3333333333, ans=0.0
2023-10-10 14:39:40,607 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00
2023-10-10 14:39:41,660 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.58 vs. limit=15.0
2023-10-10 14:39:46,192 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=391580.0, ans=0.125
2023-10-10 14:39:52,681 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=391626.6666666667, ans=0.125
2023-10-10 14:40:04,721 INFO [train.py:1031] (3/4) Epoch 7, batch 2000, loss[loss=0.224, simple_loss=0.3181, pruned_loss=0.06497, over 17022.00 frames. ], tot_loss[loss=0.22, simple_loss=0.3052, pruned_loss=0.06741, over 20754103.77 frames. ], batch size: 77, lr: 5.44e-03, grad_scale: 32.0
2023-10-10 14:40:15,550 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=391673.3333333333, ans=0.2
2023-10-10 14:40:54,164 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=391813.3333333333, ans=0.1
2023-10-10 14:40:54,238 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=391813.3333333333, ans=0.0
2023-10-10 14:40:55,578 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=391813.3333333333, ans=0.0
2023-10-10 14:41:04,655 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.394e+02 1.695e+02 1.890e+02 2.124e+02 3.384e+02, threshold=3.780e+02, percent-clipped=0.0
2023-10-10 14:41:31,903 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=391953.3333333333, ans=0.125
2023-10-10 14:41:43,851 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=392000.0, ans=0.125
2023-10-10 14:41:46,176 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=392000.0, ans=0.125
2023-10-10 14:42:00,663 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=392046.6666666667, ans=0.125
2023-10-10 14:42:18,078 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=392140.0, ans=0.125
2023-10-10 14:42:23,321 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=392140.0, ans=0.125
2023-10-10 14:42:29,384 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=392140.0, ans=0.1
2023-10-10 14:43:12,451 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=392233.3333333333, ans=0.125
2023-10-10 14:43:18,403 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.71 vs. limit=12.0
2023-10-10 14:43:24,494 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=392280.0, ans=10.0
2023-10-10 14:43:26,587 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00
2023-10-10 14:43:27,654 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=392326.6666666667, ans=0.125
2023-10-10 14:43:29,419 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.322e+02 1.598e+02 1.792e+02 2.086e+02 3.103e+02, threshold=3.585e+02, percent-clipped=0.0
2023-10-10 14:44:12,307 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=392466.6666666667, ans=0.1
2023-10-10 14:44:19,378 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=392466.6666666667, ans=0.2
2023-10-10 14:44:28,122 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=392513.3333333333, ans=0.0
2023-10-10 14:44:29,193 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=392513.3333333333, ans=0.05
2023-10-10 14:44:55,015 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=392606.6666666667, ans=0.125
2023-10-10 14:44:55,864 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=392606.6666666667, ans=0.125
2023-10-10 14:45:02,542 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=392653.3333333333, ans=0.125
2023-10-10 14:45:08,790 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.44 vs. limit=6.0
2023-10-10 14:45:12,632 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.60 vs. limit=15.0
2023-10-10 14:45:18,380 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=392700.0, ans=0.125
2023-10-10 14:45:23,454 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2023-10-10 14:45:23,517 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=392700.0, ans=0.1
2023-10-10 14:45:26,000 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=392746.6666666667, ans=0.0
2023-10-10 14:45:36,374 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=392746.6666666667, ans=0.125
2023-10-10 14:45:39,677 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.329e+02 1.731e+02 1.921e+02 2.161e+02 3.047e+02, threshold=3.842e+02, percent-clipped=0.0
2023-10-10 14:45:39,996 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=392793.3333333333, ans=0.1
2023-10-10 14:45:53,887 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=392840.0, ans=0.125
2023-10-10 14:46:00,151 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_ff2.min_abs, batch_count=392886.6666666667, ans=0.1
2023-10-10 14:46:04,851 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=392886.6666666667, ans=0.07
2023-10-10 14:46:21,159 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=392980.0, ans=0.0
2023-10-10 14:46:28,628 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.03 vs. limit=22.5
2023-10-10 14:46:36,304 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=393026.6666666667, ans=0.125
2023-10-10 14:46:48,307 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-10 14:46:50,099 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=393073.3333333333, ans=0.2
2023-10-10 14:47:02,210 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.92 vs. limit=15.0
2023-10-10 14:47:21,610 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.40 vs. limit=15.0
2023-10-10 14:47:29,962 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=393213.3333333333, ans=0.1
2023-10-10 14:47:38,513 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.428e+02 1.753e+02 1.915e+02 2.104e+02 2.840e+02, threshold=3.830e+02, percent-clipped=0.0
2023-10-10 14:47:42,353 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=393260.0, ans=0.125
2023-10-10 14:47:56,596 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=393306.6666666667, ans=0.125
2023-10-10 14:48:28,315 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=393446.6666666667, ans=0.125
2023-10-10 14:48:31,712 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=393493.3333333333, ans=0.1
2023-10-10 14:48:33,025 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.23 vs. limit=22.5
2023-10-10 14:48:37,205 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=393493.3333333333, ans=0.2
2023-10-10 14:48:45,505 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=393540.0, ans=10.0
2023-10-10 14:48:49,287 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=393540.0, ans=0.2
2023-10-10 14:48:50,055 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=393540.0, ans=0.0
2023-10-10 14:49:32,796 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.521e+02 1.743e+02 1.943e+02 2.270e+02 3.085e+02, threshold=3.886e+02, percent-clipped=0.0
2023-10-10 14:49:35,686 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=393726.6666666667, ans=0.0
2023-10-10 14:50:01,918 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.46 vs. limit=6.0
2023-10-10 14:50:02,770 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.37 vs. limit=15.0
2023-10-10 14:50:11,379 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=393913.3333333333, ans=0.1
2023-10-10 14:50:11,573 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=393913.3333333333, ans=0.1
2023-10-10 14:50:17,321 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=393913.3333333333, ans=0.0
2023-10-10 14:50:26,746 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.94 vs. limit=15.0
2023-10-10 14:50:33,208 INFO [train.py:1031] (3/4) Epoch 7, batch 2500, loss[loss=0.2004, simple_loss=0.2923, pruned_loss=0.0543, over 16939.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.3055, pruned_loss=0.06754, over 23426532.76 frames. ], batch size: 93, lr: 5.43e-03, grad_scale: 32.0
2023-10-10 14:50:46,051 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.87 vs. limit=15.0
2023-10-10 14:50:47,781 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=394053.3333333333, ans=0.125
2023-10-10 14:51:19,043 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.374e+02 1.742e+02 1.915e+02 2.111e+02 3.351e+02, threshold=3.830e+02, percent-clipped=0.0
2023-10-10 14:51:30,947 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=394240.0, ans=0.125
2023-10-10 14:51:34,637 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=394240.0, ans=0.125
2023-10-10 14:51:35,483 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=394286.6666666667, ans=0.0
2023-10-10 14:51:35,506 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=394286.6666666667, ans=0.125
2023-10-10 14:51:37,420 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=394286.6666666667, ans=0.125
2023-10-10 14:51:59,275 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=394380.0, ans=0.0
2023-10-10 14:51:59,323 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=394380.0, ans=0.95
2023-10-10 14:53:01,839 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-10 14:53:01,952 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=394613.3333333333, ans=0.2
2023-10-10 14:53:03,455 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=394613.3333333333, ans=0.1
2023-10-10 14:53:06,115 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=394660.0, ans=0.1
2023-10-10 14:53:09,040 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.33 vs. limit=22.5
2023-10-10 14:53:09,246 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.383e+02 1.804e+02 1.956e+02 2.373e+02 3.236e+02, threshold=3.912e+02, percent-clipped=0.0
2023-10-10 14:53:15,430 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=394706.6666666667, ans=0.125
2023-10-10 14:53:40,044 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=394800.0, ans=0.125
2023-10-10 14:53:40,062 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=394800.0, ans=0.0
2023-10-10 14:53:40,985 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=394800.0, ans=0.1
2023-10-10 14:53:45,187 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=394800.0, ans=0.0
2023-10-10 14:53:51,880 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.85 vs. limit=6.0
2023-10-10 14:53:57,343 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.41 vs. limit=15.0
2023-10-10 14:54:04,656 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=394893.3333333333, ans=0.125
2023-10-10 14:54:25,581 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=394986.6666666667, ans=0.07
2023-10-10 14:54:34,703 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=394986.6666666667, ans=0.125
2023-10-10 14:54:44,047 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=395033.3333333333, ans=0.125
2023-10-10 14:55:03,458 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.364e+02 1.723e+02 1.939e+02 2.192e+02 3.740e+02, threshold=3.879e+02, percent-clipped=0.0
2023-10-10 14:55:03,911 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=395126.6666666667, ans=0.125
2023-10-10 14:55:43,463 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=395266.6666666667, ans=0.09899494936611666
2023-10-10 14:55:52,915 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=395313.3333333333, ans=0.0
2023-10-10 14:56:18,784 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=395406.6666666667, ans=0.125
2023-10-10 14:56:26,571 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=395406.6666666667, ans=0.125
2023-10-10 14:56:27,465 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=395406.6666666667, ans=0.0
2023-10-10 14:56:29,347 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.67 vs. limit=22.5
2023-10-10 14:56:49,982 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=395500.0, ans=0.125
2023-10-10 14:57:11,827 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.327e+02 1.631e+02 1.896e+02 2.111e+02 3.244e+02, threshold=3.792e+02, percent-clipped=0.0
2023-10-10 14:57:13,028 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.86 vs. limit=6.0
2023-10-10 14:57:17,085 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=395593.3333333333, ans=0.125
2023-10-10 14:57:21,357 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=395640.0, ans=0.0
2023-10-10 14:57:33,205 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=395686.6666666667, ans=0.1
2023-10-10 14:57:57,297 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=395780.0, ans=0.125
2023-10-10 14:58:01,239 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=395780.0, ans=0.2
2023-10-10 14:58:04,143 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-10 14:58:38,120 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=395920.0, ans=0.1
2023-10-10 14:58:47,235 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.10 vs. limit=15.0
2023-10-10 14:59:00,975 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.78 vs. limit=15.0
2023-10-10 14:59:15,851 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.431e+02 1.654e+02 1.915e+02 2.150e+02 3.116e+02, threshold=3.831e+02, percent-clipped=0.0
2023-10-10 14:59:16,451 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.86 vs. limit=15.0
2023-10-10 14:59:18,946 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=396060.0, ans=0.1
2023-10-10 14:59:19,917 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-10 14:59:26,236 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-10 15:00:04,478 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=396246.6666666667, ans=0.125
2023-10-10 15:00:16,812 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-10 15:00:16,844 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=396340.0, ans=0.2
2023-10-10 15:00:17,544 INFO [train.py:1031] (3/4) Epoch 7, batch 3000, loss[loss=0.2183, simple_loss=0.3076, pruned_loss=0.06443, over 16852.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.3047, pruned_loss=0.06753, over 25499464.78 frames. ], batch size: 175, lr: 5.41e-03, grad_scale: 16.0
2023-10-10 15:00:19,331 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=396340.0, ans=0.125
2023-10-10 15:00:40,019 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.38 vs. limit=15.0
2023-10-10 15:00:40,740 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=396433.3333333333, ans=0.125
2023-10-10 15:01:09,658 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.446e+02 1.764e+02 2.003e+02 2.343e+02 3.849e+02, threshold=4.007e+02, percent-clipped=1.0
2023-10-10 15:01:09,976 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=396526.6666666667, ans=0.0
2023-10-10 15:01:12,388 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=396526.6666666667, ans=0.0
2023-10-10 15:01:13,808 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.93 vs. limit=15.0
2023-10-10 15:01:22,676 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.79 vs. limit=22.5
2023-10-10 15:01:35,967 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=396620.0, ans=0.0
2023-10-10 15:01:40,420 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-10 15:01:44,124 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=396666.6666666667, ans=0.0
2023-10-10 15:01:45,861 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=396666.6666666667, ans=0.125
2023-10-10 15:02:26,857 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=396853.3333333333, ans=0.0
2023-10-10 15:02:41,907 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.64 vs. limit=22.5
2023-10-10 15:02:48,310 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.44 vs. limit=22.5
2023-10-10 15:02:48,978 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=396900.0, ans=0.1
2023-10-10 15:03:01,374 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.25 vs. limit=22.5
2023-10-10 15:03:08,330 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.358e+02 1.711e+02 1.845e+02 2.157e+02 2.806e+02, threshold=3.689e+02, percent-clipped=0.0
2023-10-10 15:03:54,069 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=397180.0, ans=0.125
2023-10-10 15:05:01,064 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=397413.3333333333, ans=0.1
2023-10-10 15:05:09,974 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.317e+02 1.755e+02 2.084e+02 2.367e+02 3.510e+02, threshold=4.168e+02, percent-clipped=0.0
2023-10-10 15:05:13,530 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=397460.0, ans=0.125
2023-10-10 15:05:13,852 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.49 vs. limit=15.0
2023-10-10 15:05:15,823 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-10 15:05:20,878 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=397506.6666666667, ans=0.1
2023-10-10 15:05:27,529 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=397506.6666666667, ans=0.2
2023-10-10 15:05:30,711 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=397506.6666666667, ans=10.0
2023-10-10 15:05:41,787 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=397553.3333333333, ans=0.125
2023-10-10 15:05:48,944 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=397600.0, ans=0.125
2023-10-10 15:05:53,736 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=397600.0, ans=0.125
2023-10-10 15:05:53,899 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.whiten.whitening_limit, batch_count=397600.0, ans=12.0
2023-10-10 15:06:06,525 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.18 vs. limit=12.0
2023-10-10 15:06:09,205 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=397646.6666666667, ans=0.125
2023-10-10 15:06:14,018 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=397693.3333333333, ans=0.125
2023-10-10 15:06:29,130 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=397740.0, ans=0.1
2023-10-10 15:06:43,046 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=397786.6666666667, ans=0.025
2023-10-10 15:06:54,555 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=397833.3333333333, ans=0.5
2023-10-10 15:07:02,255 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.37 vs. limit=12.0
2023-10-10 15:07:18,673 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.392e+02 1.695e+02 1.892e+02 2.115e+02 2.615e+02, threshold=3.784e+02, percent-clipped=0.0
2023-10-10 15:07:23,702 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=397973.3333333333, ans=0.0
2023-10-10 15:07:27,928 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=397973.3333333333, ans=0.125
2023-10-10 15:07:36,228 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=398020.0, ans=0.125
2023-10-10 15:07:40,534 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.88 vs. limit=15.0
2023-10-10 15:08:14,634 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=398160.0, ans=0.125
2023-10-10 15:09:12,721 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.34 vs. limit=15.0
2023-10-10 15:09:19,220 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.359e+02 1.669e+02 1.829e+02 2.198e+02 3.544e+02, threshold=3.658e+02, percent-clipped=0.0
2023-10-10 15:09:35,603 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=398486.6666666667, ans=0.125
2023-10-10 15:09:45,633 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.99 vs. limit=22.5
2023-10-10 15:10:15,981 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=398626.6666666667, ans=0.1
2023-10-10 15:10:25,406 INFO [train.py:1031] (3/4) Epoch 7, batch 3500, loss[loss=0.2152, simple_loss=0.3007, pruned_loss=0.06489, over 16642.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.3044, pruned_loss=0.0675, over 27127986.47 frames.
], batch size: 61, lr: 5.40e-03, grad_scale: 32.0 2023-10-10 15:10:25,690 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=398673.3333333333, ans=0.1 2023-10-10 15:10:28,427 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=398673.3333333333, ans=0.125 2023-10-10 15:10:50,002 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.91 vs. limit=15.0 2023-10-10 15:11:14,576 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.422e+02 1.691e+02 1.861e+02 2.108e+02 3.840e+02, threshold=3.722e+02, percent-clipped=1.0 2023-10-10 15:11:48,779 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.92 vs. limit=15.0 2023-10-10 15:11:59,086 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=399000.0, ans=0.125 2023-10-10 15:12:05,165 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=399046.6666666667, ans=6.0 2023-10-10 15:12:19,440 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=399093.3333333333, ans=0.125 2023-10-10 15:12:30,955 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=399140.0, ans=0.2 2023-10-10 15:12:48,006 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=399186.6666666667, ans=0.2 2023-10-10 15:12:48,033 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=399186.6666666667, ans=0.125 2023-10-10 15:13:01,849 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=399233.3333333333, ans=0.0 2023-10-10 15:13:03,028 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=399233.3333333333, ans=0.125 2023-10-10 15:13:06,764 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=399280.0, ans=0.125 2023-10-10 15:13:25,898 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.370e+02 1.652e+02 1.838e+02 2.072e+02 2.985e+02, threshold=3.677e+02, percent-clipped=0.0 2023-10-10 15:13:51,928 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=399420.0, ans=0.2 2023-10-10 15:13:56,133 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=399466.6666666667, ans=0.125 2023-10-10 15:14:09,065 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=399513.3333333333, ans=0.0 2023-10-10 15:14:17,195 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=399513.3333333333, ans=0.125 2023-10-10 15:14:24,546 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=399560.0, ans=0.09899494936611666 2023-10-10 
15:14:40,693 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=399606.6666666667, ans=0.125 2023-10-10 15:14:44,782 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=399653.3333333333, ans=0.07 2023-10-10 15:15:29,179 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.269e+02 1.608e+02 1.726e+02 1.918e+02 2.632e+02, threshold=3.452e+02, percent-clipped=0.0 2023-10-10 15:15:55,489 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=399886.6666666667, ans=0.125 2023-10-10 15:16:00,139 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.69 vs. limit=6.0 2023-10-10 15:16:13,444 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.71 vs. limit=15.0 2023-10-10 15:16:18,114 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.18 vs. limit=15.0 2023-10-10 15:16:22,050 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=400026.6666666667, ans=0.125 2023-10-10 15:16:32,553 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=400073.3333333333, ans=0.1 2023-10-10 15:16:36,371 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=400073.3333333333, ans=0.0 2023-10-10 15:17:19,786 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.81 vs. limit=10.0 2023-10-10 15:17:22,936 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=400260.0, ans=0.0 2023-10-10 15:17:25,078 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.358e+02 1.660e+02 1.879e+02 2.069e+02 3.139e+02, threshold=3.758e+02, percent-clipped=0.0 2023-10-10 15:17:57,460 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=400400.0, ans=0.0 2023-10-10 15:18:03,193 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.29 vs. limit=15.0 2023-10-10 15:18:14,687 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=400493.3333333333, ans=0.125 2023-10-10 15:18:53,683 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=11.65 vs. limit=22.5 2023-10-10 15:18:58,279 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.62 vs. limit=8.0 2023-10-10 15:19:13,120 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=6.44 vs. 
limit=15.0 2023-10-10 15:19:15,739 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.359e+02 1.617e+02 1.788e+02 2.094e+02 2.634e+02, threshold=3.576e+02, percent-clipped=0.0 2023-10-10 15:19:16,966 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=400726.6666666667, ans=0.1 2023-10-10 15:19:36,538 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=400820.0, ans=0.1 2023-10-10 15:20:01,089 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=400913.3333333333, ans=0.04949747468305833 2023-10-10 15:20:01,953 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=400913.3333333333, ans=0.125 2023-10-10 15:20:07,137 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=400913.3333333333, ans=0.125 2023-10-10 15:20:22,423 INFO [train.py:1031] (3/4) Epoch 7, batch 4000, loss[loss=0.2137, simple_loss=0.3026, pruned_loss=0.06237, over 16796.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.3038, pruned_loss=0.06745, over 28380879.40 frames. ], batch size: 146, lr: 5.38e-03, grad_scale: 32.0 2023-10-10 15:20:41,972 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.00 vs. limit=15.0 2023-10-10 15:20:44,757 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=401053.3333333333, ans=0.125 2023-10-10 15:20:51,886 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=401100.0, ans=0.125 2023-10-10 15:21:05,700 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=401146.6666666667, ans=0.125 2023-10-10 15:21:07,985 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=401146.6666666667, ans=0.025 2023-10-10 15:21:18,552 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.330e+02 1.801e+02 2.071e+02 2.423e+02 4.077e+02, threshold=4.142e+02, percent-clipped=2.0 2023-10-10 15:21:18,948 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=401193.3333333333, ans=0.05 2023-10-10 15:21:30,115 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=401240.0, ans=0.125 2023-10-10 15:21:55,394 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=401333.3333333333, ans=0.125 2023-10-10 15:21:56,459 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=401333.3333333333, ans=0.2 2023-10-10 15:22:03,812 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=401333.3333333333, ans=10.0 2023-10-10 15:22:24,225 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=401426.6666666667, ans=0.05 2023-10-10 15:22:28,617 INFO [scaling.py:199] (3/4) 
ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=401473.3333333333, ans=0.0 2023-10-10 15:22:29,038 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.56 vs. limit=15.0 2023-10-10 15:22:51,492 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=401566.6666666667, ans=0.0 2023-10-10 15:23:21,763 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.382e+02 1.844e+02 2.114e+02 2.388e+02 3.529e+02, threshold=4.228e+02, percent-clipped=0.0 2023-10-10 15:23:30,880 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.56 vs. limit=22.5 2023-10-10 15:23:34,803 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=401706.6666666667, ans=0.125 2023-10-10 15:23:41,944 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer_ff2.min_abs, batch_count=401753.3333333333, ans=0.1 2023-10-10 15:23:51,341 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=401753.3333333333, ans=0.0 2023-10-10 15:24:24,024 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=401846.6666666667, ans=0.0 2023-10-10 15:24:44,774 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=401940.0, ans=0.125 2023-10-10 15:24:48,648 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=401940.0, ans=0.125 2023-10-10 15:24:49,601 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=401940.0, ans=0.07 2023-10-10 15:24:54,230 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.81 vs. limit=22.5 2023-10-10 15:25:32,714 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.322e+02 1.692e+02 1.940e+02 2.216e+02 3.151e+02, threshold=3.881e+02, percent-clipped=0.0 2023-10-10 15:25:37,660 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.27 vs. 
limit=15.0 2023-10-10 15:25:48,685 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=402220.0, ans=0.0 2023-10-10 15:26:09,202 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=402313.3333333333, ans=0.1 2023-10-10 15:26:24,567 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=402360.0, ans=0.125 2023-10-10 15:26:34,134 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=402406.6666666667, ans=0.1 2023-10-10 15:26:36,766 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=402406.6666666667, ans=0.125 2023-10-10 15:26:46,732 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=402453.3333333333, ans=0.1 2023-10-10 15:26:51,930 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=402453.3333333333, ans=0.2 2023-10-10 15:26:53,163 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=402453.3333333333, ans=0.125 2023-10-10 15:26:54,301 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=7.88 vs. limit=15.0 2023-10-10 15:27:23,707 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.448e+02 1.796e+02 2.012e+02 2.328e+02 3.393e+02, threshold=4.024e+02, percent-clipped=0.0 2023-10-10 15:27:24,227 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.40 vs. limit=10.0 2023-10-10 15:27:28,628 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=402640.0, ans=0.0 2023-10-10 15:27:35,432 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=402640.0, ans=0.0 2023-10-10 15:27:41,316 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=402686.6666666667, ans=0.0 2023-10-10 15:28:06,834 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=402780.0, ans=0.0 2023-10-10 15:28:08,227 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.75 vs. 
limit=15.0 2023-10-10 15:28:14,696 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=402826.6666666667, ans=0.125 2023-10-10 15:28:18,836 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=402826.6666666667, ans=0.2 2023-10-10 15:28:25,823 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=402873.3333333333, ans=0.0 2023-10-10 15:28:29,756 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=402873.3333333333, ans=0.2 2023-10-10 15:28:36,742 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=402920.0, ans=0.1 2023-10-10 15:28:47,497 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.49 vs. limit=10.0 2023-10-10 15:29:02,245 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=403013.3333333333, ans=0.05 2023-10-10 15:29:29,035 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.416e+02 1.882e+02 2.133e+02 2.523e+02 3.546e+02, threshold=4.266e+02, percent-clipped=0.0 2023-10-10 15:29:54,641 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 15:30:16,492 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.06 vs. limit=15.0 2023-10-10 15:30:26,949 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.65 vs. limit=10.0 2023-10-10 15:30:29,207 INFO [train.py:1031] (3/4) Epoch 7, batch 4500, loss[loss=0.2144, simple_loss=0.2695, pruned_loss=0.07967, over 12761.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.304, pruned_loss=0.06726, over 29347079.67 frames. 
], batch size: 440, lr: 5.36e-03, grad_scale: 32.0 2023-10-10 15:30:34,942 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=403340.0, ans=0.125 2023-10-10 15:31:17,543 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=403526.6666666667, ans=0.125 2023-10-10 15:31:19,262 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.432e+02 1.716e+02 1.912e+02 2.204e+02 3.218e+02, threshold=3.824e+02, percent-clipped=0.0 2023-10-10 15:31:28,672 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=403573.3333333333, ans=0.125 2023-10-10 15:31:37,598 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=403620.0, ans=0.1 2023-10-10 15:31:48,123 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=403666.6666666667, ans=0.0 2023-10-10 15:31:54,540 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=403713.3333333333, ans=0.0 2023-10-10 15:32:13,330 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=403760.0, ans=0.125 2023-10-10 15:32:31,614 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=403853.3333333333, ans=0.0 2023-10-10 15:32:37,428 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=403900.0, ans=0.2 2023-10-10 15:33:02,168 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.40 vs. limit=15.0 2023-10-10 15:33:05,063 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.399e+02 1.731e+02 1.887e+02 2.187e+02 3.727e+02, threshold=3.774e+02, percent-clipped=0.0 2023-10-10 15:33:13,689 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=404040.0, ans=0.0 2023-10-10 15:33:22,504 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=404086.6666666667, ans=0.0 2023-10-10 15:33:24,861 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.91 vs. limit=22.5 2023-10-10 15:33:33,633 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.71 vs. limit=6.0 2023-10-10 15:33:39,484 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=404133.3333333333, ans=0.2 2023-10-10 15:34:04,846 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=404226.6666666667, ans=0.125 2023-10-10 15:34:15,238 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 15:34:21,588 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.98 vs. 
limit=15.0 2023-10-10 15:34:23,387 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.74 vs. limit=6.0 2023-10-10 15:34:36,272 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=404366.6666666667, ans=0.1 2023-10-10 15:34:41,717 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=404413.3333333333, ans=0.125 2023-10-10 15:34:54,837 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.410e+02 1.754e+02 1.949e+02 2.290e+02 3.545e+02, threshold=3.899e+02, percent-clipped=0.0 2023-10-10 15:35:13,114 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.60 vs. limit=10.0 2023-10-10 15:35:14,828 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=404553.3333333333, ans=10.0 2023-10-10 15:35:20,450 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=404600.0, ans=0.125 2023-10-10 15:35:32,357 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=404646.6666666667, ans=0.0 2023-10-10 15:35:36,254 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=404646.6666666667, ans=0.0 2023-10-10 15:35:47,871 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=404693.3333333333, ans=0.0 2023-10-10 15:35:56,269 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=404740.0, ans=0.0 2023-10-10 15:36:06,658 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=404786.6666666667, ans=0.1 2023-10-10 15:36:22,193 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=404833.3333333333, ans=10.0 2023-10-10 15:36:36,223 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=404880.0, ans=0.125 2023-10-10 15:36:41,946 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.73 vs. limit=6.0 2023-10-10 15:36:47,645 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=404926.6666666667, ans=0.0 2023-10-10 15:36:48,374 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.361e+02 1.738e+02 1.924e+02 2.160e+02 2.956e+02, threshold=3.849e+02, percent-clipped=0.0 2023-10-10 15:36:49,849 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=404926.6666666667, ans=0.2 2023-10-10 15:36:59,534 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 15:37:01,846 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.83 vs. 
limit=22.5 2023-10-10 15:37:02,501 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=405020.0, ans=0.125 2023-10-10 15:37:29,730 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=405113.3333333333, ans=0.2 2023-10-10 15:37:46,675 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=405206.6666666667, ans=0.125 2023-10-10 15:37:54,719 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=405206.6666666667, ans=0.1 2023-10-10 15:38:08,880 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=405253.3333333333, ans=0.125 2023-10-10 15:38:21,941 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=405300.0, ans=0.125 2023-10-10 15:38:43,499 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=405393.3333333333, ans=0.0 2023-10-10 15:38:44,939 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.406e+02 1.704e+02 1.998e+02 2.408e+02 3.552e+02, threshold=3.996e+02, percent-clipped=0.0 2023-10-10 15:38:59,762 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=405486.6666666667, ans=0.2 2023-10-10 15:39:04,929 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=405486.6666666667, ans=0.0 2023-10-10 15:39:08,860 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 15:39:08,869 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=405486.6666666667, ans=0.1 2023-10-10 15:39:12,788 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.80 vs. limit=15.0 2023-10-10 15:39:23,575 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=405580.0, ans=0.0 2023-10-10 15:39:44,384 INFO [train.py:1031] (3/4) Epoch 7, batch 5000, loss[loss=0.2321, simple_loss=0.3107, pruned_loss=0.07681, over 16641.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.304, pruned_loss=0.06732, over 30124431.38 frames. ], batch size: 241, lr: 5.35e-03, grad_scale: 32.0 2023-10-10 15:39:58,794 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=405720.0, ans=0.125 2023-10-10 15:40:00,092 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=405720.0, ans=0.1 2023-10-10 15:40:01,761 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.28 vs. 
limit=10.0 2023-10-10 15:40:16,485 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=405766.6666666667, ans=0.0 2023-10-10 15:40:21,679 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=405813.3333333333, ans=0.125 2023-10-10 15:40:36,633 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.460e+02 1.739e+02 1.937e+02 2.248e+02 3.255e+02, threshold=3.875e+02, percent-clipped=0.0 2023-10-10 15:40:40,483 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=405906.6666666667, ans=0.95 2023-10-10 15:40:41,379 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=405906.6666666667, ans=0.1 2023-10-10 15:40:42,674 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.74 vs. limit=15.0 2023-10-10 15:40:45,922 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=405906.6666666667, ans=0.0 2023-10-10 15:40:46,845 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=405906.6666666667, ans=0.1 2023-10-10 15:40:48,632 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.66 vs. limit=15.0 2023-10-10 15:42:11,787 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=406233.3333333333, ans=0.0 2023-10-10 15:42:15,967 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.74 vs. limit=15.0 2023-10-10 15:42:21,278 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=406280.0, ans=0.0 2023-10-10 15:42:32,003 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.447e+02 1.795e+02 1.945e+02 2.585e+02 4.159e+02, threshold=3.891e+02, percent-clipped=7.0 2023-10-10 15:42:34,364 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=406326.6666666667, ans=0.125 2023-10-10 15:42:49,354 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=406420.0, ans=0.0 2023-10-10 15:42:56,542 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=406466.6666666667, ans=0.125 2023-10-10 15:43:12,281 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=406513.3333333333, ans=0.07 2023-10-10 15:43:16,994 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=406513.3333333333, ans=0.0 2023-10-10 15:43:33,473 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.56 vs. 
limit=6.0 2023-10-10 15:43:36,475 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=406606.6666666667, ans=0.09899494936611666 2023-10-10 15:43:47,830 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.30 vs. limit=15.0 2023-10-10 15:44:02,720 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.97 vs. limit=15.0 2023-10-10 15:44:06,650 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.92 vs. limit=6.0 2023-10-10 15:44:09,175 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=406793.3333333333, ans=0.0 2023-10-10 15:44:11,562 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=15.05 vs. limit=15.0 2023-10-10 15:44:16,786 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.397e+02 1.742e+02 1.938e+02 2.206e+02 2.945e+02, threshold=3.877e+02, percent-clipped=0.0 2023-10-10 15:44:18,237 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.13 vs. limit=15.0 2023-10-10 15:44:45,074 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=406886.6666666667, ans=0.1 2023-10-10 15:44:56,760 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.08 vs. 
limit=15.0 2023-10-10 15:45:08,572 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=406980.0, ans=0.125 2023-10-10 15:45:18,666 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=407026.6666666667, ans=0.125 2023-10-10 15:45:18,709 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=407026.6666666667, ans=0.0 2023-10-10 15:45:30,569 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 15:45:35,559 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=407120.0, ans=0.0 2023-10-10 15:45:45,442 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=407166.6666666667, ans=0.07 2023-10-10 15:46:06,150 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=407213.3333333333, ans=0.09899494936611666 2023-10-10 15:46:14,224 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=407260.0, ans=0.2 2023-10-10 15:46:18,296 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.388e+02 1.733e+02 1.916e+02 2.188e+02 3.258e+02, threshold=3.832e+02, percent-clipped=0.0 2023-10-10 15:46:27,918 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=407306.6666666667, ans=0.2 2023-10-10 15:46:36,032 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=407353.3333333333, ans=0.2 2023-10-10 15:46:40,267 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 15:46:41,989 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=407353.3333333333, ans=0.0 2023-10-10 15:46:53,255 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=407400.0, ans=0.2 2023-10-10 15:47:14,333 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=407493.3333333333, ans=0.125 2023-10-10 15:47:23,070 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=407540.0, ans=0.04949747468305833 2023-10-10 15:47:52,735 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=407680.0, ans=0.125 2023-10-10 15:47:56,464 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=407680.0, ans=0.0 2023-10-10 15:48:01,040 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=407726.6666666667, ans=0.125 2023-10-10 15:48:07,960 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.358e+02 1.642e+02 1.810e+02 2.061e+02 2.865e+02, threshold=3.620e+02, percent-clipped=0.0 2023-10-10 15:48:16,438 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=407773.3333333333, ans=0.025 2023-10-10 15:48:25,657 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=407820.0, ans=0.1 2023-10-10 15:48:26,638 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=407820.0, ans=0.1 2023-10-10 15:48:40,475 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=407866.6666666667, ans=0.0 2023-10-10 15:48:56,803 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=407960.0, ans=0.125 2023-10-10 15:49:02,478 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=407960.0, ans=0.125 2023-10-10 15:49:06,006 INFO [train.py:1031] (3/4) Epoch 7, batch 5500, loss[loss=0.2031, simple_loss=0.2924, pruned_loss=0.05689, over 16849.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.3037, pruned_loss=0.06708, over 30720065.67 frames. ], batch size: 87, lr: 5.33e-03, grad_scale: 16.0 2023-10-10 15:49:11,485 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=408006.6666666667, ans=0.125 2023-10-10 15:49:18,595 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=408053.3333333333, ans=0.0 2023-10-10 15:49:22,819 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=408053.3333333333, ans=0.1 2023-10-10 15:49:26,197 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.27 vs. 
limit=22.5 2023-10-10 15:49:43,383 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=408146.6666666667, ans=0.2 2023-10-10 15:49:55,019 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.390e+02 1.711e+02 1.889e+02 2.163e+02 3.768e+02, threshold=3.778e+02, percent-clipped=1.0 2023-10-10 15:50:08,838 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=408286.6666666667, ans=0.2 2023-10-10 15:50:17,056 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=408286.6666666667, ans=0.125 2023-10-10 15:50:19,043 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 15:50:29,795 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=408380.0, ans=0.2 2023-10-10 15:50:41,674 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=408426.6666666667, ans=10.0 2023-10-10 15:50:42,668 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=408426.6666666667, ans=0.125 2023-10-10 15:50:44,940 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=408426.6666666667, ans=0.0 2023-10-10 15:50:54,143 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=408473.3333333333, ans=0.125 2023-10-10 15:51:14,733 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.82 vs. 
limit=6.0 2023-10-10 15:51:18,994 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=408566.6666666667, ans=0.125 2023-10-10 15:51:19,146 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=408566.6666666667, ans=0.2 2023-10-10 15:51:23,016 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=408566.6666666667, ans=0.125 2023-10-10 15:51:28,119 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=408613.3333333333, ans=0.2 2023-10-10 15:51:44,259 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.419e+02 1.648e+02 1.839e+02 2.057e+02 3.027e+02, threshold=3.678e+02, percent-clipped=0.0 2023-10-10 15:51:48,605 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=408706.6666666667, ans=0.125 2023-10-10 15:51:51,142 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=408706.6666666667, ans=0.125 2023-10-10 15:51:51,163 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=408706.6666666667, ans=0.125 2023-10-10 15:51:56,531 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=408706.6666666667, ans=0.0 2023-10-10 15:52:12,486 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=408800.0, ans=0.125 2023-10-10 15:52:46,001 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=408940.0, ans=0.125 2023-10-10 15:53:10,068 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.19 vs. limit=22.5 2023-10-10 15:53:37,574 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.453e+02 1.679e+02 1.861e+02 1.994e+02 2.762e+02, threshold=3.721e+02, percent-clipped=0.0 2023-10-10 15:54:12,333 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.76 vs. limit=15.0 2023-10-10 15:54:21,449 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=409313.3333333333, ans=0.2 2023-10-10 15:54:31,642 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=409360.0, ans=0.125 2023-10-10 15:54:39,256 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.81 vs. 
limit=6.0 2023-10-10 15:55:02,901 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=409500.0, ans=0.0 2023-10-10 15:55:23,601 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=409593.3333333333, ans=0.2 2023-10-10 15:55:31,886 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.421e+02 1.698e+02 1.887e+02 2.124e+02 4.294e+02, threshold=3.774e+02, percent-clipped=1.0 2023-10-10 15:56:15,623 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.10 vs. limit=15.0 2023-10-10 15:56:18,693 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=409780.0, ans=0.0 2023-10-10 15:56:39,942 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=409873.3333333333, ans=0.125 2023-10-10 15:56:40,846 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=409920.0, ans=0.125 2023-10-10 15:57:13,452 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=410013.3333333333, ans=0.5 2023-10-10 15:57:22,924 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.361e+02 1.749e+02 2.025e+02 2.331e+02 3.574e+02, threshold=4.049e+02, percent-clipped=0.0 2023-10-10 15:57:28,995 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=410106.6666666667, ans=0.0 2023-10-10 15:57:31,995 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=410106.6666666667, ans=0.0 2023-10-10 15:57:39,542 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=410153.3333333333, ans=0.025 2023-10-10 15:57:42,604 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=410153.3333333333, ans=0.125 2023-10-10 15:57:59,181 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=410246.6666666667, ans=0.0 2023-10-10 15:58:20,535 INFO [train.py:1031] (3/4) Epoch 7, batch 6000, loss[loss=0.2233, simple_loss=0.2817, pruned_loss=0.08244, over 12404.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.3039, pruned_loss=0.06717, over 31176835.06 frames. 
], batch size: 440, lr: 5.32e-03, grad_scale: 32.0 2023-10-10 15:58:28,389 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=410340.0, ans=0.05 2023-10-10 15:58:37,761 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=410386.6666666667, ans=0.0 2023-10-10 15:58:39,064 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=410386.6666666667, ans=15.0 2023-10-10 15:58:41,511 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=410386.6666666667, ans=0.125 2023-10-10 15:58:42,521 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=410433.3333333333, ans=0.0 2023-10-10 15:58:49,249 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.38 vs. limit=15.0 2023-10-10 15:59:01,872 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.02 vs. limit=15.0 2023-10-10 15:59:08,523 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=410526.6666666667, ans=0.0 2023-10-10 15:59:12,683 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.26 vs. limit=15.0 2023-10-10 15:59:15,333 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.356e+02 1.766e+02 1.906e+02 2.058e+02 2.859e+02, threshold=3.811e+02, percent-clipped=0.0 2023-10-10 15:59:50,637 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.69 vs. limit=6.0 2023-10-10 16:00:29,794 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=410853.3333333333, ans=0.125 2023-10-10 16:00:34,339 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=410853.3333333333, ans=0.1 2023-10-10 16:01:06,988 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.242e+02 1.683e+02 1.914e+02 2.126e+02 3.094e+02, threshold=3.828e+02, percent-clipped=0.0 2023-10-10 16:01:07,530 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.11 vs. limit=15.0 2023-10-10 16:01:45,476 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=411180.0, ans=0.1 2023-10-10 16:01:48,928 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=411180.0, ans=0.0 2023-10-10 16:01:50,407 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=411180.0, ans=0.125 2023-10-10 16:01:59,156 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.31 vs. 
limit=22.5 2023-10-10 16:02:04,431 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=411273.3333333333, ans=0.125 2023-10-10 16:02:13,282 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.75 vs. limit=6.0 2023-10-10 16:02:16,163 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=411320.0, ans=0.125 2023-10-10 16:02:56,408 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.489e+02 1.839e+02 2.120e+02 2.440e+02 3.944e+02, threshold=4.241e+02, percent-clipped=1.0 2023-10-10 16:03:02,132 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=411506.6666666667, ans=0.0 2023-10-10 16:03:04,032 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.41 vs. limit=6.0 2023-10-10 16:03:08,127 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=411506.6666666667, ans=0.125 2023-10-10 16:03:16,252 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=411553.3333333333, ans=0.125 2023-10-10 16:03:44,725 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=411693.3333333333, ans=0.0 2023-10-10 16:03:56,635 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=411740.0, ans=0.125 2023-10-10 16:03:57,648 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=411740.0, ans=0.2 2023-10-10 16:04:49,503 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.519e+02 1.768e+02 1.961e+02 2.265e+02 3.462e+02, threshold=3.922e+02, percent-clipped=0.0 2023-10-10 16:05:14,444 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=412020.0, ans=0.0 2023-10-10 16:05:21,962 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.22 vs. limit=12.0 2023-10-10 16:05:41,560 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=412113.3333333333, ans=0.125 2023-10-10 16:05:54,857 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=7.16 vs. limit=15.0 2023-10-10 16:06:03,239 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=412206.6666666667, ans=0.125 2023-10-10 16:06:12,284 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=412253.3333333333, ans=0.0 2023-10-10 16:06:31,428 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=412346.6666666667, ans=0.1 2023-10-10 16:06:34,250 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.14 vs. 
limit=15.0 2023-10-10 16:06:46,289 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.332e+02 1.628e+02 1.866e+02 2.138e+02 3.487e+02, threshold=3.732e+02, percent-clipped=0.0 2023-10-10 16:06:52,018 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=412440.0, ans=0.0 2023-10-10 16:07:10,963 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=412533.3333333333, ans=0.1 2023-10-10 16:07:44,799 INFO [train.py:1031] (3/4) Epoch 7, batch 6500, loss[loss=0.2407, simple_loss=0.3249, pruned_loss=0.07821, over 16848.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.3041, pruned_loss=0.06722, over 31543152.22 frames. ], batch size: 146, lr: 5.30e-03, grad_scale: 32.0 2023-10-10 16:07:47,176 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=412673.3333333333, ans=0.1 2023-10-10 16:07:52,146 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.54 vs. limit=15.0 2023-10-10 16:07:53,012 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=412673.3333333333, ans=0.125 2023-10-10 16:08:08,098 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=412720.0, ans=0.125 2023-10-10 16:08:26,287 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=412813.3333333333, ans=0.125 2023-10-10 16:08:34,629 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 16:08:50,394 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.402e+02 1.810e+02 1.926e+02 2.168e+02 3.611e+02, threshold=3.852e+02, percent-clipped=0.0 2023-10-10 16:08:59,103 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=412906.6666666667, ans=0.95 2023-10-10 16:09:05,193 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=412953.3333333333, ans=0.125 2023-10-10 16:09:05,274 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=412953.3333333333, ans=0.125 2023-10-10 16:09:35,532 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=413046.6666666667, ans=0.1 2023-10-10 16:09:43,875 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.53 vs. 
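The per-batch summaries are consistent with the pruned-transducer objective this recipe trains: past warm-up, loss = 0.5 * simple_loss + pruned_loss (this run's simple_loss_scale is 0.5, with the pruned term at full weight), and tot_loss is the same statistic averaged over all frames seen so far. The "Epoch 7, batch 6500" entry above checks out exactly:

    # Verifying the loss decomposition against the batch 6500 entry,
    # assuming the post-warm-up scales (simple_loss_scale = 0.5,
    # pruned loss at full weight):
    simple_loss, pruned_loss = 0.3249, 0.07821
    print(0.5 * simple_loss + pruned_loss)  # 0.24066 ~ logged loss=0.2407

    tot_simple, tot_pruned = 0.3041, 0.06722
    print(0.5 * tot_simple + tot_pruned)    # 0.21927 ~ logged tot_loss=0.2193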
limit=22.5 2023-10-10 16:09:45,764 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=413093.3333333333, ans=15.0 2023-10-10 16:09:58,787 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=413140.0, ans=0.0 2023-10-10 16:10:01,408 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=413186.6666666667, ans=0.0 2023-10-10 16:10:08,129 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=413186.6666666667, ans=0.2 2023-10-10 16:10:18,522 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=413233.3333333333, ans=0.0 2023-10-10 16:10:23,325 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=413280.0, ans=0.125 2023-10-10 16:10:42,144 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.395e+02 1.727e+02 1.947e+02 2.416e+02 4.561e+02, threshold=3.895e+02, percent-clipped=1.0 2023-10-10 16:10:59,166 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 16:11:07,707 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=413466.6666666667, ans=0.05 2023-10-10 16:11:15,797 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.26 vs. limit=15.0 2023-10-10 16:11:33,672 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=413560.0, ans=0.125 2023-10-10 16:11:38,797 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=413606.6666666667, ans=0.125 2023-10-10 16:11:50,717 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=413653.3333333333, ans=15.0 2023-10-10 16:11:51,487 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=413653.3333333333, ans=0.125 2023-10-10 16:11:52,535 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=413653.3333333333, ans=0.2 2023-10-10 16:12:09,891 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.96 vs. limit=10.0 2023-10-10 16:12:16,379 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.17 vs. limit=15.0 2023-10-10 16:12:22,446 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.26 vs. 
limit=15.0 2023-10-10 16:12:23,265 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=413793.3333333333, ans=0.125 2023-10-10 16:12:32,206 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.429e+02 1.746e+02 1.903e+02 2.149e+02 3.646e+02, threshold=3.807e+02, percent-clipped=0.0 2023-10-10 16:12:50,322 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=413886.6666666667, ans=0.125 2023-10-10 16:12:59,719 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=413933.3333333333, ans=0.125 2023-10-10 16:13:02,553 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.87 vs. limit=15.0 2023-10-10 16:13:09,144 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=413980.0, ans=0.2 2023-10-10 16:13:09,160 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=413980.0, ans=0.125 2023-10-10 16:13:09,246 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=413980.0, ans=0.2 2023-10-10 16:13:11,865 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=413980.0, ans=0.125 2023-10-10 16:13:14,136 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.74 vs. limit=22.5 2023-10-10 16:13:20,593 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=413980.0, ans=0.125 2023-10-10 16:13:22,757 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.82 vs. 
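The "Whitening: name=..., metric=X vs. limit=Y" lines come from a regularizer that scores how far a module's output covariance is from a scaled identity. One standard score is mean(eigenvalue^2) / mean(eigenvalue)^2, which equals 1.0 for perfectly white features and grows as channels become correlated; the module only pushes back with a penalty gradient when the metric exceeds its scheduled limit, which is why these lines flag metric "vs." limit. A sketch of such a metric, offered as an approximation of the idea rather than the exact scaling.py formula:

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
        # x: (num_frames, num_channels). Returns ~1.0 for white
        # features, larger values for correlated/anisotropic ones.
        n, c = x.shape
        cpg = c // num_groups                                  # channels per group
        xg = x.reshape(n, num_groups, cpg).permute(1, 0, 2)    # (g, n, cpg)
        cov = torch.matmul(xg.transpose(1, 2), xg) / n         # (g, cpg, cpg)
        mean_eig = cov.diagonal(dim1=1, dim2=2).mean()         # mean eigenvalue
        mean_eig_sq = (cov ** 2).sum(dim=(1, 2)).mean() / cpg  # mean eigenvalue^2
        return (mean_eig_sq / (mean_eig ** 2 + 1e-20)).item()

    # White noise scores near 1.0, far below typical limits like 15.0.
    print(whitening_metric(torch.randn(10000, 384)))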
limit=22.5 2023-10-10 16:13:25,239 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=414026.6666666667, ans=0.125 2023-10-10 16:13:25,269 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=414026.6666666667, ans=0.05 2023-10-10 16:13:25,310 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=414026.6666666667, ans=0.2 2023-10-10 16:14:19,188 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=414166.6666666667, ans=0.1 2023-10-10 16:14:35,939 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=414260.0, ans=0.2 2023-10-10 16:14:43,530 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.333e+02 1.620e+02 1.783e+02 2.073e+02 2.507e+02, threshold=3.566e+02, percent-clipped=0.0 2023-10-10 16:14:48,196 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=414306.6666666667, ans=0.125 2023-10-10 16:15:16,568 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=414400.0, ans=0.125 2023-10-10 16:15:25,890 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=414446.6666666667, ans=0.125 2023-10-10 16:15:44,143 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=414540.0, ans=0.125 2023-10-10 16:15:49,716 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=414540.0, ans=0.2 2023-10-10 16:15:57,543 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=414586.6666666667, ans=0.1 2023-10-10 16:16:07,270 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=414633.3333333333, ans=0.1 2023-10-10 16:16:18,587 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=414680.0, ans=0.1 2023-10-10 16:16:29,238 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=414726.6666666667, ans=0.125 2023-10-10 16:16:37,115 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.371e+02 1.708e+02 1.919e+02 2.096e+02 3.386e+02, threshold=3.838e+02, percent-clipped=0.0 2023-10-10 16:16:44,659 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff3.min_abs, batch_count=414773.3333333333, ans=0.2 2023-10-10 16:17:16,062 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=414913.3333333333, ans=0.1 2023-10-10 16:17:31,358 INFO [train.py:1031] (3/4) Epoch 7, batch 7000, loss[loss=0.2273, simple_loss=0.3118, pruned_loss=0.0714, over 16856.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.3045, pruned_loss=0.06708, over 31858498.89 frames. 
], batch size: 175, lr: 5.29e-03, grad_scale: 32.0 2023-10-10 16:17:32,530 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=415006.6666666667, ans=0.0 2023-10-10 16:17:40,412 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=415006.6666666667, ans=0.1 2023-10-10 16:17:51,303 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=415053.3333333333, ans=0.125 2023-10-10 16:18:09,134 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.14 vs. limit=22.5 2023-10-10 16:18:27,066 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.393e+02 1.716e+02 1.901e+02 2.262e+02 3.962e+02, threshold=3.802e+02, percent-clipped=1.0 2023-10-10 16:18:27,440 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=415193.3333333333, ans=0.2 2023-10-10 16:18:37,397 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=415240.0, ans=0.0 2023-10-10 16:18:50,796 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=415333.3333333333, ans=0.1 2023-10-10 16:19:01,778 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=415380.0, ans=0.0 2023-10-10 16:19:05,078 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.22 vs. limit=22.5 2023-10-10 16:19:36,978 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=415520.0, ans=0.1 2023-10-10 16:19:39,860 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 16:19:55,436 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=415566.6666666667, ans=0.125 2023-10-10 16:19:59,842 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=415613.3333333333, ans=0.0 2023-10-10 16:20:07,084 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=415613.3333333333, ans=0.0 2023-10-10 16:20:19,373 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.402e+02 1.765e+02 2.001e+02 2.380e+02 3.440e+02, threshold=4.001e+02, percent-clipped=0.0 2023-10-10 16:20:24,139 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.53 vs. limit=15.0 2023-10-10 16:20:25,637 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=415706.6666666667, ans=0.0 2023-10-10 16:20:33,531 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.98 vs. 
limit=12.0 2023-10-10 16:20:40,149 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=415753.3333333333, ans=0.125 2023-10-10 16:20:42,926 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=415800.0, ans=0.125 2023-10-10 16:20:42,952 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=415800.0, ans=0.0 2023-10-10 16:20:56,340 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.15 vs. limit=15.0 2023-10-10 16:21:07,111 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=415893.3333333333, ans=0.125 2023-10-10 16:21:11,899 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=415893.3333333333, ans=0.025 2023-10-10 16:21:17,165 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=415940.0, ans=0.2 2023-10-10 16:21:53,981 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.58 vs. limit=22.5 2023-10-10 16:21:58,133 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.91 vs. limit=15.0 2023-10-10 16:22:24,478 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=11.29 vs. limit=15.0 2023-10-10 16:22:26,026 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.349e+02 1.669e+02 1.836e+02 2.091e+02 3.500e+02, threshold=3.672e+02, percent-clipped=0.0 2023-10-10 16:22:40,192 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=416173.3333333333, ans=0.0 2023-10-10 16:22:46,278 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=416220.0, ans=0.1 2023-10-10 16:22:57,472 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.94 vs. 
limit=15.0 2023-10-10 16:23:44,871 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=416453.3333333333, ans=0.0 2023-10-10 16:24:07,445 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=416500.0, ans=0.125 2023-10-10 16:24:07,455 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=416500.0, ans=0.0 2023-10-10 16:24:31,380 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.410e+02 1.655e+02 1.843e+02 2.140e+02 2.855e+02, threshold=3.686e+02, percent-clipped=0.0 2023-10-10 16:24:39,951 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=416640.0, ans=0.125 2023-10-10 16:24:44,535 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_na.min_abs, batch_count=416686.6666666667, ans=0.02 2023-10-10 16:25:10,458 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=416780.0, ans=0.0 2023-10-10 16:25:19,903 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=416826.6666666667, ans=0.125 2023-10-10 16:25:24,290 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.63 vs. limit=22.5 2023-10-10 16:25:40,771 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=416873.3333333333, ans=0.2 2023-10-10 16:26:06,604 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=417013.3333333333, ans=0.125 2023-10-10 16:26:08,611 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.81 vs. limit=15.0 2023-10-10 16:26:12,080 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.24 vs. limit=15.0 2023-10-10 16:26:16,283 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=417013.3333333333, ans=0.1 2023-10-10 16:26:18,045 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=417060.0, ans=0.2 2023-10-10 16:26:21,885 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=417060.0, ans=0.0 2023-10-10 16:26:29,024 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.294e+02 1.702e+02 1.874e+02 2.042e+02 3.327e+02, threshold=3.749e+02, percent-clipped=0.0 2023-10-10 16:26:34,031 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.03 vs. 
limit=12.0 2023-10-10 16:26:41,048 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=417153.3333333333, ans=0.0 2023-10-10 16:26:53,539 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=417200.0, ans=0.125 2023-10-10 16:26:54,456 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=417200.0, ans=0.125 2023-10-10 16:27:00,885 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.44 vs. limit=5.0 2023-10-10 16:27:21,982 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=417293.3333333333, ans=0.1 2023-10-10 16:27:23,655 INFO [train.py:1031] (3/4) Epoch 7, batch 7500, loss[loss=0.198, simple_loss=0.2847, pruned_loss=0.05565, over 15348.00 frames. ], tot_loss[loss=0.219, simple_loss=0.304, pruned_loss=0.06698, over 32040013.30 frames. ], batch size: 35, lr: 5.27e-03, grad_scale: 16.0 2023-10-10 16:27:34,922 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=417386.6666666667, ans=0.0 2023-10-10 16:27:53,046 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=417433.3333333333, ans=0.125 2023-10-10 16:27:57,095 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=417433.3333333333, ans=0.0 2023-10-10 16:28:20,679 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.03 vs. limit=12.0 2023-10-10 16:28:24,665 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.499e+02 1.759e+02 2.037e+02 2.394e+02 3.927e+02, threshold=4.075e+02, percent-clipped=1.0 2023-10-10 16:28:35,822 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=417573.3333333333, ans=0.0 2023-10-10 16:28:51,230 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=417666.6666666667, ans=0.1 2023-10-10 16:28:52,158 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=417666.6666666667, ans=0.04949747468305833 2023-10-10 16:28:58,652 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=417666.6666666667, ans=0.0 2023-10-10 16:29:06,707 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=417713.3333333333, ans=0.1 2023-10-10 16:29:10,650 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.59 vs. limit=22.5 2023-10-10 16:29:24,983 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.20 vs. 
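The "grad_scale" field in the batch summaries (32.0 in most entries here, dipping to 16.0 at batch 7500) is the dynamic loss-scaling factor for the fp16 training this run enables: the scaler halves the factor after a step that overflows and grows it back after a stretch of clean steps. A minimal sketch using PyTorch's own AMP scaler; icefall wraps this differently, and the model/optimizer names below are placeholders:

    # Minimal fp16 training step with dynamic loss scaling, the
    # mechanism behind the "grad_scale: 32.0 / 16.0" fields above.
    import torch

    scaler = torch.cuda.amp.GradScaler(init_scale=1.0)

    def train_step(model, optimizer, batch, loss_fn):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = loss_fn(model(batch))
        scaler.scale(loss).backward()   # backprop the scaled loss
        scaler.step(optimizer)          # unscales; skips step on inf/nan
        scaler.update()                 # halve on overflow, else grow
        return scaler.get_scale()       # the logged "grad_scale"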
limit=15.0 2023-10-10 16:29:27,561 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=417806.6666666667, ans=0.0 2023-10-10 16:29:37,167 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=417853.3333333333, ans=0.125 2023-10-10 16:29:45,564 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=417900.0, ans=0.125 2023-10-10 16:29:45,787 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.74 vs. limit=6.0 2023-10-10 16:29:56,476 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=417900.0, ans=10.0 2023-10-10 16:30:08,505 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=417946.6666666667, ans=0.125 2023-10-10 16:30:24,573 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=417993.3333333333, ans=0.125 2023-10-10 16:30:31,076 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.395e+02 1.707e+02 1.888e+02 2.168e+02 3.818e+02, threshold=3.775e+02, percent-clipped=0.0 2023-10-10 16:30:33,131 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.14 vs. limit=22.5 2023-10-10 16:30:38,526 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=418040.0, ans=0.2 2023-10-10 16:30:48,056 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=418086.6666666667, ans=0.05 2023-10-10 16:30:51,298 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=418086.6666666667, ans=0.0 2023-10-10 16:30:59,174 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=418133.3333333333, ans=0.0 2023-10-10 16:31:13,273 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.98 vs. limit=10.0 2023-10-10 16:31:41,841 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=418273.3333333333, ans=0.09899494936611666 2023-10-10 16:31:46,002 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=418320.0, ans=0.125 2023-10-10 16:31:53,689 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=418320.0, ans=0.125 2023-10-10 16:31:57,754 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=418366.6666666667, ans=0.0 2023-10-10 16:32:03,047 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.87 vs. 
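A large share of the scheduled values above belong to Balancer modules (balancer1.prob, balancer.min_positive, balancer_ff3.min_abs, ...). The idea is to keep per-channel activation statistics, such as the fraction of positive values or the mean absolute value, inside configured bounds: the forward pass is left untouched, and with probability prob a small corrective term is added to the backward gradient of channels that drift out of range. A much-simplified sketch of that forward-identity / backward-nudge pattern; the real module tracks more statistics and scales the correction more carefully:

    import torch

    class BalancerFunction(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x, min_positive, max_positive, scale):
            ctx.save_for_backward(x)
            ctx.cfg = (min_positive, max_positive, scale)
            return x  # identity in the forward direction

        @staticmethod
        def backward(ctx, grad_out):
            (x,) = ctx.saved_tensors
            min_pos, max_pos, scale = ctx.cfg
            # fraction of positive activations per channel (last dim)
            frac_pos = (x > 0).float().mean(dim=(0, 1))
            # push too-positive channels down, too-negative channels up
            nudge = scale * grad_out.abs().mean() * (
                (frac_pos > max_pos).float() - (frac_pos < min_pos).float())
            return grad_out + nudge, None, None, None

    x = torch.randn(16, 100, 256, requires_grad=True)
    y = BalancerFunction.apply(x, 0.05, 0.95, 0.01)
    y.sum().backward()  # identity gradients plus the per-channel nudges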
limit=12.0 2023-10-10 16:32:14,539 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=418413.3333333333, ans=10.0 2023-10-10 16:32:23,668 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=418460.0, ans=0.125 2023-10-10 16:32:33,000 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.283e+02 1.738e+02 1.979e+02 2.253e+02 5.503e+02, threshold=3.958e+02, percent-clipped=2.0 2023-10-10 16:32:35,476 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=418506.6666666667, ans=0.1 2023-10-10 16:33:06,059 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=418646.6666666667, ans=0.125 2023-10-10 16:33:09,182 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=418646.6666666667, ans=0.125 2023-10-10 16:33:09,938 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_na.min_abs, batch_count=418646.6666666667, ans=0.02 2023-10-10 16:33:16,602 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=418693.3333333333, ans=0.0 2023-10-10 16:33:44,568 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=418786.6666666667, ans=0.2 2023-10-10 16:33:47,804 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=418786.6666666667, ans=0.1 2023-10-10 16:33:51,369 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.29 vs. limit=12.0 2023-10-10 16:34:05,326 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=418833.3333333333, ans=0.0 2023-10-10 16:34:11,075 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=418880.0, ans=0.125 2023-10-10 16:34:26,815 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=418926.6666666667, ans=0.2 2023-10-10 16:34:30,587 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=418926.6666666667, ans=10.0 2023-10-10 16:34:34,080 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.442e+02 1.768e+02 1.946e+02 2.206e+02 2.921e+02, threshold=3.892e+02, percent-clipped=0.0 2023-10-10 16:35:02,941 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.54 vs. 
limit=12.0 2023-10-10 16:35:13,260 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=419113.3333333333, ans=0.2 2023-10-10 16:35:16,121 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=419113.3333333333, ans=0.0 2023-10-10 16:35:21,844 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=419160.0, ans=0.0 2023-10-10 16:35:25,579 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=419160.0, ans=0.5 2023-10-10 16:35:33,425 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=419206.6666666667, ans=0.1 2023-10-10 16:35:46,925 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=419253.3333333333, ans=0.125 2023-10-10 16:36:01,456 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=419300.0, ans=0.1 2023-10-10 16:36:03,604 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=419346.6666666667, ans=0.1 2023-10-10 16:36:06,231 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=419346.6666666667, ans=0.0 2023-10-10 16:36:15,887 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=419393.3333333333, ans=0.125 2023-10-10 16:36:17,593 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=419393.3333333333, ans=0.0 2023-10-10 16:36:17,658 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=419393.3333333333, ans=0.125 2023-10-10 16:36:25,666 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.314e+02 1.618e+02 1.800e+02 2.097e+02 3.234e+02, threshold=3.600e+02, percent-clipped=0.0 2023-10-10 16:36:28,742 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=419440.0, ans=0.125 2023-10-10 16:36:44,433 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 16:37:26,148 INFO [train.py:1031] (3/4) Epoch 7, batch 8000, loss[loss=0.2124, simple_loss=0.2727, pruned_loss=0.07609, over 12840.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.3035, pruned_loss=0.06646, over 32198573.55 frames. ], batch size: 440, lr: 5.26e-03, grad_scale: 32.0 2023-10-10 16:37:32,833 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=419673.3333333333, ans=0.2 2023-10-10 16:37:46,625 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.87 vs. limit=22.5 2023-10-10 16:37:52,954 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.98 vs. 
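The learning rate in the batch summaries decays slowly (5.30e-03 at batch 6500 down to 5.26e-03 by batch 8000) because the recipe uses icefall's Eden schedule, which discounts the base LR by both batch count and epoch, roughly lr = base_lr * ((batch/lr_batches)^2 + 1)^(-1/4) * ((epoch/lr_epochs)^2 + 1)^(-1/4), with base_lr = 0.045, lr_batches = 7500 and lr_epochs = 1.0 in this run. A sketch under that reading; the batch and epoch values below are illustrative:

    # Sketch of an Eden-style LR schedule; treat as an approximation of
    # icefall's optim.py, using this run's constants (base_lr=0.045,
    # lr_batches=7500, lr_epochs=1.0).
    def eden_lr(base_lr: float, batch: float, epoch: float,
                lr_batches: float = 7500.0, lr_epochs: float = 1.0) -> float:
        batch_factor = ((batch / lr_batches) ** 2 + 1.0) ** -0.25
        epoch_factor = ((epoch / lr_epochs) ** 2 + 1.0) ** -0.25
        return base_lr * batch_factor * epoch_factor

    # Deep into training both factors are well below 1, which is how a
    # base LR of 0.045 becomes the ~5.3e-03 values seen in epoch 7 above.
    print(eden_lr(0.045, batch=8.0e4, epoch=6.5))  # ~5e-03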
limit=15.0 2023-10-10 16:38:01,293 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=419813.3333333333, ans=0.125 2023-10-10 16:38:08,542 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=419813.3333333333, ans=0.0 2023-10-10 16:38:20,991 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.358e+02 1.595e+02 1.698e+02 1.908e+02 3.006e+02, threshold=3.396e+02, percent-clipped=0.0 2023-10-10 16:38:21,321 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=419906.6666666667, ans=0.125 2023-10-10 16:38:22,770 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 16:38:24,286 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=419906.6666666667, ans=0.125 2023-10-10 16:38:28,182 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=419906.6666666667, ans=0.125 2023-10-10 16:38:33,644 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=419953.3333333333, ans=0.125 2023-10-10 16:38:44,131 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=420000.0, ans=0.1 2023-10-10 16:38:54,596 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=420000.0, ans=0.125 2023-10-10 16:38:58,548 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.79 vs. limit=15.0 2023-10-10 16:39:16,181 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=420093.3333333333, ans=0.125 2023-10-10 16:39:30,834 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=420186.6666666667, ans=0.2 2023-10-10 16:39:47,634 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=420233.3333333333, ans=0.1 2023-10-10 16:39:47,776 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.20 vs. limit=22.5 2023-10-10 16:39:48,008 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.76 vs. 
limit=10.0 2023-10-10 16:40:00,604 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=420280.0, ans=0.0 2023-10-10 16:40:19,408 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.472e+02 1.775e+02 1.980e+02 2.191e+02 3.122e+02, threshold=3.959e+02, percent-clipped=0.0 2023-10-10 16:40:43,200 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=420420.0, ans=0.0 2023-10-10 16:40:44,972 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=420420.0, ans=0.125 2023-10-10 16:40:46,727 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=420420.0, ans=0.125 2023-10-10 16:40:52,047 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=420420.0, ans=0.125 2023-10-10 16:41:13,132 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=420513.3333333333, ans=0.125 2023-10-10 16:41:16,363 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=420513.3333333333, ans=0.0 2023-10-10 16:41:33,292 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=420606.6666666667, ans=0.125 2023-10-10 16:41:35,138 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=420606.6666666667, ans=0.0 2023-10-10 16:41:36,040 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=420606.6666666667, ans=0.025 2023-10-10 16:41:50,322 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.66 vs. limit=10.0 2023-10-10 16:41:56,916 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.89 vs. limit=15.0 2023-10-10 16:42:00,286 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.97 vs. 
limit=10.0 2023-10-10 16:42:20,562 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=420746.6666666667, ans=0.2 2023-10-10 16:42:38,277 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.303e+02 1.638e+02 1.821e+02 2.069e+02 3.139e+02, threshold=3.641e+02, percent-clipped=0.0 2023-10-10 16:42:52,417 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=420886.6666666667, ans=0.125 2023-10-10 16:42:52,510 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=420886.6666666667, ans=0.125 2023-10-10 16:42:55,717 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 16:43:01,977 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=420933.3333333333, ans=0.0 2023-10-10 16:43:52,111 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=421120.0, ans=0.125 2023-10-10 16:43:55,115 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=421120.0, ans=0.125 2023-10-10 16:43:55,177 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=421120.0, ans=0.0 2023-10-10 16:44:07,015 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=421166.6666666667, ans=0.1 2023-10-10 16:44:11,805 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=421166.6666666667, ans=0.125 2023-10-10 16:44:19,002 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=421213.3333333333, ans=0.125 2023-10-10 16:44:30,895 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=421260.0, ans=0.2 2023-10-10 16:44:33,235 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=421260.0, ans=10.0 2023-10-10 16:44:38,774 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=421260.0, ans=0.0 2023-10-10 16:44:44,005 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.356e+02 1.700e+02 1.904e+02 2.164e+02 3.559e+02, threshold=3.807e+02, percent-clipped=0.0 2023-10-10 16:44:46,309 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=421306.6666666667, ans=0.09899494936611666 2023-10-10 16:44:53,390 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=421353.3333333333, ans=0.1 2023-10-10 16:44:54,399 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=421353.3333333333, ans=0.125 2023-10-10 16:44:54,473 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=421353.3333333333, ans=0.125 2023-10-10 16:44:54,498 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=421353.3333333333, ans=0.2 2023-10-10 16:44:55,424 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=421353.3333333333, ans=0.1 2023-10-10 16:45:07,742 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=421400.0, ans=0.125 2023-10-10 16:45:13,688 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 16:45:21,570 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=421446.6666666667, ans=0.125 2023-10-10 16:45:41,505 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=421540.0, ans=0.0 2023-10-10 16:46:07,423 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.94 vs. limit=12.0 2023-10-10 16:46:11,437 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=421633.3333333333, ans=0.1 2023-10-10 16:46:23,968 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=421680.0, ans=0.1 2023-10-10 16:46:44,056 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 16:46:44,212 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.52 vs. limit=15.0 2023-10-10 16:46:45,824 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.418e+02 1.713e+02 1.887e+02 2.097e+02 3.418e+02, threshold=3.775e+02, percent-clipped=0.0 2023-10-10 16:47:00,623 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.59 vs. limit=15.0 2023-10-10 16:47:07,198 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=421866.6666666667, ans=0.125 2023-10-10 16:47:11,253 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.77 vs. limit=6.0 2023-10-10 16:47:13,833 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=421866.6666666667, ans=0.125 2023-10-10 16:47:30,525 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.37 vs. limit=6.0 2023-10-10 16:47:31,978 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.97 vs. limit=15.0 2023-10-10 16:47:47,107 INFO [train.py:1031] (3/4) Epoch 7, batch 8500, loss[loss=0.2152, simple_loss=0.3019, pruned_loss=0.06422, over 16549.00 frames. ], tot_loss[loss=0.218, simple_loss=0.3036, pruned_loss=0.06616, over 32358701.84 frames. 
], batch size: 56, lr: 5.25e-03, grad_scale: 32.0 2023-10-10 16:47:47,359 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=422006.6666666667, ans=0.2 2023-10-10 16:48:09,167 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=422053.3333333333, ans=0.07 2023-10-10 16:48:16,713 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=422100.0, ans=0.04949747468305833 2023-10-10 16:48:23,756 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=422146.6666666667, ans=0.2 2023-10-10 16:48:40,402 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=422193.3333333333, ans=0.025 2023-10-10 16:48:47,003 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.289e+02 1.693e+02 1.852e+02 2.060e+02 3.508e+02, threshold=3.705e+02, percent-clipped=0.0 2023-10-10 16:48:50,858 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=422240.0, ans=0.125 2023-10-10 16:48:50,932 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=422240.0, ans=0.125 2023-10-10 16:49:18,125 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=422333.3333333333, ans=0.125 2023-10-10 16:49:26,841 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=4.14 vs. limit=12.0 2023-10-10 16:49:29,047 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=422380.0, ans=0.125 2023-10-10 16:49:38,337 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=422426.6666666667, ans=0.125 2023-10-10 16:50:04,414 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=422473.3333333333, ans=0.2 2023-10-10 16:50:05,257 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 16:50:12,328 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=422520.0, ans=15.0 2023-10-10 16:50:13,957 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=422520.0, ans=0.0 2023-10-10 16:50:23,765 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=422566.6666666667, ans=0.04949747468305833 2023-10-10 16:50:31,221 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=422613.3333333333, ans=0.125 2023-10-10 16:50:48,477 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.10 vs. 
limit=10.0 2023-10-10 16:50:52,782 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=422660.0, ans=0.125 2023-10-10 16:50:58,318 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.295e+02 1.621e+02 1.814e+02 2.057e+02 3.389e+02, threshold=3.628e+02, percent-clipped=0.0 2023-10-10 16:51:06,929 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=422706.6666666667, ans=0.125 2023-10-10 16:51:18,763 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=422753.3333333333, ans=0.125 2023-10-10 16:51:43,231 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.87 vs. limit=15.0 2023-10-10 16:51:50,600 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=422893.3333333333, ans=0.2 2023-10-10 16:52:03,301 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 16:52:06,851 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=422940.0, ans=0.125 2023-10-10 16:52:28,963 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=422986.6666666667, ans=0.125 2023-10-10 16:52:36,128 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=423033.3333333333, ans=0.2 2023-10-10 16:52:40,073 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.49 vs. limit=10.0 2023-10-10 16:53:07,939 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=423126.6666666667, ans=0.0 2023-10-10 16:53:14,006 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.411e+02 1.673e+02 1.829e+02 2.197e+02 3.613e+02, threshold=3.659e+02, percent-clipped=0.0 2023-10-10 16:53:35,572 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.37 vs. limit=15.0 2023-10-10 16:54:02,399 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=423313.3333333333, ans=0.0 2023-10-10 16:54:12,036 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.07 vs. 
limit=6.0 2023-10-10 16:54:14,170 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=423360.0, ans=0.125 2023-10-10 16:54:19,849 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=423406.6666666667, ans=0.125 2023-10-10 16:54:31,244 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=423453.3333333333, ans=0.125 2023-10-10 16:54:40,583 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=423500.0, ans=0.04949747468305833 2023-10-10 16:54:51,784 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=423546.6666666667, ans=0.1 2023-10-10 16:54:53,509 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=423546.6666666667, ans=0.0 2023-10-10 16:54:57,474 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=423546.6666666667, ans=0.05 2023-10-10 16:54:59,307 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=423546.6666666667, ans=0.1 2023-10-10 16:55:02,399 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=423593.3333333333, ans=0.0 2023-10-10 16:55:18,806 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.369e+02 1.742e+02 2.092e+02 2.496e+02 3.647e+02, threshold=4.184e+02, percent-clipped=0.0 2023-10-10 16:55:25,180 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=423640.0, ans=0.0 2023-10-10 16:55:45,853 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=423733.3333333333, ans=0.1 2023-10-10 16:56:17,443 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=423873.3333333333, ans=0.0 2023-10-10 16:56:32,327 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=423920.0, ans=0.125 2023-10-10 16:56:37,099 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.05 vs. limit=10.0 2023-10-10 16:56:51,470 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=424013.3333333333, ans=0.1 2023-10-10 16:57:10,548 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.405e+02 1.729e+02 1.875e+02 2.213e+02 2.767e+02, threshold=3.751e+02, percent-clipped=0.0 2023-10-10 16:57:14,719 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=424106.6666666667, ans=0.1 2023-10-10 16:57:21,076 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.06 vs. 
limit=15.0 2023-10-10 16:57:25,762 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=424153.3333333333, ans=0.0 2023-10-10 16:57:27,195 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.38 vs. limit=12.0 2023-10-10 16:57:55,053 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=424246.6666666667, ans=0.04949747468305833 2023-10-10 16:58:01,173 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff3.min_abs, batch_count=424293.3333333333, ans=0.2 2023-10-10 16:58:07,150 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=424340.0, ans=0.1 2023-10-10 16:58:07,711 INFO [train.py:1031] (3/4) Epoch 7, batch 9000, loss[loss=0.2351, simple_loss=0.3183, pruned_loss=0.07597, over 16891.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.303, pruned_loss=0.06577, over 32465968.04 frames. ], batch size: 110, lr: 5.23e-03, grad_scale: 32.0 2023-10-10 16:58:10,417 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=424340.0, ans=0.1 2023-10-10 16:58:24,900 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=424386.6666666667, ans=0.0 2023-10-10 16:58:32,626 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 16:58:34,972 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=424433.3333333333, ans=0.125 2023-10-10 16:58:40,270 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.46 vs. limit=15.0 2023-10-10 16:58:48,524 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=424480.0, ans=0.125 2023-10-10 16:58:48,944 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.07 vs. limit=15.0 2023-10-10 16:58:50,554 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=424480.0, ans=0.0 2023-10-10 16:58:56,658 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=424526.6666666667, ans=0.035 2023-10-10 16:59:01,925 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.21 vs. limit=15.0 2023-10-10 16:59:03,849 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.53 vs. limit=12.0 2023-10-10 16:59:06,616 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.427e+02 1.711e+02 1.861e+02 2.076e+02 3.072e+02, threshold=3.722e+02, percent-clipped=0.0 2023-10-10 16:59:12,739 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=424573.3333333333, ans=0.125 2023-10-10 16:59:32,183 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.82 vs. 
limit=15.0 2023-10-10 16:59:44,250 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 16:59:51,730 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=424713.3333333333, ans=0.0 2023-10-10 17:00:14,059 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.11 vs. limit=15.0 2023-10-10 17:00:16,979 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=424853.3333333333, ans=0.2 2023-10-10 17:00:26,994 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=424900.0, ans=0.0 2023-10-10 17:00:30,691 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=424900.0, ans=0.2 2023-10-10 17:00:31,413 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=424900.0, ans=0.2 2023-10-10 17:00:51,041 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.65 vs. limit=15.0 2023-10-10 17:01:01,288 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=425040.0, ans=0.0 2023-10-10 17:01:01,900 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.391e+02 1.707e+02 1.847e+02 2.089e+02 2.824e+02, threshold=3.693e+02, percent-clipped=0.0 2023-10-10 17:01:27,483 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=425133.3333333333, ans=0.0 2023-10-10 17:01:29,379 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=425133.3333333333, ans=0.125 2023-10-10 17:01:38,889 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=425180.0, ans=0.0 2023-10-10 17:01:39,166 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=6.32 vs. limit=15.0 2023-10-10 17:01:42,548 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.46 vs. limit=15.0 2023-10-10 17:01:52,470 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=425226.6666666667, ans=0.1 2023-10-10 17:02:02,043 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.02 vs. 
limit=15.0 2023-10-10 17:02:13,326 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=425320.0, ans=0.0 2023-10-10 17:02:13,382 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=425320.0, ans=0.1 2023-10-10 17:02:21,678 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=425366.6666666667, ans=0.125 2023-10-10 17:02:30,182 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=425413.3333333333, ans=0.0 2023-10-10 17:02:32,644 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=425413.3333333333, ans=0.0 2023-10-10 17:02:46,260 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=425460.0, ans=0.0 2023-10-10 17:02:49,911 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=425460.0, ans=0.0 2023-10-10 17:02:52,135 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.68 vs. limit=6.0 2023-10-10 17:02:54,517 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.381e+02 1.786e+02 1.965e+02 2.289e+02 3.111e+02, threshold=3.929e+02, percent-clipped=0.0 2023-10-10 17:03:15,795 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=425600.0, ans=0.125 2023-10-10 17:03:41,172 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=425693.3333333333, ans=0.125 2023-10-10 17:03:59,597 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=425786.6666666667, ans=0.0 2023-10-10 17:04:04,745 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.01 vs. limit=15.0 2023-10-10 17:04:21,775 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=425833.3333333333, ans=0.2 2023-10-10 17:04:46,923 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=425926.6666666667, ans=0.125 2023-10-10 17:04:52,365 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.406e+02 1.760e+02 1.964e+02 2.195e+02 3.007e+02, threshold=3.929e+02, percent-clipped=0.0 2023-10-10 17:04:57,488 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=425973.3333333333, ans=0.1 2023-10-10 17:05:28,010 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=426066.6666666667, ans=0.0 2023-10-10 17:05:55,279 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.06 vs. limit=15.0 2023-10-10 17:06:12,967 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.67 vs. 
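Each optim.py clipping entry prints five quantiles (min, 25%, median, 75%, max) of recent gradient norms plus a threshold that equals Clipping_scale times the logged median (e.g. 2.0 x 1.965e+02 ≈ 3.929e+02 in the entry above). A sketch of that bookkeeping, assuming a hypothetical buffer of per-step gradient norms rather than the optimizer's internal state:

```python
import torch

def clipping_report(grad_norms: torch.Tensor, clipping_scale: float = 2.0):
    """Quartiles of recent grad norms and a median-based clipping threshold.

    grad_norms: 1-D tensor of gradient norms from recent optimizer steps
    (a stand-in for whatever history the real optimizer keeps).
    """
    q = torch.quantile(grad_norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    threshold = clipping_scale * q[2]  # scale times the median
    percent_clipped = 100.0 * (grad_norms > threshold).float().mean()
    return q, threshold, percent_clipped

# Feeding back the quartiles logged above recovers the logged threshold:
norms = torch.tensor([138.1, 178.6, 196.5, 228.9, 311.1])
q, thr, pct = clipping_report(norms)
print(q.tolist(), float(thr), float(pct))  # threshold ~393.0, 0% clipped
```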
limit=22.5 2023-10-10 17:06:22,909 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=426300.0, ans=0.0 2023-10-10 17:06:37,686 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=426346.6666666667, ans=0.125 2023-10-10 17:06:49,237 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=426393.3333333333, ans=0.0 2023-10-10 17:06:53,938 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.435e+02 1.694e+02 1.863e+02 2.135e+02 3.320e+02, threshold=3.727e+02, percent-clipped=0.0 2023-10-10 17:06:55,880 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=426440.0, ans=0.0 2023-10-10 17:06:59,829 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.52 vs. limit=6.0 2023-10-10 17:07:26,405 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=426533.3333333333, ans=0.125 2023-10-10 17:07:35,646 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=426580.0, ans=0.0 2023-10-10 17:07:41,423 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=426580.0, ans=10.0 2023-10-10 17:07:46,483 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=426626.6666666667, ans=0.125 2023-10-10 17:07:53,951 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.14 vs. limit=22.5 2023-10-10 17:07:55,268 INFO [train.py:1031] (3/4) Epoch 7, batch 9500, loss[loss=0.2279, simple_loss=0.3129, pruned_loss=0.07145, over 16874.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.3039, pruned_loss=0.06618, over 32552058.54 frames. 
], batch size: 110, lr: 5.22e-03, grad_scale: 32.0 2023-10-10 17:07:55,445 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=426673.3333333333, ans=0.0 2023-10-10 17:08:03,147 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=426673.3333333333, ans=10.0 2023-10-10 17:08:19,843 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=426766.6666666667, ans=0.1 2023-10-10 17:08:21,747 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=426766.6666666667, ans=0.125 2023-10-10 17:08:22,719 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=426766.6666666667, ans=0.1 2023-10-10 17:08:54,476 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=426906.6666666667, ans=0.0 2023-10-10 17:08:55,109 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.391e+02 1.706e+02 1.940e+02 2.208e+02 2.982e+02, threshold=3.879e+02, percent-clipped=0.0 2023-10-10 17:09:03,865 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.18 vs. limit=6.0 2023-10-10 17:09:47,168 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=427093.3333333333, ans=0.125 2023-10-10 17:10:03,115 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.73 vs. limit=22.5 2023-10-10 17:10:38,630 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.32 vs. 
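The tot_loss figures are running averages weighted by frame counts, which is why the denominator grows from 32,465,968.04 frames at batch 9000 to 32,552,058.54 at batch 9500; the fractional totals suggest padding-adjusted (hence non-integer) frame weights. A sketch of that accumulator, with names assumed:

```python
class FrameWeightedAverage:
    """Running frame-weighted loss average, mirroring the tot_loss bookkeeping.

    Frame weights are treated as plain floats, matching the fractional
    frame totals in the log. Illustrative only.
    """

    def __init__(self) -> None:
        self.loss_sum = 0.0
        self.frames = 0.0

    def update(self, batch_loss: float, batch_frames: float) -> None:
        self.loss_sum += batch_loss * batch_frames
        self.frames += batch_frames

    @property
    def value(self) -> float:
        return self.loss_sum / max(self.frames, 1.0)
```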
limit=15.0 2023-10-10 17:10:51,402 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.353e+02 1.680e+02 1.848e+02 2.237e+02 3.126e+02, threshold=3.695e+02, percent-clipped=0.0 2023-10-10 17:11:14,715 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=427466.6666666667, ans=0.125 2023-10-10 17:11:41,016 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=427560.0, ans=0.0 2023-10-10 17:11:48,164 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=427606.6666666667, ans=0.2 2023-10-10 17:11:57,149 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten.whitening_limit, batch_count=427606.6666666667, ans=15.0 2023-10-10 17:12:20,765 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=427746.6666666667, ans=0.0 2023-10-10 17:12:44,222 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.303e+02 1.637e+02 1.752e+02 2.013e+02 2.658e+02, threshold=3.504e+02, percent-clipped=0.0 2023-10-10 17:13:12,290 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=427933.3333333333, ans=0.0 2023-10-10 17:13:20,201 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 17:13:40,700 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=428026.6666666667, ans=0.0 2023-10-10 17:13:42,337 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=428026.6666666667, ans=0.125 2023-10-10 17:14:26,110 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=428166.6666666667, ans=0.0 2023-10-10 17:14:26,205 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=428166.6666666667, ans=0.2 2023-10-10 17:14:30,540 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=428166.6666666667, ans=0.1 2023-10-10 17:14:43,347 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=428260.0, ans=0.125 2023-10-10 17:14:44,658 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=428260.0, ans=0.0 2023-10-10 17:15:00,210 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.405e+02 1.686e+02 1.836e+02 2.068e+02 3.231e+02, threshold=3.672e+02, percent-clipped=0.0 2023-10-10 17:16:14,305 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=428586.6666666667, ans=0.125 2023-10-10 17:16:19,651 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=428633.3333333333, ans=0.2 2023-10-10 17:16:33,093 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=428680.0, ans=0.1 2023-10-10 17:16:36,861 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=428680.0, ans=0.125 2023-10-10 17:16:38,769 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=428680.0, ans=0.0 2023-10-10 17:16:39,539 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=428680.0, ans=0.0 2023-10-10 17:16:41,296 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=6.58 vs. limit=15.0 2023-10-10 17:16:48,528 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=428726.6666666667, ans=0.125 2023-10-10 17:16:53,675 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.332e+02 1.700e+02 1.849e+02 2.113e+02 3.177e+02, threshold=3.698e+02, percent-clipped=0.0 2023-10-10 17:16:55,827 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=428773.3333333333, ans=0.125 2023-10-10 17:17:11,312 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.31 vs. limit=15.0 2023-10-10 17:17:36,628 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=428913.3333333333, ans=0.0 2023-10-10 17:17:49,593 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=429006.6666666667, ans=0.2 2023-10-10 17:17:49,918 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.98 vs. limit=22.5 2023-10-10 17:17:50,777 INFO [train.py:1031] (3/4) Epoch 7, batch 10000, loss[loss=0.2229, simple_loss=0.3037, pruned_loss=0.07104, over 16871.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.3027, pruned_loss=0.06553, over 32597600.58 frames. ], batch size: 116, lr: 5.20e-03, grad_scale: 32.0 2023-10-10 17:17:53,283 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.67 vs. limit=15.0 2023-10-10 17:18:20,718 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 17:18:37,831 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.26 vs. 
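The ScheduledFloat entries print a parameter name, the global batch_count, and the value (ans) currently in effect; in scaling.py such values follow a schedule over batch count rather than staying fixed. A minimal piecewise-linear sketch of that idea (the real class supports more, such as defaults and arithmetic on schedules; the (0.0, 0.3) -> (20000.0, 0.1) breakpoints are an assumed example chosen to match the many dropout_p entries reading ans=0.1 at these batch counts):

```python
import numpy as np

class ScheduledFloatSketch:
    """Piecewise-linear float schedule keyed on batch count (illustrative)."""

    def __init__(self, *points: tuple[float, float]) -> None:
        self.xs = np.array([p[0] for p in points])  # batch counts
        self.ys = np.array([p[1] for p in points])  # values at those counts

    def __call__(self, batch_count: float) -> float:
        # np.interp holds the end values constant outside [xs[0], xs[-1]]
        return float(np.interp(batch_count, self.xs, self.ys))

dropout = ScheduledFloatSketch((0.0, 0.3), (20000.0, 0.1))
print(dropout(424340.0))  # well past the final breakpoint -> 0.1
```

With a schedule like this, early training gets the larger regularization value and the logged ans settles at the final breakpoint value for the rest of the run.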
limit=22.5 2023-10-10 17:18:40,982 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=429193.3333333333, ans=0.0 2023-10-10 17:18:48,346 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.293e+02 1.772e+02 2.004e+02 2.323e+02 3.394e+02, threshold=4.009e+02, percent-clipped=0.0 2023-10-10 17:18:57,022 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=429286.6666666667, ans=0.025 2023-10-10 17:18:58,512 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=429286.6666666667, ans=0.125 2023-10-10 17:19:13,994 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.55 vs. limit=15.0 2023-10-10 17:19:25,410 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.95 vs. limit=15.0 2023-10-10 17:19:40,954 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=429426.6666666667, ans=0.1 2023-10-10 17:19:42,741 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=429473.3333333333, ans=0.0 2023-10-10 17:19:44,509 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=429473.3333333333, ans=0.1 2023-10-10 17:19:45,450 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=429473.3333333333, ans=0.0 2023-10-10 17:19:56,366 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.55 vs. limit=6.0 2023-10-10 17:20:40,302 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.419e+02 1.634e+02 1.822e+02 1.991e+02 2.580e+02, threshold=3.645e+02, percent-clipped=0.0 2023-10-10 17:20:42,941 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=429706.6666666667, ans=0.0 2023-10-10 17:20:51,780 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=429753.3333333333, ans=0.0 2023-10-10 17:20:54,634 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=429753.3333333333, ans=0.125 2023-10-10 17:20:57,884 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=429753.3333333333, ans=0.125 2023-10-10 17:21:01,718 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=429800.0, ans=0.1 2023-10-10 17:21:10,924 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=429846.6666666667, ans=0.125 2023-10-10 17:21:29,205 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.20 vs. 
limit=15.0 2023-10-10 17:22:02,590 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=430033.3333333333, ans=0.0 2023-10-10 17:22:11,268 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=430033.3333333333, ans=0.0 2023-10-10 17:22:16,176 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=430080.0, ans=0.125 2023-10-10 17:22:22,914 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=430080.0, ans=0.125 2023-10-10 17:22:26,365 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=430126.6666666667, ans=0.015 2023-10-10 17:22:37,761 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.407e+02 1.731e+02 1.900e+02 2.200e+02 3.020e+02, threshold=3.800e+02, percent-clipped=0.0 2023-10-10 17:22:39,915 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.27 vs. limit=15.0 2023-10-10 17:22:47,762 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=430173.3333333333, ans=0.0 2023-10-10 17:22:49,783 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=430220.0, ans=0.07 2023-10-10 17:22:55,173 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=430220.0, ans=0.125 2023-10-10 17:23:05,558 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=430266.6666666667, ans=0.0 2023-10-10 17:23:11,118 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.25 vs. limit=15.0 2023-10-10 17:23:36,948 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.40 vs. limit=22.5 2023-10-10 17:24:06,514 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.67 vs. 
limit=6.0 2023-10-10 17:24:08,218 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=430500.0, ans=0.125 2023-10-10 17:24:13,648 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=430500.0, ans=0.2 2023-10-10 17:24:19,477 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=430546.6666666667, ans=0.125 2023-10-10 17:24:25,862 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=430546.6666666667, ans=0.2 2023-10-10 17:24:46,420 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.394e+02 1.712e+02 1.934e+02 2.378e+02 3.327e+02, threshold=3.868e+02, percent-clipped=0.0 2023-10-10 17:24:55,840 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=430686.6666666667, ans=0.5 2023-10-10 17:25:02,916 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=430686.6666666667, ans=0.125 2023-10-10 17:25:08,867 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=430733.3333333333, ans=0.125 2023-10-10 17:25:37,651 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.66 vs. limit=6.0 2023-10-10 17:25:40,319 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=430826.6666666667, ans=0.125 2023-10-10 17:25:51,859 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.85 vs. limit=15.0 2023-10-10 17:25:53,293 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=430873.3333333333, ans=0.0 2023-10-10 17:26:01,519 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=430920.0, ans=0.09899494936611666 2023-10-10 17:26:13,689 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=430966.6666666667, ans=0.0 2023-10-10 17:26:29,061 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.068e-02 2023-10-10 17:26:39,617 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.58 vs. 
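The Whitening entries compare a per-module metric against a limit; an excess (the "metric=X vs. limit=Y" cases above where X > Y) is what triggers a corrective penalty. One plausible formulation of such a metric, equal to 1.0 when the feature covariance is proportional to the identity and growing with anisotropy, is mean(lambda^2) / mean(lambda)^2 over the covariance eigenvalues; the exact computation in scaling.py may differ, and the real module evaluates it per channel group (num_groups) while this sketch uses a single group:

```python
import torch

def whitening_metric(x: torch.Tensor) -> float:
    """Anisotropy of the feature covariance: ~1.0 if perfectly 'white'.

    x: (num_frames, num_channels). A sketch of the idea behind the logged
    metric, not the exact scaling.py computation.
    """
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.T @ x) / x.shape[0]
    eigs = torch.linalg.eigvalsh(cov)  # real eigenvalues, ascending
    return float((eigs**2).mean() / eigs.mean().clamp(min=1e-20) ** 2)

white = torch.randn(10000, 64)
print(whitening_metric(white))   # close to 1.0
skewed = white * torch.linspace(0.1, 3.0, 64)
print(whitening_metric(skewed))  # noticeably larger
```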
limit=6.0 2023-10-10 17:26:43,676 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=431060.0, ans=0.07 2023-10-10 17:26:49,098 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=431106.6666666667, ans=0.0 2023-10-10 17:26:49,701 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.344e+02 1.678e+02 1.918e+02 2.276e+02 3.202e+02, threshold=3.836e+02, percent-clipped=0.0 2023-10-10 17:26:51,015 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=431106.6666666667, ans=0.0 2023-10-10 17:27:15,382 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=431200.0, ans=0.1 2023-10-10 17:27:23,455 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=431246.6666666667, ans=0.2 2023-10-10 17:27:44,598 INFO [train.py:1031] (3/4) Epoch 7, batch 10500, loss[loss=0.2016, simple_loss=0.2918, pruned_loss=0.05569, over 16929.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.3032, pruned_loss=0.06581, over 32626909.46 frames. ], batch size: 82, lr: 5.19e-03, grad_scale: 16.0 2023-10-10 17:28:25,326 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=431480.0, ans=0.125 2023-10-10 17:28:28,488 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=431480.0, ans=0.125 2023-10-10 17:28:36,804 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 17:28:42,514 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.58 vs. 
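grad_scale is the AMP loss scale in effect (the run uses fp16): it sits at 32.0 through batch 10000 and reads 16.0 by batch 10500, consistent with the scaler halving once after detecting an inf/nan gradient somewhere in that interval, though the log does not show the event itself. A standard torch.cuda.amp loop with matching settings; the specific factor values below are assumptions apart from backoff-by-half, which is the PyTorch default, and compute_loss is a hypothetical helper:

```python
import torch

scaler = torch.cuda.amp.GradScaler(
    init_scale=32.0,      # matches the grad_scale logged earlier in the epoch
    backoff_factor=0.5,   # halve on inf/nan grads: 32.0 -> 16.0
    growth_factor=2.0,
)

def training_step(model, optimizer, batch, compute_loss):
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():
        loss = compute_loss(model, batch)  # hypothetical loss helper
    scaler.scale(loss).backward()
    scaler.step(optimizer)  # skipped (and the scale backed off) on overflow
    scaler.update()
    return loss.detach()
```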
limit=22.5 2023-10-10 17:28:45,124 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=431573.3333333333, ans=0.0 2023-10-10 17:28:49,051 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.381e+02 1.695e+02 1.853e+02 2.056e+02 3.284e+02, threshold=3.706e+02, percent-clipped=0.0 2023-10-10 17:28:56,573 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=431620.0, ans=0.0 2023-10-10 17:29:04,046 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=431620.0, ans=0.2 2023-10-10 17:30:19,800 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=431900.0, ans=0.0 2023-10-10 17:30:32,314 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=431946.6666666667, ans=0.05 2023-10-10 17:30:33,219 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=431946.6666666667, ans=0.1 2023-10-10 17:30:35,936 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=431946.6666666667, ans=0.07 2023-10-10 17:30:43,197 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=431993.3333333333, ans=0.0 2023-10-10 17:30:45,949 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 17:30:51,912 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.451e+02 1.772e+02 1.967e+02 2.211e+02 3.232e+02, threshold=3.933e+02, percent-clipped=0.0 2023-10-10 17:31:07,323 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=432086.6666666667, ans=0.125 2023-10-10 17:31:12,643 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=432133.3333333333, ans=0.125 2023-10-10 17:31:28,764 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=432180.0, ans=0.125 2023-10-10 17:31:49,476 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.26 vs. limit=15.0 2023-10-10 17:31:50,229 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=432273.3333333333, ans=0.125 2023-10-10 17:31:52,772 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=432273.3333333333, ans=0.1 2023-10-10 17:32:07,711 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.81 vs. limit=12.0 2023-10-10 17:32:35,348 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=432413.3333333333, ans=0.1 2023-10-10 17:32:53,568 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.86 vs. 
limit=15.0 2023-10-10 17:32:58,714 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.391e+02 1.729e+02 1.894e+02 2.136e+02 3.132e+02, threshold=3.788e+02, percent-clipped=0.0 2023-10-10 17:33:01,056 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=432506.6666666667, ans=0.125 2023-10-10 17:33:09,056 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.60 vs. limit=15.0 2023-10-10 17:33:22,799 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=432600.0, ans=0.125 2023-10-10 17:33:29,584 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=432646.6666666667, ans=0.1 2023-10-10 17:33:31,988 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.51 vs. limit=15.0 2023-10-10 17:33:32,922 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=432646.6666666667, ans=0.2 2023-10-10 17:33:40,783 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.01 vs. limit=15.0 2023-10-10 17:34:17,607 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=432833.3333333333, ans=0.1 2023-10-10 17:34:21,439 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=432833.3333333333, ans=0.2 2023-10-10 17:34:22,588 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=432880.0, ans=0.1 2023-10-10 17:34:25,953 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=432880.0, ans=0.2 2023-10-10 17:34:26,585 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=432880.0, ans=0.125 2023-10-10 17:34:29,281 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=432880.0, ans=0.05 2023-10-10 17:34:35,640 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=432926.6666666667, ans=0.125 2023-10-10 17:34:49,788 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.320e+02 1.740e+02 1.912e+02 2.264e+02 2.839e+02, threshold=3.825e+02, percent-clipped=0.0 2023-10-10 17:34:52,386 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=432973.3333333333, ans=0.0 2023-10-10 17:35:07,973 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.09 vs. 
limit=15.0 2023-10-10 17:35:50,576 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=433206.6666666667, ans=0.0 2023-10-10 17:35:56,840 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=433253.3333333333, ans=0.125 2023-10-10 17:35:58,853 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=433253.3333333333, ans=10.0 2023-10-10 17:36:07,307 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=433300.0, ans=0.2 2023-10-10 17:36:10,538 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=433300.0, ans=0.0 2023-10-10 17:36:28,988 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=433393.3333333333, ans=0.125 2023-10-10 17:36:38,751 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=433440.0, ans=0.0 2023-10-10 17:36:42,322 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.390e+02 1.641e+02 1.793e+02 2.027e+02 3.111e+02, threshold=3.587e+02, percent-clipped=0.0 2023-10-10 17:36:42,997 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.97 vs. limit=15.0 2023-10-10 17:36:55,027 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=433486.6666666667, ans=0.0 2023-10-10 17:36:59,353 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.12 vs. limit=12.0 2023-10-10 17:37:02,460 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=433486.6666666667, ans=0.1 2023-10-10 17:37:09,080 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=433533.3333333333, ans=0.2 2023-10-10 17:37:24,377 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=6.70 vs. limit=15.0 2023-10-10 17:37:40,394 INFO [train.py:1031] (3/4) Epoch 7, batch 11000, loss[loss=0.2062, simple_loss=0.2985, pruned_loss=0.05691, over 16889.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.3033, pruned_loss=0.06594, over 32653938.26 frames. ], batch size: 87, lr: 5.17e-03, grad_scale: 32.0 2023-10-10 17:37:54,934 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.94 vs. limit=10.0 2023-10-10 17:38:08,220 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.67 vs. 
limit=22.5 2023-10-10 17:38:11,796 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=433766.6666666667, ans=0.125 2023-10-10 17:38:41,068 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.417e+02 1.731e+02 1.990e+02 2.250e+02 3.777e+02, threshold=3.979e+02, percent-clipped=1.0 2023-10-10 17:38:48,822 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=433953.3333333333, ans=0.125 2023-10-10 17:39:22,977 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=434093.3333333333, ans=0.125 2023-10-10 17:39:25,462 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=434093.3333333333, ans=0.0 2023-10-10 17:39:28,962 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.35 vs. limit=15.0 2023-10-10 17:39:30,397 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.81 vs. limit=6.0 2023-10-10 17:39:38,119 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=434140.0, ans=0.1 2023-10-10 17:39:59,862 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.63 vs. limit=22.5 2023-10-10 17:40:00,919 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.69 vs. limit=6.0 2023-10-10 17:40:05,741 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=434233.3333333333, ans=0.125 2023-10-10 17:40:15,933 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.63 vs. limit=6.0 2023-10-10 17:40:46,781 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.341e+02 1.601e+02 1.809e+02 2.087e+02 3.050e+02, threshold=3.619e+02, percent-clipped=0.0 2023-10-10 17:40:50,142 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.77 vs. 
limit=15.0 2023-10-10 17:40:58,671 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 17:40:59,574 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=434420.0, ans=0.125 2023-10-10 17:41:02,728 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=434420.0, ans=0.0 2023-10-10 17:41:18,697 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=434513.3333333333, ans=0.1 2023-10-10 17:41:33,691 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=434560.0, ans=0.0 2023-10-10 17:41:39,761 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=434606.6666666667, ans=0.125 2023-10-10 17:41:46,936 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-10 17:41:47,889 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=434653.3333333333, ans=0.125 2023-10-10 17:41:48,312 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.73 vs. limit=15.0 2023-10-10 17:41:49,735 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=434653.3333333333, ans=0.125 2023-10-10 17:41:51,729 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=434653.3333333333, ans=0.125 2023-10-10 17:42:00,728 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.83 vs. limit=15.0 2023-10-10 17:42:02,061 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=434700.0, ans=0.125 2023-10-10 17:42:08,678 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.83 vs. limit=15.0 2023-10-10 17:42:09,415 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=434746.6666666667, ans=0.125 2023-10-10 17:42:09,890 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.77 vs. limit=15.0 2023-10-10 17:42:10,265 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=434746.6666666667, ans=0.125 2023-10-10 17:42:11,294 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=434746.6666666667, ans=0.0 2023-10-10 17:42:35,495 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=3.51 vs. 
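The WithLoss entries report the summed value of an auxiliary penalty attached directly to an activation (here the attention weights); it is mostly 0.000e+00 and occasionally small and positive, e.g. the 1.068e-02 logged earlier. One way to attach such a penalty to an intermediate tensor without threading it back through the training loop is a custom autograd function that injects the penalty's gradient during backward; a sketch under that assumption, with an example penalty that is not the one scaling.py actually uses:

```python
import torch

class AttachAuxLoss(torch.autograd.Function):
    """Pass x through unchanged, but add the gradient of an auxiliary
    penalty to x's gradient on the backward pass. A sketch of the idea
    behind the WithLoss log lines, not the scaling.py implementation."""

    @staticmethod
    def forward(ctx, x: torch.Tensor, scale: float):
        ctx.save_for_backward(x)
        ctx.scale = scale
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output: torch.Tensor):
        (x,) = ctx.saved_tensors
        with torch.enable_grad():
            x = x.detach().requires_grad_(True)
            # example penalty: discourage attention weights above 0.5
            penalty = torch.relu(x - 0.5).sum()
            (aux_grad,) = torch.autograd.grad(penalty, x)
        return grad_output + ctx.scale * aux_grad, None

weights = torch.rand(4, 8, requires_grad=True)
out = AttachAuxLoss.apply(weights, 0.1)
out.sum().backward()  # the gradient now includes the penalty term
```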
limit=12.0 2023-10-10 17:42:37,032 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.314e+02 1.680e+02 1.842e+02 2.036e+02 2.579e+02, threshold=3.683e+02, percent-clipped=0.0 2023-10-10 17:43:05,104 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.19 vs. limit=22.5 2023-10-10 17:43:32,840 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.13 vs. limit=15.0 2023-10-10 17:43:35,612 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 17:43:38,161 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=435073.3333333333, ans=0.125 2023-10-10 17:43:39,032 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=435073.3333333333, ans=0.0 2023-10-10 17:43:42,008 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=435073.3333333333, ans=0.125 2023-10-10 17:43:54,591 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=435120.0, ans=0.1 2023-10-10 17:43:59,233 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=435120.0, ans=0.0 2023-10-10 17:44:07,735 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=435166.6666666667, ans=0.0 2023-10-10 17:44:07,874 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=435166.6666666667, ans=0.1 2023-10-10 17:44:37,054 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=435260.0, ans=0.125 2023-10-10 17:44:37,060 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=435260.0, ans=0.125 2023-10-10 17:44:41,122 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=435306.6666666667, ans=0.125 2023-10-10 17:44:44,776 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.423e+02 1.706e+02 1.831e+02 2.217e+02 3.339e+02, threshold=3.663e+02, percent-clipped=0.0 2023-10-10 17:44:50,605 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.09 vs. limit=15.0 2023-10-10 17:45:07,671 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=435400.0, ans=0.125 2023-10-10 17:45:28,677 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=6.94 vs. 
limit=15.0 2023-10-10 17:45:33,433 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=435493.3333333333, ans=0.1 2023-10-10 17:45:34,266 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=435493.3333333333, ans=0.0 2023-10-10 17:45:56,454 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=435586.6666666667, ans=0.125 2023-10-10 17:45:59,523 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=435633.3333333333, ans=0.1 2023-10-10 17:46:03,367 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=435633.3333333333, ans=0.0 2023-10-10 17:46:09,591 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.00 vs. limit=15.0 2023-10-10 17:46:32,381 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=435726.6666666667, ans=0.1 2023-10-10 17:46:41,379 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.91 vs. limit=10.0 2023-10-10 17:46:44,449 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.470e+02 1.827e+02 2.026e+02 2.323e+02 3.333e+02, threshold=4.051e+02, percent-clipped=0.0 2023-10-10 17:46:48,262 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=435773.3333333333, ans=0.05 2023-10-10 17:46:55,212 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=435820.0, ans=0.125 2023-10-10 17:46:58,069 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=435820.0, ans=0.1 2023-10-10 17:47:02,554 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=435866.6666666667, ans=0.04949747468305833 2023-10-10 17:47:12,902 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=435913.3333333333, ans=0.125 2023-10-10 17:47:36,628 INFO [train.py:1031] (3/4) Epoch 7, batch 11500, loss[loss=0.2401, simple_loss=0.3198, pruned_loss=0.08023, over 15540.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.3028, pruned_loss=0.06566, over 32686539.77 frames. ], batch size: 35, lr: 5.16e-03, grad_scale: 16.0 2023-10-10 17:47:48,401 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=436053.3333333333, ans=0.0 2023-10-10 17:47:50,458 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.25 vs. limit=22.5 2023-10-10 17:48:09,329 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=436146.6666666667, ans=0.0 2023-10-10 17:48:30,132 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.65 vs. 
limit=15.0 2023-10-10 17:48:30,899 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=436193.3333333333, ans=0.125 2023-10-10 17:48:43,148 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.402e+02 1.767e+02 1.974e+02 2.161e+02 3.123e+02, threshold=3.948e+02, percent-clipped=0.0 2023-10-10 17:48:58,571 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=436286.6666666667, ans=0.1 2023-10-10 17:49:01,186 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=436333.3333333333, ans=0.125 2023-10-10 17:49:19,904 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=436380.0, ans=0.1 2023-10-10 17:49:21,200 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=436380.0, ans=0.125 2023-10-10 17:49:53,289 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=436520.0, ans=0.1 2023-10-10 17:49:59,344 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=436520.0, ans=0.125 2023-10-10 17:50:01,673 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=436520.0, ans=0.0 2023-10-10 17:50:08,745 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=436566.6666666667, ans=0.1 2023-10-10 17:50:17,514 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=2.640e-03 2023-10-10 17:50:25,858 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=436613.3333333333, ans=0.125 2023-10-10 17:50:33,906 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=436660.0, ans=0.125 2023-10-10 17:50:38,683 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=436660.0, ans=0.125 2023-10-10 17:50:42,391 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.40 vs. 
limit=15.0 2023-10-10 17:50:46,278 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.218e+02 1.611e+02 1.788e+02 2.012e+02 3.156e+02, threshold=3.576e+02, percent-clipped=0.0 2023-10-10 17:50:58,187 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=436753.3333333333, ans=0.0 2023-10-10 17:51:21,508 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=436846.6666666667, ans=0.2 2023-10-10 17:51:29,163 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=436893.3333333333, ans=0.04949747468305833 2023-10-10 17:52:11,936 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=437080.0, ans=0.0 2023-10-10 17:52:13,335 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 17:52:15,338 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 17:52:19,133 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.28 vs. limit=12.0 2023-10-10 17:52:45,882 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.305e+02 1.685e+02 1.821e+02 2.047e+02 3.340e+02, threshold=3.643e+02, percent-clipped=0.0 2023-10-10 17:53:02,959 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=437220.0, ans=0.0 2023-10-10 17:53:07,189 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=437220.0, ans=0.5 2023-10-10 17:53:24,111 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=437313.3333333333, ans=0.125 2023-10-10 17:53:31,253 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.85 vs. limit=15.0 2023-10-10 17:53:31,453 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.73 vs. limit=15.0 2023-10-10 17:53:45,920 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=12.96 vs. 
limit=22.5 2023-10-10 17:54:13,375 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=437500.0, ans=0.125 2023-10-10 17:54:24,694 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=437500.0, ans=0.1 2023-10-10 17:54:25,963 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 17:54:46,003 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=437593.3333333333, ans=0.0 2023-10-10 17:54:48,955 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=437593.3333333333, ans=0.125 2023-10-10 17:54:57,496 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.350e+02 1.650e+02 1.874e+02 2.186e+02 3.784e+02, threshold=3.747e+02, percent-clipped=2.0 2023-10-10 17:55:14,104 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=437686.6666666667, ans=0.125 2023-10-10 17:55:17,213 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=437733.3333333333, ans=0.035 2023-10-10 17:55:52,808 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=437826.6666666667, ans=0.1 2023-10-10 17:56:02,312 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=437873.3333333333, ans=0.0 2023-10-10 17:56:54,745 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=438060.0, ans=0.0 2023-10-10 17:56:56,134 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=438060.0, ans=0.0 2023-10-10 17:57:04,601 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.386e+02 1.724e+02 1.886e+02 2.174e+02 3.195e+02, threshold=3.773e+02, percent-clipped=0.0 2023-10-10 17:57:05,023 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=438106.6666666667, ans=0.125 2023-10-10 17:57:12,927 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=438153.3333333333, ans=0.1 2023-10-10 17:57:27,320 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.44 vs. limit=15.0 2023-10-10 17:57:29,613 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=438200.0, ans=0.2 2023-10-10 17:57:35,322 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=438246.6666666667, ans=0.0 2023-10-10 17:57:47,120 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 17:57:58,297 INFO [train.py:1031] (3/4) Epoch 7, batch 12000, loss[loss=0.2477, simple_loss=0.3242, pruned_loss=0.08563, over 16024.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.3028, pruned_loss=0.06519, over 32733719.91 frames. 
], batch size: 296, lr: 5.15e-03, grad_scale: 32.0 2023-10-10 17:57:58,999 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=438340.0, ans=0.125 2023-10-10 17:58:56,918 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=438573.3333333333, ans=0.125 2023-10-10 17:59:02,937 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.381e+02 1.696e+02 1.899e+02 2.137e+02 3.679e+02, threshold=3.798e+02, percent-clipped=0.0 2023-10-10 17:59:18,897 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=438620.0, ans=0.2 2023-10-10 17:59:23,707 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=438666.6666666667, ans=6.0 2023-10-10 17:59:25,737 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=438666.6666666667, ans=0.09899494936611666 2023-10-10 18:00:04,250 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=438806.6666666667, ans=0.125 2023-10-10 18:00:09,788 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 18:00:13,042 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=438853.3333333333, ans=0.125 2023-10-10 18:00:15,833 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=438853.3333333333, ans=0.0 2023-10-10 18:00:19,825 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.19 vs. limit=15.0 2023-10-10 18:00:28,020 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=438900.0, ans=0.125 2023-10-10 18:00:48,573 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.80 vs. limit=10.0 2023-10-10 18:01:02,977 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.413e+02 1.660e+02 1.811e+02 1.988e+02 2.780e+02, threshold=3.623e+02, percent-clipped=0.0 2023-10-10 18:01:03,446 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=439040.0, ans=0.04949747468305833 2023-10-10 18:01:27,828 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=439133.3333333333, ans=0.07 2023-10-10 18:01:39,615 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 18:01:47,418 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=439226.6666666667, ans=0.125 2023-10-10 18:01:48,776 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.48 vs. 
limit=15.0 2023-10-10 18:01:52,287 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=439226.6666666667, ans=0.0 2023-10-10 18:02:03,895 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.87 vs. limit=15.0 2023-10-10 18:02:05,991 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=439320.0, ans=0.125 2023-10-10 18:02:07,292 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.83 vs. limit=15.0 2023-10-10 18:02:37,660 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.08 vs. limit=22.5 2023-10-10 18:02:44,281 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=439460.0, ans=0.125 2023-10-10 18:02:49,211 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=439460.0, ans=0.125 2023-10-10 18:03:00,628 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.357e+02 1.749e+02 1.942e+02 2.153e+02 3.165e+02, threshold=3.884e+02, percent-clipped=0.0 2023-10-10 18:03:19,919 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=439600.0, ans=0.1 2023-10-10 18:03:29,809 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=439646.6666666667, ans=0.0 2023-10-10 18:03:36,534 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=439646.6666666667, ans=0.2 2023-10-10 18:03:45,761 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=439693.3333333333, ans=0.125 2023-10-10 18:03:52,912 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.29 vs. limit=15.0 2023-10-10 18:03:53,823 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten.whitening_limit, batch_count=439740.0, ans=15.0 2023-10-10 18:03:53,832 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.84 vs. 
limit=15.0 2023-10-10 18:03:56,755 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=439740.0, ans=0.05 2023-10-10 18:04:20,934 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=439833.3333333333, ans=0.125 2023-10-10 18:04:28,133 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=439833.3333333333, ans=0.125 2023-10-10 18:04:33,434 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=439880.0, ans=0.0 2023-10-10 18:04:40,238 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=439880.0, ans=0.1 2023-10-10 18:04:59,578 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=439973.3333333333, ans=0.1 2023-10-10 18:05:00,234 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.438e+02 1.780e+02 2.007e+02 2.313e+02 4.273e+02, threshold=4.014e+02, percent-clipped=1.0 2023-10-10 18:05:16,998 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=440020.0, ans=0.125 2023-10-10 18:05:20,721 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.94 vs. limit=12.0 2023-10-10 18:05:38,106 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.79 vs. limit=15.0 2023-10-10 18:05:43,294 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=440160.0, ans=0.0 2023-10-10 18:06:17,178 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.23 vs. limit=15.0 2023-10-10 18:06:55,523 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=440393.3333333333, ans=0.0 2023-10-10 18:06:57,538 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=440440.0, ans=0.0 2023-10-10 18:07:02,816 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.442e+02 1.807e+02 2.088e+02 2.424e+02 3.506e+02, threshold=4.177e+02, percent-clipped=0.0 2023-10-10 18:07:04,918 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=440440.0, ans=0.125 2023-10-10 18:07:22,114 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=440533.3333333333, ans=0.1 2023-10-10 18:07:31,422 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.38 vs. limit=15.0 2023-10-10 18:07:57,584 INFO [train.py:1031] (3/4) Epoch 7, batch 12500, loss[loss=0.1976, simple_loss=0.2936, pruned_loss=0.05079, over 16838.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.3024, pruned_loss=0.06518, over 32758312.52 frames. 
], batch size: 87, lr: 5.13e-03, grad_scale: 32.0 2023-10-10 18:08:14,313 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=440720.0, ans=0.0 2023-10-10 18:08:17,848 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=440720.0, ans=0.0 2023-10-10 18:08:23,870 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.44 vs. limit=15.0 2023-10-10 18:08:27,590 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.11 vs. limit=6.0 2023-10-10 18:08:30,388 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.92 vs. limit=15.0 2023-10-10 18:08:45,953 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=440860.0, ans=0.07 2023-10-10 18:08:52,584 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=440860.0, ans=0.0 2023-10-10 18:08:55,492 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=440906.6666666667, ans=0.0 2023-10-10 18:08:57,190 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=440906.6666666667, ans=0.125 2023-10-10 18:08:59,653 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.427e+02 1.683e+02 1.854e+02 2.094e+02 3.201e+02, threshold=3.708e+02, percent-clipped=0.0 2023-10-10 18:08:59,935 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=440906.6666666667, ans=0.125 2023-10-10 18:09:17,190 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=441000.0, ans=0.125 2023-10-10 18:09:28,965 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=441046.6666666667, ans=0.0 2023-10-10 18:09:32,793 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=441046.6666666667, ans=0.2 2023-10-10 18:09:53,797 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=441140.0, ans=0.125 2023-10-10 18:09:56,520 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=441140.0, ans=0.125 2023-10-10 18:10:03,314 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.78 vs. limit=15.0 2023-10-10 18:10:04,247 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.23 vs. 
limit=22.5 2023-10-10 18:10:08,301 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=441186.6666666667, ans=0.0 2023-10-10 18:10:15,613 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=441233.3333333333, ans=0.025 2023-10-10 18:10:40,647 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=6.35 vs. limit=15.0 2023-10-10 18:10:43,042 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=441326.6666666667, ans=0.0 2023-10-10 18:10:56,821 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.374e+02 1.680e+02 1.860e+02 2.167e+02 3.640e+02, threshold=3.720e+02, percent-clipped=0.0 2023-10-10 18:11:04,008 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=441420.0, ans=0.0 2023-10-10 18:11:40,128 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=441560.0, ans=0.125 2023-10-10 18:11:41,860 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=441560.0, ans=0.1 2023-10-10 18:11:49,076 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=441560.0, ans=0.0 2023-10-10 18:11:51,815 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 18:11:54,664 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.49 vs. limit=22.5 2023-10-10 18:12:37,229 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=441793.3333333333, ans=0.0 2023-10-10 18:12:49,525 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=441793.3333333333, ans=0.09899494936611666 2023-10-10 18:12:52,780 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=441840.0, ans=0.0 2023-10-10 18:12:57,301 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.301e+02 1.684e+02 1.935e+02 2.132e+02 3.537e+02, threshold=3.870e+02, percent-clipped=0.0 2023-10-10 18:13:40,896 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.90 vs. 
limit=15.0 2023-10-10 18:13:43,379 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=442026.6666666667, ans=0.05 2023-10-10 18:13:47,647 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=442026.6666666667, ans=0.125 2023-10-10 18:14:26,550 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=442213.3333333333, ans=0.125 2023-10-10 18:14:53,794 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=442306.6666666667, ans=0.125 2023-10-10 18:14:58,647 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.495e+02 1.746e+02 1.925e+02 2.149e+02 3.455e+02, threshold=3.850e+02, percent-clipped=0.0 2023-10-10 18:15:24,517 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=442400.0, ans=0.0 2023-10-10 18:15:33,854 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=442446.6666666667, ans=0.2 2023-10-10 18:15:34,696 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=442446.6666666667, ans=0.1 2023-10-10 18:15:40,337 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=442493.3333333333, ans=0.2 2023-10-10 18:15:42,353 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.43 vs. limit=6.0 2023-10-10 18:16:04,607 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=442586.6666666667, ans=0.0 2023-10-10 18:16:19,817 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=1.90 vs. limit=15.0 2023-10-10 18:16:32,634 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=442680.0, ans=0.0 2023-10-10 18:16:45,764 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.31 vs. limit=6.0 2023-10-10 18:17:00,444 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.385e+02 1.619e+02 1.806e+02 2.042e+02 3.289e+02, threshold=3.612e+02, percent-clipped=0.0 2023-10-10 18:17:10,610 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=442820.0, ans=0.125 2023-10-10 18:17:13,227 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=442820.0, ans=0.2 2023-10-10 18:17:26,080 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=442866.6666666667, ans=0.2 2023-10-10 18:17:27,251 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.30 vs. 
limit=15.0 2023-10-10 18:17:28,114 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=442866.6666666667, ans=0.1 2023-10-10 18:17:38,228 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=442913.3333333333, ans=0.125 2023-10-10 18:17:52,552 INFO [train.py:1031] (3/4) Epoch 7, batch 13000, loss[loss=0.2507, simple_loss=0.3197, pruned_loss=0.09089, over 16030.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.303, pruned_loss=0.0653, over 32757197.33 frames. ], batch size: 296, lr: 5.12e-03, grad_scale: 16.0 2023-10-10 18:18:15,427 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=443053.3333333333, ans=0.125 2023-10-10 18:18:34,536 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=443146.6666666667, ans=0.1 2023-10-10 18:18:43,968 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.35 vs. limit=22.5 2023-10-10 18:19:01,143 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=443240.0, ans=0.0 2023-10-10 18:19:04,415 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.432e+02 1.719e+02 1.904e+02 2.119e+02 2.835e+02, threshold=3.807e+02, percent-clipped=0.0 2023-10-10 18:19:28,156 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.30 vs. limit=15.0 2023-10-10 18:20:01,426 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=443473.3333333333, ans=10.0 2023-10-10 18:21:05,978 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.437e+02 1.783e+02 1.996e+02 2.275e+02 3.876e+02, threshold=3.992e+02, percent-clipped=1.0 2023-10-10 18:21:35,625 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=443800.0, ans=0.07 2023-10-10 18:21:51,704 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=443893.3333333333, ans=0.2 2023-10-10 18:22:29,225 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.89 vs. 
limit=15.0 2023-10-10 18:22:31,221 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=444033.3333333333, ans=0.1 2023-10-10 18:22:49,962 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=444126.6666666667, ans=0.125 2023-10-10 18:22:55,449 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_ff2.min_abs, batch_count=444126.6666666667, ans=0.1 2023-10-10 18:23:00,141 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=444173.3333333333, ans=0.0 2023-10-10 18:23:08,566 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.218e+02 1.656e+02 1.837e+02 2.031e+02 2.706e+02, threshold=3.674e+02, percent-clipped=0.0 2023-10-10 18:23:13,373 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=444220.0, ans=0.0 2023-10-10 18:23:29,159 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=444266.6666666667, ans=0.025 2023-10-10 18:23:41,889 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=444313.3333333333, ans=0.125 2023-10-10 18:23:55,864 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=444360.0, ans=0.0 2023-10-10 18:24:23,561 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=444500.0, ans=0.125 2023-10-10 18:24:23,963 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.49 vs. limit=6.0 2023-10-10 18:24:56,790 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=444640.0, ans=0.125 2023-10-10 18:24:57,088 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.39 vs. limit=22.5 2023-10-10 18:25:03,419 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.04 vs. limit=15.0 2023-10-10 18:25:04,684 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.449e+02 1.753e+02 1.964e+02 2.230e+02 2.801e+02, threshold=3.928e+02, percent-clipped=0.0 2023-10-10 18:25:05,049 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=444640.0, ans=0.125 2023-10-10 18:25:08,245 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=444686.6666666667, ans=0.2 2023-10-10 18:25:08,282 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=444686.6666666667, ans=0.0 2023-10-10 18:25:43,428 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.44 vs. 
limit=15.0 2023-10-10 18:25:50,350 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=444826.6666666667, ans=0.125 2023-10-10 18:25:51,484 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=444826.6666666667, ans=0.0 2023-10-10 18:26:15,821 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=444920.0, ans=0.125 2023-10-10 18:26:16,695 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=444920.0, ans=0.125 2023-10-10 18:26:25,806 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=444966.6666666667, ans=0.125 2023-10-10 18:26:31,134 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=444966.6666666667, ans=10.0 2023-10-10 18:26:42,282 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=445013.3333333333, ans=0.04949747468305833 2023-10-10 18:27:04,128 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.395e+02 1.746e+02 1.932e+02 2.346e+02 3.404e+02, threshold=3.864e+02, percent-clipped=0.0 2023-10-10 18:27:16,361 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.08 vs. limit=15.0 2023-10-10 18:27:24,146 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.71 vs. limit=15.0 2023-10-10 18:27:33,225 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=445246.6666666667, ans=0.1 2023-10-10 18:27:34,313 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=445246.6666666667, ans=0.125 2023-10-10 18:27:53,848 INFO [train.py:1031] (3/4) Epoch 7, batch 13500, loss[loss=0.1898, simple_loss=0.2565, pruned_loss=0.06156, over 12255.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.3023, pruned_loss=0.06514, over 32761117.26 frames. 
], batch size: 440, lr: 5.11e-03, grad_scale: 32.0 2023-10-10 18:28:00,811 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=445340.0, ans=0.0 2023-10-10 18:28:06,253 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=445386.6666666667, ans=0.1 2023-10-10 18:28:13,182 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=445386.6666666667, ans=0.1 2023-10-10 18:28:24,879 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=445433.3333333333, ans=0.0 2023-10-10 18:28:47,060 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=445526.6666666667, ans=0.125 2023-10-10 18:28:56,903 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.357e+02 1.709e+02 1.944e+02 2.257e+02 3.330e+02, threshold=3.888e+02, percent-clipped=0.0 2023-10-10 18:28:57,452 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.85 vs. limit=10.0 2023-10-10 18:29:00,909 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=445620.0, ans=0.125 2023-10-10 18:29:03,067 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.42 vs. limit=15.0 2023-10-10 18:29:03,721 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=445620.0, ans=0.1 2023-10-10 18:29:28,150 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=445713.3333333333, ans=0.2 2023-10-10 18:29:49,402 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=445806.6666666667, ans=0.125 2023-10-10 18:29:50,236 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=445806.6666666667, ans=0.1 2023-10-10 18:29:53,230 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=445806.6666666667, ans=0.1 2023-10-10 18:29:56,818 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.99 vs. limit=15.0 2023-10-10 18:30:02,733 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=445853.3333333333, ans=0.0 2023-10-10 18:30:05,516 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.04 vs. limit=15.0 2023-10-10 18:30:05,548 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=9.79 vs. 
limit=15.0 2023-10-10 18:30:06,960 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=445853.3333333333, ans=0.125 2023-10-10 18:30:20,946 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=445946.6666666667, ans=0.0 2023-10-10 18:30:29,221 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=445993.3333333333, ans=0.07 2023-10-10 18:31:21,241 INFO [train.py:1031] (3/4) Epoch 8, batch 0, loss[loss=0.1778, simple_loss=0.2734, pruned_loss=0.04112, over 16924.00 frames. ], tot_loss[loss=0.1778, simple_loss=0.2734, pruned_loss=0.04112, over 16924.00 frames. ], batch size: 82, lr: 4.73e-03, grad_scale: 32.0 2023-10-10 18:31:21,242 INFO [train.py:1054] (3/4) Computing validation loss 2023-10-10 18:31:30,341 INFO [train.py:1063] (3/4) Epoch 8, validation: loss=0.2272, simple_loss=0.314, pruned_loss=0.07016, over 1020973.00 frames. 2023-10-10 18:31:30,342 INFO [train.py:1064] (3/4) Maximum memory allocated so far is 16891MB 2023-10-10 18:31:31,534 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.342e+02 1.738e+02 1.931e+02 2.283e+02 5.705e+02, threshold=3.861e+02, percent-clipped=1.0 2023-10-10 18:32:16,484 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=446250.0, ans=0.125 2023-10-10 18:32:26,391 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=446250.0, ans=0.125 2023-10-10 18:32:26,463 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=446250.0, ans=0.2 2023-10-10 18:32:46,437 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.23 vs. limit=15.0 2023-10-10 18:32:59,063 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.57 vs. limit=22.5 2023-10-10 18:33:09,384 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.68 vs. limit=15.0 2023-10-10 18:33:12,380 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=446436.6666666667, ans=0.1 2023-10-10 18:33:32,677 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.379e+02 1.768e+02 2.068e+02 2.359e+02 4.985e+02, threshold=4.136e+02, percent-clipped=3.0 2023-10-10 18:33:41,125 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.04 vs. limit=22.5 2023-10-10 18:33:42,072 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=6.65 vs. 
limit=15.0 2023-10-10 18:33:56,831 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=446623.3333333333, ans=0.04949747468305833 2023-10-10 18:34:19,528 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=446716.6666666667, ans=0.1 2023-10-10 18:34:22,005 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=446716.6666666667, ans=0.125 2023-10-10 18:34:25,693 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=446716.6666666667, ans=0.0 2023-10-10 18:34:27,671 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=446716.6666666667, ans=0.0 2023-10-10 18:34:39,172 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=446763.3333333333, ans=0.125 2023-10-10 18:34:47,146 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=446810.0, ans=0.125 2023-10-10 18:35:27,859 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.368e+02 1.782e+02 1.937e+02 2.179e+02 3.114e+02, threshold=3.875e+02, percent-clipped=0.0 2023-10-10 18:35:37,332 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=447043.3333333333, ans=0.125 2023-10-10 18:35:52,367 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=447090.0, ans=0.125 2023-10-10 18:35:59,391 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=447136.6666666667, ans=0.95 2023-10-10 18:36:00,203 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=447136.6666666667, ans=0.0 2023-10-10 18:37:06,972 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=447370.0, ans=0.125 2023-10-10 18:37:12,119 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.92 vs. limit=15.0 2023-10-10 18:37:12,148 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.83 vs. limit=22.5 2023-10-10 18:37:21,941 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=447416.6666666667, ans=0.0 2023-10-10 18:37:28,107 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.358e+02 1.786e+02 1.930e+02 2.325e+02 3.073e+02, threshold=3.860e+02, percent-clipped=0.0 2023-10-10 18:37:50,183 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=447556.6666666667, ans=0.0 2023-10-10 18:38:11,083 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.67 vs. 
limit=10.0 2023-10-10 18:38:21,931 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=447650.0, ans=0.125 2023-10-10 18:38:23,781 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=447650.0, ans=0.04949747468305833 2023-10-10 18:38:29,198 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 18:38:47,154 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.08 vs. limit=15.0 2023-10-10 18:38:50,617 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=447790.0, ans=15.0 2023-10-10 18:38:51,162 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=447790.0, ans=0.0 2023-10-10 18:38:58,169 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=447790.0, ans=0.125 2023-10-10 18:39:05,919 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=447836.6666666667, ans=0.1 2023-10-10 18:39:21,988 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=447930.0, ans=0.1 2023-10-10 18:39:24,858 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.302e+02 1.727e+02 1.913e+02 2.087e+02 2.905e+02, threshold=3.825e+02, percent-clipped=0.0 2023-10-10 18:39:30,472 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=447930.0, ans=22.5 2023-10-10 18:39:53,935 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=448023.3333333333, ans=0.09899494936611666 2023-10-10 18:39:57,955 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.09 vs. limit=15.0 2023-10-10 18:40:03,848 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=448070.0, ans=0.125 2023-10-10 18:40:18,118 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=7.38 vs. limit=15.0 2023-10-10 18:40:19,624 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.05 vs. limit=15.0 2023-10-10 18:40:28,033 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=448163.3333333333, ans=0.0 2023-10-10 18:40:37,340 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=448210.0, ans=0.125 2023-10-10 18:41:11,298 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=448350.0, ans=0.2 2023-10-10 18:41:11,567 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.96 vs. 
limit=15.0 2023-10-10 18:41:18,209 INFO [train.py:1031] (3/4) Epoch 8, batch 500, loss[loss=0.2122, simple_loss=0.2988, pruned_loss=0.06274, over 16878.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.3015, pruned_loss=0.06497, over 7283762.25 frames. ], batch size: 155, lr: 4.72e-03, grad_scale: 16.0 2023-10-10 18:41:20,349 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.398e+02 1.702e+02 1.882e+02 2.126e+02 3.210e+02, threshold=3.763e+02, percent-clipped=0.0 2023-10-10 18:41:20,687 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=448396.6666666667, ans=0.125 2023-10-10 18:41:27,288 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=448396.6666666667, ans=0.125 2023-10-10 18:41:42,386 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=448490.0, ans=0.125 2023-10-10 18:42:29,593 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=448676.6666666667, ans=0.125 2023-10-10 18:42:30,940 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=448676.6666666667, ans=0.125 2023-10-10 18:43:05,223 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.12 vs. limit=22.5 2023-10-10 18:43:14,470 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.331e+02 1.846e+02 2.141e+02 2.412e+02 3.601e+02, threshold=4.282e+02, percent-clipped=0.0 2023-10-10 18:43:47,566 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=449003.3333333333, ans=0.1 2023-10-10 18:44:22,399 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.87 vs. limit=10.0 2023-10-10 18:44:28,808 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=449143.3333333333, ans=0.0 2023-10-10 18:44:49,809 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=449236.6666666667, ans=0.125 2023-10-10 18:45:03,806 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.465e+02 1.805e+02 2.063e+02 2.309e+02 3.366e+02, threshold=4.127e+02, percent-clipped=0.0 2023-10-10 18:45:14,243 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.79 vs. 
limit=12.0 2023-10-10 18:45:18,054 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=449376.6666666667, ans=0.125 2023-10-10 18:45:18,062 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=449376.6666666667, ans=0.1 2023-10-10 18:45:20,971 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=449376.6666666667, ans=0.125 2023-10-10 18:45:35,043 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=449470.0, ans=0.1 2023-10-10 18:45:54,094 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.15 vs. limit=22.5 2023-10-10 18:45:58,838 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=449516.6666666667, ans=0.125 2023-10-10 18:46:16,284 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=449610.0, ans=0.0 2023-10-10 18:46:21,825 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=449610.0, ans=0.1 2023-10-10 18:46:33,224 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.11 vs. limit=15.0 2023-10-10 18:46:36,430 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=449656.6666666667, ans=0.125 2023-10-10 18:46:49,413 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=449750.0, ans=0.04949747468305833 2023-10-10 18:47:04,273 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.402e+02 1.721e+02 1.886e+02 2.253e+02 3.352e+02, threshold=3.772e+02, percent-clipped=0.0 2023-10-10 18:47:33,574 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.86 vs. limit=6.0 2023-10-10 18:47:48,447 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.59 vs. 
limit=6.0 2023-10-10 18:48:27,437 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=450076.6666666667, ans=0.125 2023-10-10 18:48:28,409 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=450076.6666666667, ans=0.125 2023-10-10 18:49:02,011 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=450216.6666666667, ans=0.125 2023-10-10 18:49:02,097 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=450216.6666666667, ans=0.125 2023-10-10 18:49:02,926 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=450216.6666666667, ans=0.2 2023-10-10 18:49:07,261 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.min_positive, batch_count=450216.6666666667, ans=0.025 2023-10-10 18:49:12,278 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.258e+02 1.723e+02 1.977e+02 2.211e+02 3.266e+02, threshold=3.955e+02, percent-clipped=0.0 2023-10-10 18:49:20,383 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=450310.0, ans=0.125 2023-10-10 18:49:24,886 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=450310.0, ans=0.2 2023-10-10 18:49:37,183 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=450356.6666666667, ans=0.09899494936611666 2023-10-10 18:49:41,233 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=450356.6666666667, ans=0.125 2023-10-10 18:49:58,017 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=450450.0, ans=0.0 2023-10-10 18:50:17,891 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 18:50:35,546 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.38 vs. limit=22.5 2023-10-10 18:50:42,894 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=450590.0, ans=0.2 2023-10-10 18:51:01,109 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=450683.3333333333, ans=0.125 2023-10-10 18:51:02,254 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=450683.3333333333, ans=0.125 2023-10-10 18:51:04,016 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=450683.3333333333, ans=0.125 2023-10-10 18:51:04,048 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=450683.3333333333, ans=0.2 2023-10-10 18:51:10,219 INFO [train.py:1031] (3/4) Epoch 8, batch 1000, loss[loss=0.2273, simple_loss=0.3199, pruned_loss=0.06733, over 16862.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.303, pruned_loss=0.06589, over 12918576.87 frames. 
], batch size: 175, lr: 4.71e-03, grad_scale: 32.0 2023-10-10 18:51:12,050 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.379e+02 1.659e+02 1.827e+02 2.112e+02 3.229e+02, threshold=3.654e+02, percent-clipped=0.0 2023-10-10 18:51:13,389 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=450730.0, ans=0.0 2023-10-10 18:51:38,874 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=450823.3333333333, ans=0.2 2023-10-10 18:51:39,845 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=450823.3333333333, ans=0.0 2023-10-10 18:51:39,880 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=450823.3333333333, ans=0.125 2023-10-10 18:52:08,993 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=450963.3333333333, ans=0.2 2023-10-10 18:52:20,112 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.61 vs. limit=22.5 2023-10-10 18:52:22,268 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=451010.0, ans=0.125 2023-10-10 18:52:47,278 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.48 vs. limit=12.0 2023-10-10 18:52:48,819 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=451150.0, ans=0.0 2023-10-10 18:52:55,214 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.97 vs. 
limit=15.0 2023-10-10 18:53:00,981 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.351e+02 1.715e+02 2.001e+02 2.301e+02 3.343e+02, threshold=4.003e+02, percent-clipped=0.0 2023-10-10 18:53:02,211 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=451196.6666666667, ans=0.125 2023-10-10 18:53:08,693 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=451196.6666666667, ans=0.0 2023-10-10 18:53:19,397 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=451243.3333333333, ans=0.07 2023-10-10 18:53:59,700 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=451383.3333333333, ans=0.125 2023-10-10 18:54:02,621 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=451430.0, ans=0.125 2023-10-10 18:54:13,411 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=451476.6666666667, ans=0.2 2023-10-10 18:54:14,438 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=451476.6666666667, ans=0.1 2023-10-10 18:54:35,093 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=451523.3333333333, ans=0.1 2023-10-10 18:54:44,282 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.91 vs. limit=15.0 2023-10-10 18:54:50,902 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=451570.0, ans=0.0 2023-10-10 18:55:05,581 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=451616.6666666667, ans=0.125 2023-10-10 18:55:10,413 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=451663.3333333333, ans=0.2 2023-10-10 18:55:11,044 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.353e+02 1.592e+02 1.748e+02 2.020e+02 3.161e+02, threshold=3.496e+02, percent-clipped=0.0 2023-10-10 18:55:20,382 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.93 vs. 
limit=15.0 2023-10-10 18:55:21,864 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=451710.0, ans=0.125 2023-10-10 18:55:24,126 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=451710.0, ans=0.125 2023-10-10 18:55:32,305 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=451756.6666666667, ans=0.0 2023-10-10 18:56:13,677 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=451896.6666666667, ans=0.125 2023-10-10 18:56:24,170 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=451943.3333333333, ans=0.125 2023-10-10 18:56:30,330 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.16 vs. limit=15.0 2023-10-10 18:56:43,524 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=452036.6666666667, ans=0.125 2023-10-10 18:56:52,900 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.72 vs. limit=15.0 2023-10-10 18:57:10,563 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.251e+02 1.690e+02 1.889e+02 2.045e+02 2.714e+02, threshold=3.777e+02, percent-clipped=0.0 2023-10-10 18:57:22,789 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=452176.6666666667, ans=0.1 2023-10-10 18:57:39,249 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=452223.3333333333, ans=0.1 2023-10-10 18:58:02,030 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=452316.6666666667, ans=0.0 2023-10-10 18:58:04,574 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=452316.6666666667, ans=0.09899494936611666 2023-10-10 18:58:33,381 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=452456.6666666667, ans=0.1 2023-10-10 18:58:34,897 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=452456.6666666667, ans=0.125 2023-10-10 18:59:08,340 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=452596.6666666667, ans=0.125 2023-10-10 18:59:08,397 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=452596.6666666667, ans=0.125 2023-10-10 18:59:09,926 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.393e+02 1.683e+02 1.885e+02 2.096e+02 2.668e+02, threshold=3.770e+02, percent-clipped=0.0 2023-10-10 18:59:12,021 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=452596.6666666667, ans=0.0 2023-10-10 18:59:19,552 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=452643.3333333333, 
ans=0.0 2023-10-10 18:59:20,403 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=452643.3333333333, ans=0.2 2023-10-10 18:59:52,005 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.58 vs. limit=15.0 2023-10-10 18:59:52,758 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.64 vs. limit=22.5 2023-10-10 18:59:55,645 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=452783.3333333333, ans=0.0 2023-10-10 18:59:55,710 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=452783.3333333333, ans=0.125 2023-10-10 19:00:06,950 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=452830.0, ans=0.1 2023-10-10 19:00:07,875 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=452830.0, ans=0.2 2023-10-10 19:00:09,563 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=452830.0, ans=0.0 2023-10-10 19:00:13,097 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=452830.0, ans=0.0 2023-10-10 19:00:25,909 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=452876.6666666667, ans=0.125 2023-10-10 19:00:37,822 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.76 vs. limit=22.5 2023-10-10 19:00:43,673 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=452970.0, ans=10.0 2023-10-10 19:01:03,277 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.26 vs. limit=15.0 2023-10-10 19:01:04,707 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=453016.6666666667, ans=0.2 2023-10-10 19:01:06,225 INFO [train.py:1031] (3/4) Epoch 8, batch 1500, loss[loss=0.2046, simple_loss=0.2934, pruned_loss=0.05787, over 16801.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.3013, pruned_loss=0.06493, over 17312782.89 frames. 
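The loss values printed by train.py:1031 decompose consistently as a weighted sum of the transducer's "simple" (trivial-joiner) loss and the pruned full-joiner loss, with a 0.5 weight on the simple term. The snippet below just verifies that arithmetic against the "Epoch 8, batch 1500" record above; it is a check of the logged numbers, not a quote of train.py's code path.

```python
# loss = 0.5 * simple_loss + pruned_loss reproduces the logged values:
simple_loss, pruned_loss = 0.2934, 0.05787        # batch 1500 record above
print(round(0.5 * simple_loss + pruned_loss, 4))  # 0.2046 == logged loss

tot_simple, tot_pruned = 0.3013, 0.06493          # running (tot_loss) record
print(round(0.5 * tot_simple + tot_pruned, 4))    # 0.2156 == logged tot_loss
```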
], batch size: 87, lr: 4.70e-03, grad_scale: 32.0 2023-10-10 19:01:07,939 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.393e+02 1.640e+02 1.833e+02 2.091e+02 2.968e+02, threshold=3.667e+02, percent-clipped=0.0 2023-10-10 19:01:21,148 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=453110.0, ans=0.125 2023-10-10 19:01:24,982 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=453110.0, ans=0.125 2023-10-10 19:01:40,474 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=453156.6666666667, ans=0.0 2023-10-10 19:01:42,416 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.77 vs. limit=22.5 2023-10-10 19:01:48,700 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=453203.3333333333, ans=0.125 2023-10-10 19:02:05,642 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=453250.0, ans=0.125 2023-10-10 19:02:08,631 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=453296.6666666667, ans=0.125 2023-10-10 19:02:15,056 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.22 vs. limit=10.0 2023-10-10 19:02:42,223 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=453390.0, ans=0.125 2023-10-10 19:02:45,165 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=453436.6666666667, ans=0.125 2023-10-10 19:02:53,167 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=453436.6666666667, ans=10.0 2023-10-10 19:02:55,136 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.67 vs. limit=15.0 2023-10-10 19:03:08,848 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=6.53 vs. 
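The scaling.py:979 Whitening records compare a per-module "metric" against a whitening limit (the `whitening_limit` entries elsewhere in the log are the scheduled limits themselves). The metric measures how far the activation covariance is from a multiple of the identity: it equals 1.0 for perfectly white features and grows with eigenvalue spread. The sketch below paraphrases that idea under the stated assumption of a single group; it is not a line-for-line copy of icefall's scaling.py.

```python
import torch

# Hedged sketch of a whitening metric: the ratio of the mean squared
# eigenvalue of the activation covariance to the squared mean eigenvalue,
# computed without an eigendecomposition via trace identities.
def whitening_metric(x: torch.Tensor) -> float:
    # x: (num_frames, num_channels) activations
    x = x - x.mean(dim=0)
    cov = (x.t() @ x) / x.shape[0]        # (C, C) covariance
    c = cov.shape[0]
    # trace(cov @ cov) == sum of squared eigenvalues for symmetric cov
    return (c * (cov * cov).sum() / cov.trace() ** 2).item()

x = torch.randn(1000, 256)
print(whitening_metric(x))  # near 1.0, pushed slightly above by sampling noise
```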
limit=15.0 2023-10-10 19:03:10,210 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.401e+02 1.760e+02 1.969e+02 2.327e+02 4.256e+02, threshold=3.939e+02, percent-clipped=4.0 2023-10-10 19:03:45,932 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=453670.0, ans=0.125 2023-10-10 19:03:52,350 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=453716.6666666667, ans=0.125 2023-10-10 19:03:55,019 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=453716.6666666667, ans=0.2 2023-10-10 19:04:17,358 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=453763.3333333333, ans=0.0 2023-10-10 19:04:19,744 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=453810.0, ans=0.125 2023-10-10 19:04:42,061 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=453856.6666666667, ans=0.0 2023-10-10 19:04:57,133 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=453950.0, ans=0.1 2023-10-10 19:05:10,008 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.395e+02 1.685e+02 1.921e+02 2.100e+02 3.249e+02, threshold=3.843e+02, percent-clipped=0.0 2023-10-10 19:05:18,726 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=453996.6666666667, ans=0.07 2023-10-10 19:05:35,348 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=454090.0, ans=0.125 2023-10-10 19:05:41,583 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=454090.0, ans=0.125 2023-10-10 19:05:56,582 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=454183.3333333333, ans=0.125 2023-10-10 19:06:00,153 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=454183.3333333333, ans=0.125 2023-10-10 19:06:00,456 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=454183.3333333333, ans=0.125 2023-10-10 19:06:09,316 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.98 vs. 
limit=15.0 2023-10-10 19:06:14,108 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=454230.0, ans=0.125 2023-10-10 19:07:11,688 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.225e+02 1.617e+02 1.760e+02 1.975e+02 2.464e+02, threshold=3.520e+02, percent-clipped=0.0 2023-10-10 19:07:19,117 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=454463.3333333333, ans=0.07 2023-10-10 19:07:36,046 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=454556.6666666667, ans=0.0 2023-10-10 19:08:23,417 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.98 vs. limit=15.0 2023-10-10 19:08:42,825 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=454836.6666666667, ans=0.5 2023-10-10 19:09:08,101 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.379e+02 1.667e+02 1.918e+02 2.165e+02 3.283e+02, threshold=3.836e+02, percent-clipped=0.0 2023-10-10 19:09:12,345 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=454930.0, ans=0.0 2023-10-10 19:09:13,523 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=454930.0, ans=0.125 2023-10-10 19:09:31,594 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=455023.3333333333, ans=0.0 2023-10-10 19:09:33,430 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.17 vs. limit=22.5 2023-10-10 19:09:34,615 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.39 vs. limit=15.0 2023-10-10 19:09:41,874 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=455070.0, ans=0.5 2023-10-10 19:10:06,126 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=455116.6666666667, ans=0.0 2023-10-10 19:10:11,542 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.26 vs. limit=15.0 2023-10-10 19:10:12,667 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=455163.3333333333, ans=0.0 2023-10-10 19:10:12,826 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=455163.3333333333, ans=0.2 2023-10-10 19:10:16,721 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.06 vs. 
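The many *.balancer.* entries schedule the constraints a Balancer imposes on activation statistics: bounds on the per-channel fraction of positive values (min_positive/max_positive) and on the mean absolute value (min_abs/max_abs, e.g. the min_abs=0.5 entry above), applied stochastically with the scheduled prob (0.125 at this point). The sketch below only computes the statistics being constrained; the actual enforcement, which modifies gradients in the backward pass, is deliberately omitted.

```python
import torch

# Per-channel statistics a Balancer keeps inside its scheduled bounds.
# Gradient-side enforcement is omitted; this shows only what is measured.
def balancer_stats(x: torch.Tensor):
    # x: (num_frames, num_channels)
    frac_positive = (x > 0).float().mean(dim=0)  # vs. min/max_positive
    mean_abs = x.abs().mean(dim=0)               # vs. min_abs / max_abs
    return frac_positive, mean_abs

frac, mean_abs = balancer_stats(torch.randn(1000, 256))
```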
limit=10.0 2023-10-10 19:10:30,128 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=455210.0, ans=0.1 2023-10-10 19:10:36,045 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=455210.0, ans=0.1 2023-10-10 19:10:36,475 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.18 vs. limit=15.0 2023-10-10 19:10:55,676 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.27 vs. limit=15.0 2023-10-10 19:11:15,482 INFO [train.py:1031] (3/4) Epoch 8, batch 2000, loss[loss=0.2117, simple_loss=0.3059, pruned_loss=0.05877, over 16846.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.3014, pruned_loss=0.06453, over 20742873.69 frames. ], batch size: 188, lr: 4.68e-03, grad_scale: 32.0 2023-10-10 19:11:18,468 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.74 vs. limit=15.0 2023-10-10 19:11:18,676 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.344e+02 1.641e+02 1.856e+02 2.042e+02 2.982e+02, threshold=3.712e+02, percent-clipped=0.0 2023-10-10 19:11:22,353 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=455396.6666666667, ans=0.2 2023-10-10 19:11:28,284 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=455443.3333333333, ans=0.125 2023-10-10 19:11:30,959 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=455443.3333333333, ans=0.125 2023-10-10 19:11:37,042 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=455443.3333333333, ans=0.125 2023-10-10 19:11:37,196 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=455443.3333333333, ans=0.125 2023-10-10 19:11:52,342 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=455490.0, ans=0.125 2023-10-10 19:12:09,088 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=455536.6666666667, ans=0.2 2023-10-10 19:12:13,982 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=455583.3333333333, ans=0.0 2023-10-10 19:12:20,256 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.28 vs. 
limit=15.0 2023-10-10 19:12:20,876 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=455583.3333333333, ans=0.125 2023-10-10 19:12:29,532 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=2.516e-03 2023-10-10 19:12:42,186 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=455676.6666666667, ans=0.1 2023-10-10 19:13:30,911 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.297e+02 1.670e+02 1.803e+02 1.969e+02 2.816e+02, threshold=3.607e+02, percent-clipped=0.0 2023-10-10 19:13:45,712 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=455863.3333333333, ans=0.125 2023-10-10 19:13:57,408 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=455910.0, ans=0.1 2023-10-10 19:14:14,514 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 19:14:18,598 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=455956.6666666667, ans=0.1 2023-10-10 19:14:21,496 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=455956.6666666667, ans=0.2 2023-10-10 19:14:37,776 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=456003.3333333333, ans=0.0 2023-10-10 19:14:41,854 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=456050.0, ans=0.125 2023-10-10 19:14:46,855 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=456050.0, ans=0.125 2023-10-10 19:14:50,839 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=456050.0, ans=0.0 2023-10-10 19:15:34,451 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.29 vs. limit=15.0 2023-10-10 19:15:45,791 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=456283.3333333333, ans=0.0 2023-10-10 19:15:56,554 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.517e+02 1.952e+02 2.217e+02 2.638e+02 3.557e+02, threshold=4.434e+02, percent-clipped=0.0 2023-10-10 19:16:00,573 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.61 vs. limit=6.0 2023-10-10 19:16:20,972 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=456423.3333333333, ans=0.125 2023-10-10 19:17:02,235 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.97 vs. 
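The scaling.py:1069 WithLoss records appear to report the accumulated value of a small auxiliary penalty attached to the named attention weights; a loss-sum of 0.000e+00 simply means the penalty contributed nothing over the reporting window. A schematic of folding such a side loss into the objective follows; the penalty expression and weight are stand-ins, not the criterion scaling.py actually uses.

```python
import torch

# Schematic "with-loss" hook: an auxiliary scalar computed from attention
# weights is added to the objective, and its running sum is what the
# WithLoss lines print. The penalty below is a placeholder.
running_sum = 0.0
attn = torch.softmax(torch.randn(8, 100, 100), dim=-1)
aux_loss = attn.pow(2).mean() * 1e-4  # stand-in auxiliary penalty
running_sum += float(aux_loss)
print(f"loss-sum={running_sum:.3e}")  # mirrors the log format
```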
limit=15.0 2023-10-10 19:17:13,385 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=456656.6666666667, ans=0.125 2023-10-10 19:17:26,636 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.17 vs. limit=15.0 2023-10-10 19:17:42,464 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=456750.0, ans=0.1 2023-10-10 19:17:44,894 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.35 vs. limit=15.0 2023-10-10 19:17:48,464 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=456796.6666666667, ans=0.0 2023-10-10 19:17:49,609 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.351e+02 1.777e+02 1.902e+02 2.152e+02 3.407e+02, threshold=3.804e+02, percent-clipped=0.0 2023-10-10 19:18:24,239 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=456936.6666666667, ans=0.1 2023-10-10 19:18:26,779 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=456936.6666666667, ans=0.1 2023-10-10 19:18:37,413 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=456983.3333333333, ans=0.035 2023-10-10 19:18:41,068 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=456983.3333333333, ans=0.125 2023-10-10 19:19:01,763 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.62 vs. limit=15.0 2023-10-10 19:19:07,639 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=457076.6666666667, ans=0.0 2023-10-10 19:19:15,689 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=457123.3333333333, ans=0.125 2023-10-10 19:19:18,992 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=457123.3333333333, ans=10.0 2023-10-10 19:19:33,871 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=457216.6666666667, ans=0.1 2023-10-10 19:19:36,686 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.96 vs. limit=15.0 2023-10-10 19:19:36,866 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.64 vs. 
limit=22.5 2023-10-10 19:19:45,735 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.453e+02 1.734e+02 1.937e+02 2.104e+02 2.901e+02, threshold=3.874e+02, percent-clipped=0.0 2023-10-10 19:19:54,366 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=457310.0, ans=0.07 2023-10-10 19:20:26,136 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=457450.0, ans=0.125 2023-10-10 19:20:30,951 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=457450.0, ans=0.125 2023-10-10 19:20:34,461 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=457496.6666666667, ans=0.125 2023-10-10 19:20:44,136 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=457496.6666666667, ans=0.125 2023-10-10 19:20:49,741 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=457543.3333333333, ans=0.0 2023-10-10 19:20:56,865 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=457590.0, ans=0.0 2023-10-10 19:21:02,504 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=457590.0, ans=0.015 2023-10-10 19:21:27,644 INFO [train.py:1031] (3/4) Epoch 8, batch 2500, loss[loss=0.1953, simple_loss=0.2884, pruned_loss=0.05112, over 16869.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.3016, pruned_loss=0.06478, over 23408113.74 frames. 
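tot_loss is a frame-weighted running aggregate: its "over N frames" counter grows through the epoch (17.3M frames at batch 1500, 23.4M at batch 2500), so tot_loss moves slowly relative to the noisy per-batch loss. A minimal frame-weighted accumulator in that spirit is sketched below; icefall's tracker additionally applies a periodic forgetting factor to old batches, which this sketch omits.

```python
# Minimal frame-weighted running loss; the periodic decay of old statistics
# used in the actual training code is omitted for brevity.
class RunningLoss:
    def __init__(self):
        self.frames = 0.0
        self.weighted = 0.0

    def update(self, loss: float, num_frames: float) -> None:
        self.frames += num_frames
        self.weighted += loss * num_frames

    @property
    def tot_loss(self) -> float:
        return self.weighted / self.frames

tracker = RunningLoss()
tracker.update(0.2046, 16801.0)  # per-batch values from the records above
print(tracker.tot_loss)
```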
], batch size: 98, lr: 4.67e-03, grad_scale: 16.0 2023-10-10 19:21:32,316 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.402e+02 1.734e+02 1.898e+02 2.162e+02 2.982e+02, threshold=3.797e+02, percent-clipped=0.0 2023-10-10 19:21:55,603 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=457823.3333333333, ans=0.95 2023-10-10 19:22:00,368 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=457870.0, ans=0.125 2023-10-10 19:22:04,093 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=457870.0, ans=0.125 2023-10-10 19:22:32,670 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=458010.0, ans=0.125 2023-10-10 19:22:49,948 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=458056.6666666667, ans=0.125 2023-10-10 19:22:53,578 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=458103.3333333333, ans=0.125 2023-10-10 19:23:21,102 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.386e+02 1.690e+02 1.868e+02 2.066e+02 2.868e+02, threshold=3.736e+02, percent-clipped=0.0 2023-10-10 19:23:35,916 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=458243.3333333333, ans=0.125 2023-10-10 19:23:43,261 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=458290.0, ans=0.1 2023-10-10 19:24:09,154 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.39 vs. limit=15.0 2023-10-10 19:24:10,683 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=458383.3333333333, ans=0.1 2023-10-10 19:24:15,354 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=458430.0, ans=0.125 2023-10-10 19:24:44,563 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.05 vs. limit=12.0 2023-10-10 19:24:55,825 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=12.81 vs. limit=22.5 2023-10-10 19:25:04,199 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=458616.6666666667, ans=0.125 2023-10-10 19:25:07,085 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.74 vs. 
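grad_scale in the train.py records is the dynamic mixed-precision loss scale, and its power-of-two moves across this section (32.0 at batches 1500 and 2000, 16.0 at batch 2500, 32.0 again at batch 3000) are the usual halve-on-overflow, double-after-stable-steps behaviour. The standard PyTorch pattern looks like the following; this is a generic sketch of that mechanism, not a quote of train.py.

```python
import torch

# Generic dynamic loss-scaling loop; the scaler's current scale is the
# value the log reports as grad_scale.
scaler = torch.cuda.amp.GradScaler(enabled=torch.cuda.is_available())

def train_step(model, optimizer, batch, compute_loss):
    optimizer.zero_grad()
    with torch.autocast("cuda", enabled=torch.cuda.is_available()):
        loss = compute_loss(model, batch)
    scaler.scale(loss).backward()   # scale up to keep fp16 grads representable
    scaler.step(optimizer)          # unscales, skips the step on overflow
    scaler.update()                 # halves or doubles the scale as needed
    return scaler.get_scale()       # the value logged as grad_scale
```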
limit=15.0 2023-10-10 19:25:18,655 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.378e+02 1.695e+02 1.814e+02 2.017e+02 2.886e+02, threshold=3.629e+02, percent-clipped=0.0 2023-10-10 19:25:24,898 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=458663.3333333333, ans=0.0 2023-10-10 19:25:39,607 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=458756.6666666667, ans=0.2 2023-10-10 19:26:02,665 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.19 vs. limit=15.0 2023-10-10 19:26:05,755 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=458850.0, ans=0.0 2023-10-10 19:26:15,696 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=9.75 vs. limit=22.5 2023-10-10 19:26:34,556 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=458943.3333333333, ans=0.1 2023-10-10 19:26:34,588 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=458943.3333333333, ans=0.125 2023-10-10 19:26:34,680 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=458943.3333333333, ans=0.125 2023-10-10 19:26:38,108 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=458943.3333333333, ans=0.1 2023-10-10 19:26:41,122 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.34 vs. limit=15.0 2023-10-10 19:26:41,641 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=458990.0, ans=0.125 2023-10-10 19:26:41,723 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=458990.0, ans=0.09899494936611666 2023-10-10 19:26:45,210 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.58 vs. limit=12.0 2023-10-10 19:27:00,077 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=459036.6666666667, ans=0.0 2023-10-10 19:27:19,335 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten.whitening_limit, batch_count=459130.0, ans=15.0 2023-10-10 19:27:19,968 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=459130.0, ans=0.0 2023-10-10 19:27:21,653 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.362e+02 1.668e+02 1.872e+02 2.019e+02 2.775e+02, threshold=3.743e+02, percent-clipped=0.0 2023-10-10 19:27:27,639 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.74 vs. 
limit=15.0 2023-10-10 19:27:53,282 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=459223.3333333333, ans=0.125 2023-10-10 19:28:05,850 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=459270.0, ans=0.0 2023-10-10 19:28:16,186 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.15 vs. limit=15.0 2023-10-10 19:28:22,280 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.78 vs. limit=15.0 2023-10-10 19:28:28,015 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=459363.3333333333, ans=0.2 2023-10-10 19:28:59,124 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=459456.6666666667, ans=0.125 2023-10-10 19:29:04,075 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=459503.3333333333, ans=0.125 2023-10-10 19:29:05,447 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=459503.3333333333, ans=0.1 2023-10-10 19:29:13,003 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=459550.0, ans=0.125 2023-10-10 19:29:17,424 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.13 vs. limit=15.0 2023-10-10 19:29:19,713 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=459550.0, ans=0.125 2023-10-10 19:29:28,624 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=459596.6666666667, ans=0.04949747468305833 2023-10-10 19:29:33,453 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.391e+02 1.714e+02 1.862e+02 2.133e+02 3.067e+02, threshold=3.724e+02, percent-clipped=0.0 2023-10-10 19:29:44,587 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=459643.3333333333, ans=0.0 2023-10-10 19:29:44,664 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=459643.3333333333, ans=0.125 2023-10-10 19:29:47,868 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=459643.3333333333, ans=0.125 2023-10-10 19:29:50,652 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=459643.3333333333, ans=0.125 2023-10-10 19:30:09,685 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.20 vs. 
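The bypass entries describe zipformer's residual bypass: a layer's output is blended with its input as y = x + scale * (layer(x) - x), with the learned scale clamped to at least scale_min (0.2 at this point in the schedule), and the layer is skipped outright with probability skip_rate during training (the ~0.05-0.1 values above). A simplified sketch follows; per-channel scales and the exact clamping details are elided.

```python
import torch

# Simplified bypass: blend a layer's output with its input, clamping the
# learned blend scale to [scale_min, 1.0]; with prob skip_rate, skip the
# layer entirely (a stochastic-depth-style regularizer).
def bypass(layer, x, scale: torch.Tensor, scale_min: float = 0.2,
           skip_rate: float = 0.05, training: bool = True):
    if training and torch.rand(()) < skip_rate:
        return x                          # stochastic layer skip
    s = scale.clamp(min=scale_min, max=1.0)
    return x + s * (layer(x) - x)         # s=0: pure bypass, s=1: full layer
```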
limit=15.0 2023-10-10 19:30:34,591 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=459830.0, ans=0.125 2023-10-10 19:30:46,812 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=459876.6666666667, ans=0.0 2023-10-10 19:30:51,216 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.17 vs. limit=15.0 2023-10-10 19:31:05,234 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=459970.0, ans=0.125 2023-10-10 19:31:19,171 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=460016.6666666667, ans=0.0 2023-10-10 19:31:25,038 INFO [train.py:1031] (3/4) Epoch 8, batch 3000, loss[loss=0.2001, simple_loss=0.2889, pruned_loss=0.05566, over 16757.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.3005, pruned_loss=0.06442, over 25480353.17 frames. ], batch size: 81, lr: 4.66e-03, grad_scale: 32.0 2023-10-10 19:31:31,105 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.376e+02 1.620e+02 1.833e+02 2.039e+02 2.693e+02, threshold=3.667e+02, percent-clipped=0.0 2023-10-10 19:32:06,515 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.45 vs. limit=15.0 2023-10-10 19:32:14,872 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=460250.0, ans=0.2 2023-10-10 19:32:19,287 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=460250.0, ans=0.05 2023-10-10 19:32:20,285 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.02 vs. limit=15.0 2023-10-10 19:32:28,482 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.82 vs. 
limit=6.0 2023-10-10 19:32:33,722 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=460343.3333333333, ans=0.125 2023-10-10 19:32:59,864 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=460436.6666666667, ans=0.125 2023-10-10 19:33:14,444 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=460483.3333333333, ans=0.125 2023-10-10 19:33:16,401 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=460483.3333333333, ans=0.125 2023-10-10 19:33:28,679 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.256e+02 1.699e+02 1.981e+02 2.265e+02 3.339e+02, threshold=3.961e+02, percent-clipped=0.0 2023-10-10 19:33:36,797 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=460576.6666666667, ans=0.025 2023-10-10 19:33:56,818 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=460623.3333333333, ans=0.0 2023-10-10 19:34:00,398 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=460623.3333333333, ans=0.125 2023-10-10 19:34:11,877 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten.whitening_limit, batch_count=460670.0, ans=15.0 2023-10-10 19:34:24,071 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 19:34:44,442 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.75 vs. limit=22.5 2023-10-10 19:34:47,945 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.76 vs. limit=15.0 2023-10-10 19:35:09,826 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.21 vs. limit=22.5 2023-10-10 19:35:29,894 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.285e+02 1.634e+02 1.811e+02 2.162e+02 2.975e+02, threshold=3.623e+02, percent-clipped=0.0 2023-10-10 19:35:52,141 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=461090.0, ans=0.0 2023-10-10 19:35:53,700 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=461090.0, ans=0.125 2023-10-10 19:36:03,431 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.16 vs. 
limit=15.0 2023-10-10 19:37:38,994 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=461463.3333333333, ans=0.125 2023-10-10 19:37:39,594 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.344e+02 1.664e+02 1.777e+02 1.978e+02 2.612e+02, threshold=3.555e+02, percent-clipped=0.0 2023-10-10 19:37:42,457 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 19:37:44,797 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=461463.3333333333, ans=0.1 2023-10-10 19:38:12,378 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=461603.3333333333, ans=0.125 2023-10-10 19:38:20,072 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=461603.3333333333, ans=0.125 2023-10-10 19:38:43,164 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=461696.6666666667, ans=0.0 2023-10-10 19:38:45,358 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=461743.3333333333, ans=0.125 2023-10-10 19:38:46,405 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=461743.3333333333, ans=0.125 2023-10-10 19:38:55,993 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.08 vs. limit=12.0 2023-10-10 19:39:15,361 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.83 vs. 
limit=6.0 2023-10-10 19:39:24,218 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=461883.3333333333, ans=0.125 2023-10-10 19:39:24,288 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=461883.3333333333, ans=0.1 2023-10-10 19:39:30,812 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=461883.3333333333, ans=0.0 2023-10-10 19:39:39,484 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.400e+02 1.708e+02 1.969e+02 2.305e+02 3.490e+02, threshold=3.938e+02, percent-clipped=0.0 2023-10-10 19:39:39,849 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=461930.0, ans=0.125 2023-10-10 19:39:40,820 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=461930.0, ans=0.125 2023-10-10 19:40:20,544 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=462116.6666666667, ans=0.125 2023-10-10 19:40:40,368 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=462210.0, ans=0.125 2023-10-10 19:41:08,823 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=462303.3333333333, ans=0.1 2023-10-10 19:41:12,378 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=462303.3333333333, ans=0.125 2023-10-10 19:41:24,302 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.23 vs. limit=15.0 2023-10-10 19:41:26,612 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=462350.0, ans=0.125 2023-10-10 19:41:28,828 INFO [train.py:1031] (3/4) Epoch 8, batch 3500, loss[loss=0.2238, simple_loss=0.303, pruned_loss=0.07229, over 16929.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.3003, pruned_loss=0.06433, over 27112097.35 frames. ], batch size: 77, lr: 4.65e-03, grad_scale: 32.0 2023-10-10 19:41:33,669 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.365e+02 1.706e+02 1.905e+02 2.085e+02 2.469e+02, threshold=3.810e+02, percent-clipped=0.0 2023-10-10 19:41:38,452 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=462396.6666666667, ans=0.1 2023-10-10 19:41:38,613 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.49 vs. 
limit=22.5 2023-10-10 19:42:02,658 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=462536.6666666667, ans=0.125 2023-10-10 19:42:11,114 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer_na.min_abs, batch_count=462536.6666666667, ans=0.02 2023-10-10 19:42:18,385 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=462583.3333333333, ans=0.035 2023-10-10 19:42:22,165 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.10 vs. limit=6.0 2023-10-10 19:42:31,489 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=462630.0, ans=0.125 2023-10-10 19:42:35,003 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 19:42:41,975 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=462676.6666666667, ans=0.125 2023-10-10 19:43:35,917 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=462863.3333333333, ans=0.07 2023-10-10 19:43:36,629 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.335e+02 1.692e+02 1.904e+02 2.331e+02 3.234e+02, threshold=3.809e+02, percent-clipped=0.0 2023-10-10 19:43:40,393 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=462863.3333333333, ans=0.2 2023-10-10 19:44:02,989 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=462956.6666666667, ans=0.125 2023-10-10 19:44:07,592 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=463003.3333333333, ans=0.0 2023-10-10 19:44:18,658 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=463003.3333333333, ans=0.1 2023-10-10 19:44:37,133 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=463096.6666666667, ans=0.1 2023-10-10 19:44:55,442 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.20 vs. 
limit=15.0 2023-10-10 19:45:06,784 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=463190.0, ans=0.2 2023-10-10 19:45:39,378 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.309e+02 1.582e+02 1.768e+02 1.979e+02 2.564e+02, threshold=3.535e+02, percent-clipped=0.0 2023-10-10 19:46:38,242 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=463563.3333333333, ans=0.125 2023-10-10 19:46:41,961 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=463563.3333333333, ans=0.125 2023-10-10 19:46:55,006 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=463610.0, ans=0.1 2023-10-10 19:47:17,262 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=463703.3333333333, ans=0.125 2023-10-10 19:47:42,958 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.372e+02 1.639e+02 1.900e+02 2.227e+02 3.211e+02, threshold=3.801e+02, percent-clipped=0.0 2023-10-10 19:47:48,155 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 19:47:50,756 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=463843.3333333333, ans=0.1 2023-10-10 19:48:15,741 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=463936.6666666667, ans=0.1 2023-10-10 19:48:27,949 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=463983.3333333333, ans=0.0 2023-10-10 19:48:29,806 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=463983.3333333333, ans=0.1 2023-10-10 19:48:33,023 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.00 vs. 
limit=15.0 2023-10-10 19:48:34,441 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=463983.3333333333, ans=0.125 2023-10-10 19:48:38,887 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=464030.0, ans=0.125 2023-10-10 19:48:43,410 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 19:48:56,930 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=464123.3333333333, ans=0.1 2023-10-10 19:49:29,456 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=464216.6666666667, ans=0.1 2023-10-10 19:49:37,138 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.331e+02 1.681e+02 1.880e+02 2.265e+02 3.157e+02, threshold=3.760e+02, percent-clipped=0.0 2023-10-10 19:49:37,331 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=464263.3333333333, ans=0.125 2023-10-10 19:50:03,100 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=464356.6666666667, ans=0.125 2023-10-10 19:50:03,243 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=464356.6666666667, ans=0.1 2023-10-10 19:50:04,409 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.60 vs. limit=22.5 2023-10-10 19:50:07,809 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=464356.6666666667, ans=0.125 2023-10-10 19:50:25,019 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.51 vs. limit=15.0 2023-10-10 19:50:29,886 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=464496.6666666667, ans=0.125 2023-10-10 19:50:33,116 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.18 vs. limit=15.0 2023-10-10 19:50:36,967 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=464496.6666666667, ans=0.125 2023-10-10 19:51:00,847 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=464590.0, ans=0.0 2023-10-10 19:51:16,313 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=464683.3333333333, ans=0.125 2023-10-10 19:51:27,482 INFO [train.py:1031] (3/4) Epoch 8, batch 4000, loss[loss=0.2324, simple_loss=0.3234, pruned_loss=0.07069, over 16853.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2998, pruned_loss=0.06434, over 28348199.67 frames. 
], batch size: 146, lr: 4.64e-03, grad_scale: 32.0 2023-10-10 19:51:28,146 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=464730.0, ans=0.1 2023-10-10 19:51:28,395 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.04 vs. limit=22.5 2023-10-10 19:51:34,407 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.298e+02 1.704e+02 1.889e+02 2.167e+02 2.855e+02, threshold=3.779e+02, percent-clipped=0.0 2023-10-10 19:51:45,116 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.46 vs. limit=15.0 2023-10-10 19:51:45,151 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.45 vs. limit=10.0 2023-10-10 19:51:57,561 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=464823.3333333333, ans=0.0 2023-10-10 19:52:00,231 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.73 vs. limit=22.5 2023-10-10 19:52:06,139 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=464870.0, ans=0.0 2023-10-10 19:52:07,949 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=464870.0, ans=0.125 2023-10-10 19:52:25,186 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=464916.6666666667, ans=0.0 2023-10-10 19:52:46,890 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=465010.0, ans=0.0 2023-10-10 19:53:15,758 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=465150.0, ans=0.125 2023-10-10 19:53:20,618 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=465150.0, ans=0.0 2023-10-10 19:53:25,665 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=465150.0, ans=0.0 2023-10-10 19:53:31,139 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=465196.6666666667, ans=0.125 2023-10-10 19:53:32,629 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.464e+02 1.789e+02 1.948e+02 2.277e+02 3.455e+02, threshold=3.896e+02, percent-clipped=0.0 2023-10-10 19:53:47,460 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=14.88 vs. limit=22.5 2023-10-10 19:54:12,662 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=465336.6666666667, ans=0.125 2023-10-10 19:54:13,926 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.13 vs. 
limit=15.0 2023-10-10 19:54:54,947 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=465476.6666666667, ans=0.0 2023-10-10 19:55:22,730 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=465570.0, ans=0.1 2023-10-10 19:55:45,562 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=465663.3333333333, ans=0.1 2023-10-10 19:55:46,223 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.458e+02 1.624e+02 1.791e+02 1.975e+02 2.839e+02, threshold=3.581e+02, percent-clipped=0.0 2023-10-10 19:56:16,318 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=465756.6666666667, ans=0.125 2023-10-10 19:56:32,535 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=465850.0, ans=0.125 2023-10-10 19:56:39,594 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=465850.0, ans=0.125 2023-10-10 19:56:50,051 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=465896.6666666667, ans=0.1 2023-10-10 19:56:56,346 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=465943.3333333333, ans=0.2 2023-10-10 19:56:56,382 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=465943.3333333333, ans=0.0 2023-10-10 19:57:27,903 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.07 vs. limit=15.0 2023-10-10 19:57:29,226 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=466083.3333333333, ans=0.125 2023-10-10 19:57:44,207 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.449e+02 1.753e+02 1.904e+02 2.130e+02 3.789e+02, threshold=3.808e+02, percent-clipped=1.0 2023-10-10 19:57:44,529 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=466130.0, ans=0.2 2023-10-10 19:57:49,366 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=466176.6666666667, ans=0.1 2023-10-10 19:58:05,092 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=466223.3333333333, ans=0.2 2023-10-10 19:58:12,992 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=466270.0, ans=0.125 2023-10-10 19:58:32,279 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=466316.6666666667, ans=0.125 2023-10-10 19:58:43,665 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=466363.3333333333, ans=0.0 2023-10-10 19:58:44,175 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.88 vs. 
limit=10.0 2023-10-10 19:58:47,245 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=5.157e-03 2023-10-10 19:59:05,502 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=466456.6666666667, ans=0.0 2023-10-10 19:59:21,333 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=466503.3333333333, ans=0.125 2023-10-10 19:59:40,044 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=466596.6666666667, ans=0.035 2023-10-10 19:59:41,645 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.496e+02 1.749e+02 1.937e+02 2.197e+02 3.077e+02, threshold=3.875e+02, percent-clipped=0.0 2023-10-10 19:59:47,673 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.08 vs. limit=15.0 2023-10-10 20:00:03,849 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=466690.0, ans=0.125 2023-10-10 20:00:44,870 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 20:00:55,334 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=466876.6666666667, ans=0.125 2023-10-10 20:00:55,574 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.84 vs. limit=15.0 2023-10-10 20:00:59,007 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=466876.6666666667, ans=10.0 2023-10-10 20:01:03,541 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.38 vs. limit=15.0 2023-10-10 20:01:10,141 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.57 vs. limit=15.0 2023-10-10 20:01:13,421 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=466923.3333333333, ans=0.125 2023-10-10 20:01:14,683 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.32 vs. limit=15.0 2023-10-10 20:01:38,981 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=467016.6666666667, ans=0.1 2023-10-10 20:01:45,080 INFO [train.py:1031] (3/4) Epoch 8, batch 4500, loss[loss=0.2015, simple_loss=0.294, pruned_loss=0.0545, over 16941.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.3001, pruned_loss=0.06435, over 29305781.94 frames. ], batch size: 165, lr: 4.63e-03, grad_scale: 32.0 2023-10-10 20:01:45,624 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.14 vs. 
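Batch sizes in the train.py records swing widely (87 at batch 1500, 188 at batch 2000, 77 at batch 3500, 165 at batch 4500) because the sampler packs a fixed budget of audio seconds per batch rather than a fixed number of utterances: buckets of short cuts yield large batches, buckets of long cuts small ones. A toy packing loop shows the effect; the 700 s budget below is illustrative.

```python
# Toy duration-based batching: pack cuts until a seconds budget is hit.
# A fixed duration cap with bucketed cut lengths is what produces the
# varying "batch size" values in the log.
def pack_batches(durations, max_duration=700.0):
    batch, total, batches = [], 0.0, []
    for d in durations:
        if batch and total + d > max_duration:
            batches.append(batch)
            batch, total = [], 0.0
        batch.append(d)
        total += d
    if batch:
        batches.append(batch)
    return batches

print([len(b) for b in pack_batches([3.5] * 400)])  # short cuts -> big batches
```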
limit=22.5 2023-10-10 20:01:51,219 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.379e+02 1.700e+02 1.957e+02 2.302e+02 3.915e+02, threshold=3.913e+02, percent-clipped=1.0 2023-10-10 20:01:51,560 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=467063.3333333333, ans=0.125 2023-10-10 20:02:13,760 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=467156.6666666667, ans=0.09899494936611666 2023-10-10 20:03:12,845 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.49 vs. limit=10.0 2023-10-10 20:03:21,984 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=467436.6666666667, ans=0.125 2023-10-10 20:03:25,567 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=467436.6666666667, ans=0.05 2023-10-10 20:03:25,729 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=467436.6666666667, ans=0.125 2023-10-10 20:03:46,942 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.385e+02 1.721e+02 1.879e+02 2.167e+02 3.100e+02, threshold=3.759e+02, percent-clipped=0.0 2023-10-10 20:03:48,362 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=467530.0, ans=0.125 2023-10-10 20:03:55,079 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=467576.6666666667, ans=0.1 2023-10-10 20:03:55,088 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=467576.6666666667, ans=0.125 2023-10-10 20:04:16,939 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=467670.0, ans=0.05 2023-10-10 20:04:19,422 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=467670.0, ans=0.0 2023-10-10 20:04:21,453 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=467670.0, ans=0.125 2023-10-10 20:04:24,987 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=467716.6666666667, ans=0.0 2023-10-10 20:04:26,270 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.93 vs. 
limit=22.5 2023-10-10 20:04:27,082 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=467716.6666666667, ans=0.2 2023-10-10 20:04:41,163 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=467763.3333333333, ans=0.0 2023-10-10 20:04:42,920 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=467763.3333333333, ans=0.2 2023-10-10 20:05:19,215 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=467903.3333333333, ans=0.125 2023-10-10 20:05:25,513 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.45 vs. limit=22.5 2023-10-10 20:05:38,842 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.325e+02 1.753e+02 1.992e+02 2.241e+02 3.342e+02, threshold=3.984e+02, percent-clipped=0.0 2023-10-10 20:06:00,050 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=468090.0, ans=0.0 2023-10-10 20:06:02,889 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.34 vs. limit=15.0 2023-10-10 20:06:14,701 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=468136.6666666667, ans=0.0 2023-10-10 20:06:25,392 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=468183.3333333333, ans=0.1 2023-10-10 20:06:35,961 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=468230.0, ans=0.2 2023-10-10 20:06:40,783 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=468276.6666666667, ans=0.0 2023-10-10 20:06:42,775 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=468276.6666666667, ans=0.0 2023-10-10 20:06:50,851 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=468323.3333333333, ans=0.1 2023-10-10 20:07:09,389 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=468370.0, ans=0.1 2023-10-10 20:07:28,218 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.375e+02 1.693e+02 1.861e+02 2.106e+02 3.160e+02, threshold=3.721e+02, percent-clipped=0.0 2023-10-10 20:07:56,030 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=468556.6666666667, ans=0.035 2023-10-10 20:08:06,764 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.25 vs. 
limit=22.5 2023-10-10 20:08:07,520 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=468603.3333333333, ans=0.125 2023-10-10 20:08:11,782 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=468603.3333333333, ans=0.0 2023-10-10 20:08:11,786 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=468603.3333333333, ans=0.125 2023-10-10 20:08:25,859 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 20:09:11,378 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=468883.3333333333, ans=0.1 2023-10-10 20:09:11,567 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.59 vs. limit=22.5 2023-10-10 20:09:26,475 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.432e+02 1.708e+02 1.895e+02 2.121e+02 3.347e+02, threshold=3.789e+02, percent-clipped=0.0 2023-10-10 20:09:37,473 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=468976.6666666667, ans=0.1 2023-10-10 20:09:40,451 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.23 vs. limit=6.0 2023-10-10 20:09:49,857 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=469023.3333333333, ans=0.125 2023-10-10 20:10:11,751 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=469070.0, ans=0.1 2023-10-10 20:10:55,892 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=469256.6666666667, ans=0.0 2023-10-10 20:11:06,876 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=469303.3333333333, ans=0.125 2023-10-10 20:11:19,033 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=469350.0, ans=0.09899494936611666 2023-10-10 20:11:19,827 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=469350.0, ans=0.125 2023-10-10 20:11:21,585 INFO [train.py:1031] (3/4) Epoch 8, batch 5000, loss[loss=0.2089, simple_loss=0.2966, pruned_loss=0.06056, over 16898.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2999, pruned_loss=0.06428, over 30121083.55 frames. 
], batch size: 165, lr: 4.61e-03, grad_scale: 32.0 2023-10-10 20:11:26,626 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.431e+02 1.737e+02 1.940e+02 2.252e+02 3.066e+02, threshold=3.881e+02, percent-clipped=0.0 2023-10-10 20:11:34,352 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=469443.3333333333, ans=0.0 2023-10-10 20:11:38,459 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.whiten.whitening_limit, batch_count=469443.3333333333, ans=12.0 2023-10-10 20:11:41,677 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=469443.3333333333, ans=0.5 2023-10-10 20:11:43,944 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.96 vs. limit=15.0 2023-10-10 20:11:46,706 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=469490.0, ans=0.125 2023-10-10 20:12:37,792 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=469676.6666666667, ans=0.5 2023-10-10 20:12:52,645 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.45 vs. limit=15.0 2023-10-10 20:12:59,614 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=469770.0, ans=0.0 2023-10-10 20:13:14,959 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=469816.6666666667, ans=0.0 2023-10-10 20:13:26,061 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.496e+02 1.711e+02 1.855e+02 2.065e+02 2.861e+02, threshold=3.711e+02, percent-clipped=0.0 2023-10-10 20:13:45,580 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=469956.6666666667, ans=0.125 2023-10-10 20:14:27,654 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=470096.6666666667, ans=0.2 2023-10-10 20:14:30,207 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=470143.3333333333, ans=0.125 2023-10-10 20:14:54,836 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=470236.6666666667, ans=0.2 2023-10-10 20:15:12,878 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=470330.0, ans=0.125 2023-10-10 20:15:18,241 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.336e+02 1.724e+02 1.969e+02 2.190e+02 3.408e+02, threshold=3.937e+02, percent-clipped=0.0 2023-10-10 20:15:24,315 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=470376.6666666667, ans=0.2 2023-10-10 20:15:37,886 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=470423.3333333333, ans=0.125 2023-10-10 20:15:38,665 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=470423.3333333333, ans=0.0 2023-10-10 20:15:41,764 
INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=470423.3333333333, ans=0.2 2023-10-10 20:15:44,263 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=470423.3333333333, ans=0.025 2023-10-10 20:15:49,795 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=470470.0, ans=0.125 2023-10-10 20:16:00,038 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.76 vs. limit=15.0 2023-10-10 20:16:02,631 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=470516.6666666667, ans=0.125 2023-10-10 20:16:09,349 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=470516.6666666667, ans=0.125 2023-10-10 20:16:15,682 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=470563.3333333333, ans=0.0 2023-10-10 20:16:22,550 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=470563.3333333333, ans=0.0 2023-10-10 20:17:17,660 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.365e+02 1.659e+02 1.846e+02 2.005e+02 3.082e+02, threshold=3.691e+02, percent-clipped=0.0 2023-10-10 20:17:22,581 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=470843.3333333333, ans=0.125 2023-10-10 20:17:22,713 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=470843.3333333333, ans=0.125 2023-10-10 20:17:32,455 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.29 vs. limit=15.0 2023-10-10 20:17:34,554 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.68 vs. limit=15.0 2023-10-10 20:17:38,563 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=470890.0, ans=0.0 2023-10-10 20:17:49,434 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.78 vs. limit=12.0 2023-10-10 20:18:11,797 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=471030.0, ans=0.125 2023-10-10 20:18:22,876 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=471076.6666666667, ans=0.0 2023-10-10 20:18:23,092 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.35 vs. limit=15.0 2023-10-10 20:18:27,944 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.81 vs. 
limit=15.0 2023-10-10 20:18:28,472 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=471076.6666666667, ans=0.2 2023-10-10 20:18:28,569 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=471076.6666666667, ans=0.125 2023-10-10 20:18:29,552 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=471076.6666666667, ans=0.125 2023-10-10 20:18:41,397 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.17 vs. limit=12.0 2023-10-10 20:19:06,578 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=471216.6666666667, ans=0.125 2023-10-10 20:19:07,435 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=471216.6666666667, ans=0.025 2023-10-10 20:19:09,018 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=471216.6666666667, ans=10.0 2023-10-10 20:19:10,922 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=471263.3333333333, ans=0.125 2023-10-10 20:19:15,215 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.285e+02 1.623e+02 1.777e+02 2.124e+02 2.881e+02, threshold=3.553e+02, percent-clipped=0.0 2023-10-10 20:19:15,604 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=471263.3333333333, ans=0.025 2023-10-10 20:19:25,319 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_na.min_abs, batch_count=471310.0, ans=0.02 2023-10-10 20:19:28,164 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=471310.0, ans=0.125 2023-10-10 20:19:39,711 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.38 vs. limit=15.0 2023-10-10 20:19:44,941 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=471403.3333333333, ans=0.1 2023-10-10 20:19:46,886 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=471403.3333333333, ans=0.0 2023-10-10 20:19:57,864 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=471450.0, ans=0.125 2023-10-10 20:20:03,320 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=471450.0, ans=0.125 2023-10-10 20:20:17,915 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=471543.3333333333, ans=0.125 2023-10-10 20:20:22,574 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=471543.3333333333, ans=0.05 2023-10-10 20:20:33,408 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.08 vs. 
limit=10.0 2023-10-10 20:20:34,076 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=471590.0, ans=0.125 2023-10-10 20:21:03,684 INFO [train.py:1031] (3/4) Epoch 8, batch 5500, loss[loss=0.2051, simple_loss=0.2826, pruned_loss=0.06381, over 15538.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2997, pruned_loss=0.06419, over 30699775.27 frames. ], batch size: 35, lr: 4.60e-03, grad_scale: 16.0 2023-10-10 20:21:08,259 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=471730.0, ans=0.0 2023-10-10 20:21:10,764 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.383e+02 1.747e+02 1.995e+02 2.420e+02 4.219e+02, threshold=3.990e+02, percent-clipped=2.0 2023-10-10 20:21:11,790 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=471730.0, ans=0.125 2023-10-10 20:21:12,430 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=471730.0, ans=0.125 2023-10-10 20:21:34,646 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=471823.3333333333, ans=0.125 2023-10-10 20:21:44,669 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=471870.0, ans=0.125 2023-10-10 20:21:44,714 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=471870.0, ans=0.05 2023-10-10 20:21:44,835 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=471870.0, ans=0.125 2023-10-10 20:21:51,035 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=471916.6666666667, ans=0.0 2023-10-10 20:21:58,072 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=471916.6666666667, ans=0.1 2023-10-10 20:22:21,138 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.56 vs. limit=15.0 2023-10-10 20:22:23,271 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=472056.6666666667, ans=0.125 2023-10-10 20:22:52,178 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=472150.0, ans=0.0 2023-10-10 20:22:55,875 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=472150.0, ans=0.125 2023-10-10 20:22:59,991 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=472196.6666666667, ans=0.0 2023-10-10 20:23:04,718 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.322e+02 1.689e+02 1.951e+02 2.231e+02 3.232e+02, threshold=3.902e+02, percent-clipped=0.0 2023-10-10 20:23:09,205 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.02 vs. 
limit=10.0 2023-10-10 20:23:11,349 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=472243.3333333333, ans=0.0 2023-10-10 20:23:33,287 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=472336.6666666667, ans=0.1 2023-10-10 20:23:35,474 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.00 vs. limit=22.5 2023-10-10 20:23:57,708 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.20 vs. limit=15.0 2023-10-10 20:24:09,580 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=472476.6666666667, ans=0.125 2023-10-10 20:24:11,389 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=472476.6666666667, ans=0.2 2023-10-10 20:24:45,264 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.73 vs. limit=15.0 2023-10-10 20:24:58,269 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=472663.3333333333, ans=0.0 2023-10-10 20:25:01,320 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.334e+02 1.765e+02 2.000e+02 2.318e+02 3.494e+02, threshold=4.001e+02, percent-clipped=0.0 2023-10-10 20:25:05,562 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.82 vs. limit=10.0 2023-10-10 20:25:19,504 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=472756.6666666667, ans=0.1 2023-10-10 20:25:19,543 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=472756.6666666667, ans=0.125 2023-10-10 20:25:25,703 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 20:25:28,523 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=472803.3333333333, ans=0.125 2023-10-10 20:25:58,454 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=472896.6666666667, ans=0.1 2023-10-10 20:26:02,173 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.35 vs. 
limit=22.5 2023-10-10 20:26:09,540 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=472943.3333333333, ans=0.125 2023-10-10 20:26:46,713 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=473083.3333333333, ans=0.0 2023-10-10 20:26:54,889 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=473083.3333333333, ans=0.125 2023-10-10 20:27:05,285 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.430e+02 1.713e+02 1.881e+02 2.225e+02 3.408e+02, threshold=3.762e+02, percent-clipped=0.0 2023-10-10 20:27:06,257 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.49 vs. limit=15.0 2023-10-10 20:27:08,252 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=473176.6666666667, ans=0.125 2023-10-10 20:27:08,380 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.38 vs. limit=15.0 2023-10-10 20:27:41,485 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.69 vs. limit=6.0 2023-10-10 20:27:58,332 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=473363.3333333333, ans=0.2 2023-10-10 20:28:04,335 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=473363.3333333333, ans=0.1 2023-10-10 20:28:05,956 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=473363.3333333333, ans=0.1 2023-10-10 20:28:33,306 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=473503.3333333333, ans=0.125 2023-10-10 20:28:41,062 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.66 vs. 
limit=22.5 2023-10-10 20:28:44,963 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=473550.0, ans=0.0 2023-10-10 20:28:46,856 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=473550.0, ans=0.1 2023-10-10 20:28:47,732 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=473550.0, ans=0.0 2023-10-10 20:28:57,409 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=473596.6666666667, ans=0.1 2023-10-10 20:29:01,718 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=473596.6666666667, ans=0.1 2023-10-10 20:29:06,409 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.324e+02 1.645e+02 1.858e+02 2.035e+02 3.365e+02, threshold=3.715e+02, percent-clipped=0.0 2023-10-10 20:29:12,575 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=473643.3333333333, ans=0.2 2023-10-10 20:29:51,310 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=473783.3333333333, ans=0.0 2023-10-10 20:30:09,231 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=473876.6666666667, ans=0.125 2023-10-10 20:30:24,929 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=473923.3333333333, ans=0.05 2023-10-10 20:30:44,918 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.08 vs. limit=22.5 2023-10-10 20:30:45,637 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=474016.6666666667, ans=0.0 2023-10-10 20:30:46,943 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.84 vs. limit=12.0 2023-10-10 20:30:56,172 INFO [train.py:1031] (3/4) Epoch 8, batch 6000, loss[loss=0.1982, simple_loss=0.2868, pruned_loss=0.05483, over 16689.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.3002, pruned_loss=0.06448, over 31160077.24 frames. 
], batch size: 66, lr: 4.59e-03, grad_scale: 32.0 2023-10-10 20:31:04,225 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.383e+02 1.711e+02 1.821e+02 1.995e+02 2.939e+02, threshold=3.642e+02, percent-clipped=0.0 2023-10-10 20:31:18,598 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=474156.6666666667, ans=0.1 2023-10-10 20:31:19,439 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=474156.6666666667, ans=0.125 2023-10-10 20:31:19,562 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=474156.6666666667, ans=0.1 2023-10-10 20:31:23,427 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=474156.6666666667, ans=0.1 2023-10-10 20:31:23,556 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=474156.6666666667, ans=0.1 2023-10-10 20:31:24,709 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=474156.6666666667, ans=0.125 2023-10-10 20:31:41,220 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=474250.0, ans=0.0 2023-10-10 20:31:50,924 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=474250.0, ans=0.125 2023-10-10 20:32:15,273 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=474343.3333333333, ans=0.1 2023-10-10 20:32:18,838 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.79 vs. limit=6.0 2023-10-10 20:33:03,948 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.363e+02 1.669e+02 1.914e+02 2.181e+02 3.600e+02, threshold=3.828e+02, percent-clipped=0.0 2023-10-10 20:33:11,117 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.58 vs. limit=22.5 2023-10-10 20:33:16,549 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=474576.6666666667, ans=0.2 2023-10-10 20:33:18,839 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=474623.3333333333, ans=0.125 2023-10-10 20:33:26,203 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.87 vs. 
limit=15.0 2023-10-10 20:33:27,958 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=474623.3333333333, ans=0.125 2023-10-10 20:33:30,824 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=474670.0, ans=0.0 2023-10-10 20:33:45,682 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=474716.6666666667, ans=0.0 2023-10-10 20:33:51,209 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=474716.6666666667, ans=0.1 2023-10-10 20:34:40,081 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=474903.3333333333, ans=0.1 2023-10-10 20:34:40,602 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.48 vs. limit=15.0 2023-10-10 20:34:51,176 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=474950.0, ans=0.125 2023-10-10 20:35:02,804 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.482e+02 1.799e+02 2.007e+02 2.466e+02 3.282e+02, threshold=4.013e+02, percent-clipped=0.0 2023-10-10 20:35:25,263 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=475090.0, ans=0.0 2023-10-10 20:35:52,403 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=475183.3333333333, ans=0.125 2023-10-10 20:35:56,678 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=475230.0, ans=0.0 2023-10-10 20:36:18,557 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=475276.6666666667, ans=0.0 2023-10-10 20:36:22,500 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=475323.3333333333, ans=0.125 2023-10-10 20:36:27,534 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=475323.3333333333, ans=0.2 2023-10-10 20:36:36,738 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=475370.0, ans=0.0 2023-10-10 20:37:03,330 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=475463.3333333333, ans=0.1 2023-10-10 20:37:07,468 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=475463.3333333333, ans=0.125 2023-10-10 20:37:08,141 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.439e+02 1.758e+02 1.947e+02 2.235e+02 3.118e+02, threshold=3.893e+02, percent-clipped=0.0 2023-10-10 20:37:09,129 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=475463.3333333333, ans=0.125 2023-10-10 20:37:09,912 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=475463.3333333333, ans=0.0 2023-10-10 20:37:10,902 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=475510.0, ans=0.07 2023-10-10 20:37:21,690 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=475510.0, ans=0.125 2023-10-10 20:37:36,434 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=475556.6666666667, ans=0.125 2023-10-10 20:37:39,774 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=475603.3333333333, ans=0.125 2023-10-10 20:37:40,001 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=475603.3333333333, ans=0.125 2023-10-10 20:37:42,465 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=475603.3333333333, ans=0.125 2023-10-10 20:38:25,281 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=475743.3333333333, ans=0.125 2023-10-10 20:38:26,675 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.63 vs. limit=15.0 2023-10-10 20:38:29,520 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=475743.3333333333, ans=0.125 2023-10-10 20:38:50,393 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=475836.6666666667, ans=0.125 2023-10-10 20:39:06,893 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.23 vs. 
limit=6.0 2023-10-10 20:39:09,969 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=475883.3333333333, ans=0.0 2023-10-10 20:39:20,219 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.422e+02 1.653e+02 1.864e+02 2.131e+02 2.951e+02, threshold=3.727e+02, percent-clipped=0.0 2023-10-10 20:39:37,586 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=476023.3333333333, ans=0.125 2023-10-10 20:39:47,598 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=476070.0, ans=0.95 2023-10-10 20:39:56,583 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=476070.0, ans=0.125 2023-10-10 20:39:59,665 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=476116.6666666667, ans=0.0 2023-10-10 20:40:07,588 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=476116.6666666667, ans=0.125 2023-10-10 20:40:11,388 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=476163.3333333333, ans=0.1 2023-10-10 20:40:16,097 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=476163.3333333333, ans=0.0 2023-10-10 20:40:27,169 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=476210.0, ans=0.125 2023-10-10 20:40:45,694 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=476303.3333333333, ans=0.1 2023-10-10 20:40:54,138 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=476303.3333333333, ans=0.1 2023-10-10 20:41:07,889 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=476396.6666666667, ans=0.2 2023-10-10 20:41:08,579 INFO [train.py:1031] (3/4) Epoch 8, batch 6500, loss[loss=0.2112, simple_loss=0.3013, pruned_loss=0.06054, over 16888.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.3007, pruned_loss=0.06458, over 31519304.07 frames. 
], batch size: 72, lr: 4.58e-03, grad_scale: 32.0 2023-10-10 20:41:18,364 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.393e+02 1.746e+02 1.963e+02 2.272e+02 3.578e+02, threshold=3.925e+02, percent-clipped=0.0 2023-10-10 20:41:24,029 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=476443.3333333333, ans=0.0 2023-10-10 20:41:32,747 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=476443.3333333333, ans=0.1 2023-10-10 20:41:38,024 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=476490.0, ans=0.1 2023-10-10 20:41:58,773 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=476536.6666666667, ans=0.125 2023-10-10 20:41:59,829 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=476536.6666666667, ans=0.0 2023-10-10 20:42:01,024 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.61 vs. limit=10.0 2023-10-10 20:42:06,617 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=476583.3333333333, ans=0.0 2023-10-10 20:42:26,973 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=476630.0, ans=0.2 2023-10-10 20:42:27,832 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=476676.6666666667, ans=0.125 2023-10-10 20:42:46,739 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=476723.3333333333, ans=0.125 2023-10-10 20:43:00,166 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=476770.0, ans=0.025 2023-10-10 20:43:08,204 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.26 vs. 
limit=10.0 2023-10-10 20:43:12,751 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 20:43:24,754 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=476863.3333333333, ans=0.125 2023-10-10 20:43:26,251 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.460e+02 1.794e+02 2.019e+02 2.201e+02 3.279e+02, threshold=4.037e+02, percent-clipped=0.0 2023-10-10 20:43:40,076 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=476910.0, ans=0.0 2023-10-10 20:43:48,081 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=476956.6666666667, ans=0.0 2023-10-10 20:44:02,619 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=477003.3333333333, ans=0.2 2023-10-10 20:44:16,207 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=477050.0, ans=0.125 2023-10-10 20:44:18,300 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=477050.0, ans=0.0 2023-10-10 20:44:32,323 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.01 vs. limit=15.0 2023-10-10 20:44:35,719 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=477143.3333333333, ans=0.0 2023-10-10 20:44:59,886 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.70 vs. 
limit=10.0 2023-10-10 20:45:08,168 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=477283.3333333333, ans=0.125 2023-10-10 20:45:09,126 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=477283.3333333333, ans=0.0 2023-10-10 20:45:10,188 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=477283.3333333333, ans=0.125 2023-10-10 20:45:22,950 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=477330.0, ans=0.2 2023-10-10 20:45:23,652 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=477330.0, ans=0.125 2023-10-10 20:45:25,111 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.395e+02 1.604e+02 1.805e+02 2.018e+02 2.481e+02, threshold=3.609e+02, percent-clipped=0.0 2023-10-10 20:45:29,880 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=477376.6666666667, ans=0.125 2023-10-10 20:45:33,097 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=477376.6666666667, ans=0.1 2023-10-10 20:45:53,421 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=477470.0, ans=0.125 2023-10-10 20:45:57,898 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.03 vs. limit=15.0 2023-10-10 20:45:58,430 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=477470.0, ans=0.125 2023-10-10 20:46:03,381 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.85 vs. 
limit=6.0 2023-10-10 20:46:13,173 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=477516.6666666667, ans=0.1 2023-10-10 20:46:38,112 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=477610.0, ans=0.5 2023-10-10 20:46:40,705 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=477656.6666666667, ans=0.125 2023-10-10 20:46:43,201 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=477656.6666666667, ans=0.125 2023-10-10 20:47:00,215 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=477703.3333333333, ans=0.125 2023-10-10 20:47:03,000 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=477703.3333333333, ans=0.0 2023-10-10 20:47:30,637 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.387e+02 1.663e+02 1.840e+02 2.017e+02 2.985e+02, threshold=3.680e+02, percent-clipped=0.0 2023-10-10 20:47:56,342 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=477890.0, ans=0.125 2023-10-10 20:47:58,969 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=477890.0, ans=0.0 2023-10-10 20:48:23,459 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=477983.3333333333, ans=0.0 2023-10-10 20:48:31,066 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=477983.3333333333, ans=0.125 2023-10-10 20:48:38,422 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=478030.0, ans=0.125 2023-10-10 20:48:49,713 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=478076.6666666667, ans=0.0 2023-10-10 20:49:14,871 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=478170.0, ans=0.125 2023-10-10 20:49:14,951 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=478170.0, ans=0.1 2023-10-10 20:49:26,289 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=478216.6666666667, ans=0.125 2023-10-10 20:49:28,471 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.13 vs. limit=15.0 2023-10-10 20:49:29,169 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=478216.6666666667, ans=0.0 2023-10-10 20:49:35,243 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=478216.6666666667, ans=0.2 2023-10-10 20:49:46,644 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.38 vs. 
limit=15.0 2023-10-10 20:49:47,082 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.357e+02 1.642e+02 1.812e+02 2.106e+02 3.113e+02, threshold=3.625e+02, percent-clipped=0.0 2023-10-10 20:50:11,969 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=478403.3333333333, ans=0.125 2023-10-10 20:50:31,794 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=478496.6666666667, ans=0.125 2023-10-10 20:50:37,751 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=478496.6666666667, ans=0.125 2023-10-10 20:50:42,906 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=478543.3333333333, ans=0.0 2023-10-10 20:50:48,922 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.10 vs. limit=15.0 2023-10-10 20:50:55,227 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-10 20:50:57,150 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=478590.0, ans=0.0 2023-10-10 20:51:26,679 INFO [train.py:1031] (3/4) Epoch 8, batch 7000, loss[loss=0.1859, simple_loss=0.2751, pruned_loss=0.04835, over 15968.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.3008, pruned_loss=0.06424, over 31807765.63 frames. ], batch size: 43, lr: 4.57e-03, grad_scale: 16.0 2023-10-10 20:51:27,157 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.61 vs. 
limit=10.0 2023-10-10 20:51:28,113 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=478730.0, ans=0.0 2023-10-10 20:51:38,626 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.416e+02 1.754e+02 1.865e+02 2.126e+02 2.843e+02, threshold=3.731e+02, percent-clipped=0.0 2023-10-10 20:51:41,787 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=478776.6666666667, ans=0.0 2023-10-10 20:51:48,020 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=478776.6666666667, ans=0.125 2023-10-10 20:52:16,703 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=478870.0, ans=0.125 2023-10-10 20:52:24,525 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=478916.6666666667, ans=0.125 2023-10-10 20:52:46,367 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=479010.0, ans=0.125 2023-10-10 20:53:17,022 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=479150.0, ans=0.0 2023-10-10 20:53:20,570 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=479150.0, ans=0.125 2023-10-10 20:53:32,836 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=479196.6666666667, ans=0.125 2023-10-10 20:53:35,613 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 20:53:37,232 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.428e+02 1.730e+02 1.929e+02 2.124e+02 2.702e+02, threshold=3.858e+02, percent-clipped=0.0 2023-10-10 20:53:46,057 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=479243.3333333333, ans=0.125 2023-10-10 20:54:47,324 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=479523.3333333333, ans=0.125 2023-10-10 20:55:00,946 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.33 vs. limit=15.0 2023-10-10 20:55:02,026 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.19 vs. 
limit=15.0 2023-10-10 20:55:02,601 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=479570.0, ans=0.125 2023-10-10 20:55:16,295 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=479616.6666666667, ans=0.0 2023-10-10 20:55:19,297 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=479616.6666666667, ans=0.125 2023-10-10 20:55:25,570 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=479663.3333333333, ans=0.0 2023-10-10 20:55:29,422 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=479663.3333333333, ans=0.125 2023-10-10 20:55:38,426 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.444e+02 1.725e+02 1.978e+02 2.285e+02 3.639e+02, threshold=3.956e+02, percent-clipped=0.0 2023-10-10 20:56:21,789 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=479850.0, ans=0.5 2023-10-10 20:57:52,722 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.358e+02 1.627e+02 1.808e+02 2.015e+02 2.618e+02, threshold=3.616e+02, percent-clipped=0.0 2023-10-10 20:58:10,375 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 20:58:17,597 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=480223.3333333333, ans=0.04949747468305833 2023-10-10 20:58:36,114 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.96 vs. limit=15.0 2023-10-10 20:58:59,445 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=480410.0, ans=0.125 2023-10-10 20:58:59,482 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=480410.0, ans=0.125 2023-10-10 20:59:03,008 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=480410.0, ans=0.125 2023-10-10 20:59:20,020 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=480456.6666666667, ans=15.0 2023-10-10 20:59:23,980 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=480503.3333333333, ans=0.125 2023-10-10 20:59:24,265 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.60 vs. 
limit=15.0 2023-10-10 20:59:47,275 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=480596.6666666667, ans=0.0 2023-10-10 20:59:55,437 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.442e+02 1.683e+02 1.914e+02 2.576e+02 4.052e+02, threshold=3.828e+02, percent-clipped=3.0 2023-10-10 21:00:00,307 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=480643.3333333333, ans=0.2 2023-10-10 21:00:04,325 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 21:00:10,577 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.17 vs. limit=10.0 2023-10-10 21:00:19,364 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=480736.6666666667, ans=0.125 2023-10-10 21:00:25,130 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=480736.6666666667, ans=0.125 2023-10-10 21:00:32,528 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.65 vs. limit=10.0 2023-10-10 21:00:41,199 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=480830.0, ans=0.2 2023-10-10 21:00:54,438 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=480876.6666666667, ans=0.125 2023-10-10 21:00:54,461 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=480876.6666666667, ans=0.0 2023-10-10 21:01:28,743 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=16.71 vs. limit=22.5 2023-10-10 21:01:33,189 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=481016.6666666667, ans=0.0 2023-10-10 21:01:40,266 INFO [train.py:1031] (3/4) Epoch 8, batch 7500, loss[loss=0.2128, simple_loss=0.3012, pruned_loss=0.0622, over 16883.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.3005, pruned_loss=0.06421, over 31997734.91 frames. ], batch size: 130, lr: 4.56e-03, grad_scale: 16.0 2023-10-10 21:01:50,917 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.370e+02 1.703e+02 1.901e+02 2.198e+02 3.016e+02, threshold=3.802e+02, percent-clipped=0.0 2023-10-10 21:02:06,630 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=481156.6666666667, ans=0.1 2023-10-10 21:02:16,113 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=481203.3333333333, ans=0.125 2023-10-10 21:02:22,182 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.79 vs. 
limit=15.0 2023-10-10 21:02:28,057 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=481250.0, ans=0.0 2023-10-10 21:02:49,737 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=481296.6666666667, ans=0.125 2023-10-10 21:02:56,992 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=481343.3333333333, ans=0.125 2023-10-10 21:03:21,417 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.85 vs. limit=15.0 2023-10-10 21:03:32,448 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=481483.3333333333, ans=0.0 2023-10-10 21:03:36,164 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=481483.3333333333, ans=0.0 2023-10-10 21:03:37,275 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=481530.0, ans=0.2 2023-10-10 21:03:40,658 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff3.min_abs, batch_count=481530.0, ans=0.2 2023-10-10 21:03:48,372 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.309e+02 1.641e+02 1.846e+02 2.076e+02 2.938e+02, threshold=3.693e+02, percent-clipped=0.0 2023-10-10 21:03:50,286 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=481576.6666666667, ans=0.025 2023-10-10 21:03:57,462 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=481576.6666666667, ans=0.1 2023-10-10 21:04:05,283 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=481623.3333333333, ans=0.0 2023-10-10 21:04:06,499 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=481623.3333333333, ans=0.125 2023-10-10 21:04:07,985 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=481623.3333333333, ans=0.125 2023-10-10 21:04:17,616 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.34 vs. limit=10.0 2023-10-10 21:04:32,166 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=481670.0, ans=0.025 2023-10-10 21:04:35,363 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=481670.0, ans=0.0 2023-10-10 21:04:50,383 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.07 vs. 
limit=15.0 2023-10-10 21:04:51,127 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=481763.3333333333, ans=0.0 2023-10-10 21:05:31,103 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 21:05:31,135 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=481903.3333333333, ans=0.125 2023-10-10 21:05:39,963 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.14 vs. limit=15.0 2023-10-10 21:05:46,209 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.84 vs. limit=15.0 2023-10-10 21:06:00,787 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.301e+02 1.738e+02 1.959e+02 2.216e+02 3.155e+02, threshold=3.918e+02, percent-clipped=0.0 2023-10-10 21:06:05,774 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=482043.3333333333, ans=0.0 2023-10-10 21:06:44,038 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=482183.3333333333, ans=0.1 2023-10-10 21:06:46,795 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=482183.3333333333, ans=0.125 2023-10-10 21:07:20,509 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.24 vs. limit=22.5 2023-10-10 21:07:33,401 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.37 vs. limit=15.0 2023-10-10 21:07:41,777 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=482416.6666666667, ans=0.125 2023-10-10 21:07:57,516 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.288e+02 1.783e+02 2.049e+02 2.311e+02 3.170e+02, threshold=4.097e+02, percent-clipped=0.0 2023-10-10 21:08:00,640 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=482510.0, ans=0.125 2023-10-10 21:08:18,489 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=482556.6666666667, ans=0.125 2023-10-10 21:08:44,200 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.28 vs. 
limit=15.0 2023-10-10 21:08:54,738 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=482696.6666666667, ans=0.0 2023-10-10 21:08:57,889 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=482696.6666666667, ans=0.0 2023-10-10 21:09:55,512 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=482930.0, ans=0.125 2023-10-10 21:09:58,472 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.406e+02 1.688e+02 1.853e+02 2.144e+02 3.308e+02, threshold=3.705e+02, percent-clipped=0.0 2023-10-10 21:10:03,409 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=482976.6666666667, ans=0.05 2023-10-10 21:10:07,670 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=482976.6666666667, ans=0.125 2023-10-10 21:10:11,546 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=483023.3333333333, ans=0.125 2023-10-10 21:10:31,359 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=483070.0, ans=0.125 2023-10-10 21:11:12,097 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=483256.6666666667, ans=0.125 2023-10-10 21:11:25,433 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.33 vs. limit=10.0 2023-10-10 21:11:46,514 INFO [train.py:1031] (3/4) Epoch 8, batch 8000, loss[loss=0.2416, simple_loss=0.3035, pruned_loss=0.08983, over 15684.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2998, pruned_loss=0.0635, over 32204872.12 frames. ], batch size: 350, lr: 4.55e-03, grad_scale: 32.0 2023-10-10 21:11:57,429 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.313e+02 1.553e+02 1.723e+02 1.912e+02 2.996e+02, threshold=3.446e+02, percent-clipped=0.0 2023-10-10 21:12:05,539 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=483443.3333333333, ans=0.125 2023-10-10 21:12:16,718 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=483490.0, ans=0.0 2023-10-10 21:12:53,715 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=483676.6666666667, ans=0.125 2023-10-10 21:12:55,447 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=483676.6666666667, ans=0.125 2023-10-10 21:13:00,407 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=483676.6666666667, ans=0.125 2023-10-10 21:13:02,459 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.89 vs. 
limit=15.0 2023-10-10 21:13:12,973 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=483723.3333333333, ans=0.0 2023-10-10 21:13:23,264 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=483770.0, ans=0.125 2023-10-10 21:13:24,280 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=483770.0, ans=0.125 2023-10-10 21:13:33,233 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.74 vs. limit=15.0 2023-10-10 21:13:40,424 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=483863.3333333333, ans=0.125 2023-10-10 21:13:41,939 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=483863.3333333333, ans=0.125 2023-10-10 21:13:47,727 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.268e+02 1.855e+02 2.088e+02 2.446e+02 4.292e+02, threshold=4.177e+02, percent-clipped=2.0 2023-10-10 21:13:50,976 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=483910.0, ans=0.2 2023-10-10 21:14:04,408 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=483956.6666666667, ans=0.2 2023-10-10 21:14:12,587 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=484003.3333333333, ans=0.1 2023-10-10 21:14:40,975 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.63 vs. 
limit=22.5 2023-10-10 21:14:58,040 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=484143.3333333333, ans=0.125 2023-10-10 21:14:59,643 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=484143.3333333333, ans=0.2 2023-10-10 21:15:02,984 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=484143.3333333333, ans=0.125 2023-10-10 21:15:06,064 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=484143.3333333333, ans=0.0 2023-10-10 21:15:18,897 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=484190.0, ans=0.0 2023-10-10 21:15:20,775 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=484190.0, ans=0.125 2023-10-10 21:15:20,872 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=484190.0, ans=0.0 2023-10-10 21:15:30,376 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=484236.6666666667, ans=0.0 2023-10-10 21:15:49,473 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=484330.0, ans=0.1 2023-10-10 21:15:53,756 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=484330.0, ans=0.0 2023-10-10 21:16:01,437 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.392e+02 1.722e+02 1.890e+02 2.150e+02 3.149e+02, threshold=3.780e+02, percent-clipped=0.0 2023-10-10 21:16:15,328 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.52 vs. limit=22.5 2023-10-10 21:16:21,027 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=6.41 vs. limit=15.0 2023-10-10 21:17:01,519 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=484610.0, ans=0.05 2023-10-10 21:17:05,022 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=484610.0, ans=0.1 2023-10-10 21:17:05,069 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=484610.0, ans=0.2 2023-10-10 21:17:18,058 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=484656.6666666667, ans=0.1 2023-10-10 21:17:47,195 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=484796.6666666667, ans=0.1 2023-10-10 21:17:58,944 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.432e+02 1.629e+02 1.751e+02 1.999e+02 2.601e+02, threshold=3.502e+02, percent-clipped=0.0 2023-10-10 21:18:06,449 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.41 vs. 
limit=6.0 2023-10-10 21:18:11,399 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=484890.0, ans=0.1 2023-10-10 21:18:13,669 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=484890.0, ans=0.2 2023-10-10 21:18:32,464 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=484983.3333333333, ans=0.0 2023-10-10 21:18:35,683 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.65 vs. limit=22.5 2023-10-10 21:18:43,119 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=485030.0, ans=0.125 2023-10-10 21:18:44,834 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=485030.0, ans=0.0 2023-10-10 21:18:46,616 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=485030.0, ans=0.125 2023-10-10 21:18:48,614 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=485030.0, ans=0.125 2023-10-10 21:19:03,037 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=485123.3333333333, ans=0.2 2023-10-10 21:19:05,089 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=485123.3333333333, ans=0.09899494936611666 2023-10-10 21:19:17,542 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=485170.0, ans=0.125 2023-10-10 21:19:28,592 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=485216.6666666667, ans=0.1 2023-10-10 21:19:34,264 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.19 vs. limit=15.0 2023-10-10 21:19:44,953 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=485263.3333333333, ans=0.04949747468305833 2023-10-10 21:19:51,531 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.384e+02 1.685e+02 1.865e+02 2.061e+02 2.637e+02, threshold=3.731e+02, percent-clipped=0.0 2023-10-10 21:20:21,978 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=485403.3333333333, ans=0.1 2023-10-10 21:20:23,931 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=485403.3333333333, ans=0.2 2023-10-10 21:20:26,741 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=5.67 vs. limit=12.0 2023-10-10 21:20:26,756 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.04 vs. limit=10.0 2023-10-10 21:20:27,689 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.44 vs. 
limit=6.0 2023-10-10 21:20:38,073 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=485450.0, ans=0.2 2023-10-10 21:21:11,926 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=485590.0, ans=0.125 2023-10-10 21:21:35,249 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=485683.3333333333, ans=0.125 2023-10-10 21:21:43,106 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.50 vs. limit=15.0 2023-10-10 21:21:44,033 INFO [train.py:1031] (3/4) Epoch 8, batch 8500, loss[loss=0.2315, simple_loss=0.3142, pruned_loss=0.07441, over 16921.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.3001, pruned_loss=0.06345, over 32325145.58 frames. ], batch size: 110, lr: 4.54e-03, grad_scale: 32.0 2023-10-10 21:21:48,656 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=485730.0, ans=0.0 2023-10-10 21:21:54,110 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=485776.6666666667, ans=0.1 2023-10-10 21:21:54,848 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.448e+02 1.773e+02 2.025e+02 2.320e+02 3.386e+02, threshold=4.049e+02, percent-clipped=0.0 2023-10-10 21:22:12,453 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=485823.3333333333, ans=0.04949747468305833 2023-10-10 21:22:21,106 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.99 vs. limit=15.0 2023-10-10 21:23:27,726 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.86 vs. limit=22.5 2023-10-10 21:23:31,808 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=486150.0, ans=0.125 2023-10-10 21:23:55,399 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=486196.6666666667, ans=0.0 2023-10-10 21:24:02,151 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.348e+02 1.722e+02 2.039e+02 2.371e+02 3.182e+02, threshold=4.077e+02, percent-clipped=0.0 2023-10-10 21:24:24,731 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=486336.6666666667, ans=0.125 2023-10-10 21:24:32,492 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.54 vs. 
limit=10.0 2023-10-10 21:24:37,022 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=486383.3333333333, ans=10.0 2023-10-10 21:24:38,013 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=486383.3333333333, ans=0.125 2023-10-10 21:24:45,467 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=486383.3333333333, ans=0.0 2023-10-10 21:24:48,441 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.70 vs. limit=10.0 2023-10-10 21:24:59,110 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=486476.6666666667, ans=0.04949747468305833 2023-10-10 21:25:05,052 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=486476.6666666667, ans=0.125 2023-10-10 21:25:15,598 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=486523.3333333333, ans=0.5 2023-10-10 21:25:27,905 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=486570.0, ans=0.0 2023-10-10 21:25:28,913 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=486570.0, ans=0.125 2023-10-10 21:25:44,196 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=9.78 vs. limit=15.0 2023-10-10 21:25:46,434 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=486616.6666666667, ans=0.125 2023-10-10 21:26:00,458 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=486663.3333333333, ans=0.0 2023-10-10 21:26:00,840 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.84 vs. limit=15.0 2023-10-10 21:26:03,181 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.34 vs. limit=10.0 2023-10-10 21:26:04,176 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.227e+02 1.600e+02 1.762e+02 1.915e+02 2.781e+02, threshold=3.523e+02, percent-clipped=0.0 2023-10-10 21:26:06,866 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=486710.0, ans=0.125 2023-10-10 21:26:10,284 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=486710.0, ans=0.125 2023-10-10 21:26:12,524 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=3.96 vs. limit=12.0 2023-10-10 21:26:22,973 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.43 vs. 
limit=22.5 2023-10-10 21:26:25,890 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=486803.3333333333, ans=0.2 2023-10-10 21:26:26,155 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=6.51 vs. limit=15.0 2023-10-10 21:26:27,030 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=486803.3333333333, ans=0.05 2023-10-10 21:26:32,663 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=486803.3333333333, ans=0.125 2023-10-10 21:26:35,305 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=486803.3333333333, ans=0.1 2023-10-10 21:26:40,213 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.97 vs. limit=15.0 2023-10-10 21:26:56,732 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=486896.6666666667, ans=0.2 2023-10-10 21:27:12,655 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=486943.3333333333, ans=0.2 2023-10-10 21:27:14,019 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=486943.3333333333, ans=0.0 2023-10-10 21:27:21,537 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=486990.0, ans=0.125 2023-10-10 21:27:25,242 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=486990.0, ans=0.1 2023-10-10 21:27:53,931 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=487130.0, ans=0.1 2023-10-10 21:28:07,820 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.300e+02 1.597e+02 1.815e+02 2.037e+02 3.283e+02, threshold=3.630e+02, percent-clipped=0.0 2023-10-10 21:28:17,291 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=487223.3333333333, ans=0.125 2023-10-10 21:28:25,086 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.33 vs. limit=6.0 2023-10-10 21:28:27,343 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=487223.3333333333, ans=0.0 2023-10-10 21:28:30,432 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=487270.0, ans=0.125 2023-10-10 21:28:45,393 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.53 vs. 
limit=6.0 2023-10-10 21:28:48,673 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=487316.6666666667, ans=0.125 2023-10-10 21:28:54,654 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=487363.3333333333, ans=0.125 2023-10-10 21:29:05,576 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=7.51 vs. limit=15.0 2023-10-10 21:29:19,338 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=487456.6666666667, ans=0.125 2023-10-10 21:29:19,431 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=487456.6666666667, ans=0.2 2023-10-10 21:29:26,034 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.55 vs. limit=15.0 2023-10-10 21:29:42,877 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=487550.0, ans=0.125 2023-10-10 21:29:51,724 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=487596.6666666667, ans=0.0 2023-10-10 21:29:53,461 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=487596.6666666667, ans=0.0 2023-10-10 21:29:59,900 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.291e+02 1.748e+02 1.941e+02 2.174e+02 3.709e+02, threshold=3.882e+02, percent-clipped=1.0 2023-10-10 21:30:02,998 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=487643.3333333333, ans=0.0 2023-10-10 21:30:08,121 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=487643.3333333333, ans=0.05 2023-10-10 21:30:20,357 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=487736.6666666667, ans=0.125 2023-10-10 21:30:34,923 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.45 vs. limit=22.5 2023-10-10 21:30:48,557 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer_ff2.min_abs, batch_count=487830.0, ans=0.1 2023-10-10 21:30:49,723 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.97 vs. 
limit=15.0 2023-10-10 21:30:54,222 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=487876.6666666667, ans=0.2 2023-10-10 21:31:05,098 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=487923.3333333333, ans=0.0 2023-10-10 21:31:21,311 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=487970.0, ans=0.1 2023-10-10 21:31:30,599 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=488016.6666666667, ans=0.2 2023-10-10 21:31:35,424 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.71 vs. limit=22.5 2023-10-10 21:31:37,762 INFO [train.py:1031] (3/4) Epoch 8, batch 9000, loss[loss=0.2108, simple_loss=0.3022, pruned_loss=0.05964, over 16871.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2992, pruned_loss=0.06296, over 32440491.19 frames. ], batch size: 98, lr: 4.53e-03, grad_scale: 32.0 2023-10-10 21:31:38,001 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=488063.3333333333, ans=0.125 2023-10-10 21:31:38,062 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=488063.3333333333, ans=0.1 2023-10-10 21:31:39,141 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=488063.3333333333, ans=0.05 2023-10-10 21:31:48,116 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=488110.0, ans=0.125 2023-10-10 21:31:49,628 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.332e+02 1.802e+02 1.982e+02 2.288e+02 3.379e+02, threshold=3.964e+02, percent-clipped=0.0 2023-10-10 21:31:59,833 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=488156.6666666667, ans=0.0 2023-10-10 21:32:11,379 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=488203.3333333333, ans=0.125 2023-10-10 21:32:26,417 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=488250.0, ans=0.07 2023-10-10 21:32:29,485 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.92 vs. 
limit=6.0 2023-10-10 21:32:35,524 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=488296.6666666667, ans=0.0 2023-10-10 21:32:54,538 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=488390.0, ans=0.1 2023-10-10 21:32:56,116 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=488390.0, ans=0.125 2023-10-10 21:33:05,103 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=488436.6666666667, ans=0.0 2023-10-10 21:33:28,055 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=488530.0, ans=0.07 2023-10-10 21:33:29,012 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=488530.0, ans=0.125 2023-10-10 21:33:31,020 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=488530.0, ans=0.0 2023-10-10 21:33:34,902 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=488576.6666666667, ans=0.2 2023-10-10 21:33:36,511 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.280e+02 1.683e+02 1.925e+02 2.213e+02 3.576e+02, threshold=3.851e+02, percent-clipped=0.0 2023-10-10 21:33:46,806 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=488623.3333333333, ans=0.0 2023-10-10 21:33:51,910 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=488623.3333333333, ans=0.0 2023-10-10 21:34:06,024 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=488670.0, ans=0.125 2023-10-10 21:34:23,782 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=488763.3333333333, ans=0.0 2023-10-10 21:34:24,672 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=488763.3333333333, ans=0.1 2023-10-10 21:34:31,369 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.25 vs. limit=22.5 2023-10-10 21:34:49,831 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=488903.3333333333, ans=0.125 2023-10-10 21:34:54,238 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=488903.3333333333, ans=0.2 2023-10-10 21:34:58,481 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=488950.0, ans=0.2 2023-10-10 21:35:06,523 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=488950.0, ans=0.125 2023-10-10 21:35:08,052 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.97 vs. 
limit=15.0 2023-10-10 21:35:21,171 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=489043.3333333333, ans=0.125 2023-10-10 21:35:22,590 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.430e+02 1.814e+02 1.967e+02 2.239e+02 3.286e+02, threshold=3.934e+02, percent-clipped=0.0 2023-10-10 21:35:26,173 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=489043.3333333333, ans=0.0 2023-10-10 21:35:26,513 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.24 vs. limit=15.0 2023-10-10 21:35:28,251 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=489043.3333333333, ans=0.125 2023-10-10 21:35:31,082 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=489043.3333333333, ans=0.125 2023-10-10 21:35:31,972 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=489090.0, ans=0.125 2023-10-10 21:35:36,687 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=489090.0, ans=0.125 2023-10-10 21:35:38,617 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=489090.0, ans=0.1 2023-10-10 21:36:10,515 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=489230.0, ans=0.0 2023-10-10 21:36:37,189 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=489370.0, ans=0.0 2023-10-10 21:36:46,518 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=489416.6666666667, ans=0.035 2023-10-10 21:36:46,531 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=489416.6666666667, ans=0.0 2023-10-10 21:36:49,166 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=489416.6666666667, ans=0.125 2023-10-10 21:37:07,433 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.392e+02 1.812e+02 1.975e+02 2.235e+02 3.214e+02, threshold=3.950e+02, percent-clipped=0.0 2023-10-10 21:37:07,910 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=489510.0, ans=0.125 2023-10-10 21:37:11,838 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.60 vs. limit=15.0 2023-10-10 21:37:13,079 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.78 vs. 
limit=15.0 2023-10-10 21:37:24,061 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=489556.6666666667, ans=0.0 2023-10-10 21:37:26,222 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=489556.6666666667, ans=0.2 2023-10-10 21:37:27,167 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=489556.6666666667, ans=0.0 2023-10-10 21:37:45,432 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.76 vs. limit=6.0 2023-10-10 21:38:02,456 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.15 vs. limit=12.0 2023-10-10 21:38:16,211 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=489743.3333333333, ans=0.0 2023-10-10 21:38:52,649 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=489883.3333333333, ans=0.125 2023-10-10 21:38:56,120 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=489883.3333333333, ans=0.0 2023-10-10 21:38:57,405 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.71 vs. limit=22.5 2023-10-10 21:38:58,280 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=489930.0, ans=0.125 2023-10-10 21:39:12,386 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.427e+02 1.728e+02 1.912e+02 2.225e+02 3.045e+02, threshold=3.825e+02, percent-clipped=0.0 2023-10-10 21:39:13,012 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.97 vs. limit=15.0 2023-10-10 21:39:14,588 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.58 vs. limit=15.0 2023-10-10 21:39:36,793 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=490070.0, ans=0.125 2023-10-10 21:39:39,633 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=490070.0, ans=0.125 2023-10-10 21:40:00,568 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=490163.3333333333, ans=0.125 2023-10-10 21:40:06,902 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=490163.3333333333, ans=0.125 2023-10-10 21:40:08,005 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=490163.3333333333, ans=0.0 2023-10-10 21:40:28,286 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.17 vs. limit=15.0 2023-10-10 21:40:40,690 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.72 vs. 
limit=22.5 2023-10-10 21:40:58,637 INFO [train.py:1031] (3/4) Epoch 8, batch 9500, loss[loss=0.2165, simple_loss=0.3065, pruned_loss=0.06325, over 16846.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2999, pruned_loss=0.06331, over 32502177.46 frames. ], batch size: 110, lr: 4.52e-03, grad_scale: 16.0 2023-10-10 21:41:02,387 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=490396.6666666667, ans=0.125 2023-10-10 21:41:05,210 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.56 vs. limit=15.0 2023-10-10 21:41:11,916 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.458e+02 1.762e+02 2.058e+02 2.509e+02 4.568e+02, threshold=4.117e+02, percent-clipped=7.0 2023-10-10 21:41:14,303 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.82 vs. limit=15.0 2023-10-10 21:41:18,217 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=490443.3333333333, ans=0.2 2023-10-10 21:41:33,836 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=490536.6666666667, ans=0.07 2023-10-10 21:41:43,626 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=490583.3333333333, ans=0.125 2023-10-10 21:42:52,757 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=490863.3333333333, ans=0.2 2023-10-10 21:43:04,873 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.388e+02 1.705e+02 1.852e+02 2.033e+02 2.655e+02, threshold=3.705e+02, percent-clipped=0.0 2023-10-10 21:44:09,800 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=491143.3333333333, ans=0.125 2023-10-10 21:44:21,614 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=491190.0, ans=0.1 2023-10-10 21:44:33,916 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=491283.3333333333, ans=0.0 2023-10-10 21:44:38,375 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=491283.3333333333, ans=0.125 2023-10-10 21:44:42,585 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=491283.3333333333, ans=0.2 2023-10-10 21:44:57,383 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.354e+02 1.689e+02 1.867e+02 2.130e+02 2.848e+02, threshold=3.734e+02, percent-clipped=0.0 2023-10-10 21:45:26,675 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=491470.0, ans=0.0 2023-10-10 21:45:29,982 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=491516.6666666667, ans=0.0 2023-10-10 21:45:39,694 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=491563.3333333333, ans=0.2 2023-10-10 21:46:35,429 INFO [scaling.py:979] (3/4) Whitening: 
name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.38 vs. limit=10.0 2023-10-10 21:46:35,493 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.73 vs. limit=10.0 2023-10-10 21:46:48,590 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.20 vs. limit=22.5 2023-10-10 21:46:50,451 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=491796.6666666667, ans=0.07 2023-10-10 21:46:54,243 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.334e+02 1.654e+02 1.829e+02 2.208e+02 3.121e+02, threshold=3.658e+02, percent-clipped=0.0 2023-10-10 21:46:54,491 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=491843.3333333333, ans=0.0 2023-10-10 21:47:02,122 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=491843.3333333333, ans=0.1 2023-10-10 21:47:30,030 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=491983.3333333333, ans=0.1 2023-10-10 21:47:33,983 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=491983.3333333333, ans=0.1 2023-10-10 21:47:45,816 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.41 vs. limit=22.5 2023-10-10 21:48:01,488 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=492123.3333333333, ans=0.2 2023-10-10 21:48:02,306 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=492123.3333333333, ans=0.125 2023-10-10 21:48:03,691 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.81 vs. limit=22.5 2023-10-10 21:48:04,276 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=492123.3333333333, ans=0.125 2023-10-10 21:48:08,353 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.22 vs. limit=22.5 2023-10-10 21:48:15,537 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=492170.0, ans=0.125 2023-10-10 21:48:24,431 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.32 vs. limit=15.0 2023-10-10 21:48:48,822 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.419e+02 1.765e+02 1.966e+02 2.337e+02 3.374e+02, threshold=3.933e+02, percent-clipped=0.0 2023-10-10 21:48:58,070 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.07 vs. 
limit=15.0 2023-10-10 21:49:06,171 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=492356.6666666667, ans=0.0 2023-10-10 21:49:19,670 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=492450.0, ans=0.0 2023-10-10 21:49:24,670 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.93 vs. limit=15.0 2023-10-10 21:49:27,902 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=492496.6666666667, ans=0.125 2023-10-10 21:49:46,415 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=492543.3333333333, ans=0.125 2023-10-10 21:49:57,272 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=492590.0, ans=0.125 2023-10-10 21:49:58,317 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=492590.0, ans=0.125 2023-10-10 21:49:59,213 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=492636.6666666667, ans=0.1 2023-10-10 21:50:22,796 INFO [train.py:1031] (3/4) Epoch 8, batch 10000, loss[loss=0.2112, simple_loss=0.2979, pruned_loss=0.0623, over 16704.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2989, pruned_loss=0.06281, over 32577045.39 frames. ], batch size: 81, lr: 4.50e-03, grad_scale: 32.0 2023-10-10 21:50:32,472 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=492776.6666666667, ans=0.125 2023-10-10 21:50:35,653 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.405e+02 1.656e+02 1.863e+02 2.129e+02 3.621e+02, threshold=3.726e+02, percent-clipped=0.0 2023-10-10 21:51:23,652 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.61 vs. limit=15.0 2023-10-10 21:51:55,696 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.35 vs. limit=15.0 2023-10-10 21:52:03,034 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=493103.3333333333, ans=0.125 2023-10-10 21:52:30,864 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.420e+02 1.822e+02 2.037e+02 2.406e+02 3.425e+02, threshold=4.074e+02, percent-clipped=0.0 2023-10-10 21:52:31,306 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=493243.3333333333, ans=0.125 2023-10-10 21:53:01,146 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=493336.6666666667, ans=0.0 2023-10-10 21:53:16,407 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.41 vs. 
limit=15.0 2023-10-10 21:53:22,495 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=493430.0, ans=0.0 2023-10-10 21:53:39,421 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=493523.3333333333, ans=0.5 2023-10-10 21:53:48,835 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=493570.0, ans=0.1 2023-10-10 21:54:07,758 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=493616.6666666667, ans=0.125 2023-10-10 21:54:27,022 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=493710.0, ans=0.2 2023-10-10 21:54:29,165 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.302e+02 1.664e+02 1.870e+02 2.090e+02 3.364e+02, threshold=3.741e+02, percent-clipped=0.0 2023-10-10 21:54:44,632 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=493756.6666666667, ans=0.2 2023-10-10 21:54:44,881 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.84 vs. limit=15.0 2023-10-10 21:54:50,468 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.76 vs. limit=15.0 2023-10-10 21:55:00,677 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=493850.0, ans=0.125 2023-10-10 21:55:07,894 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.88 vs. limit=22.5 2023-10-10 21:55:17,957 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=493896.6666666667, ans=0.5 2023-10-10 21:55:19,893 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=493896.6666666667, ans=0.125 2023-10-10 21:55:40,293 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=493990.0, ans=0.125 2023-10-10 21:55:49,757 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=494036.6666666667, ans=0.125 2023-10-10 21:55:56,871 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=494036.6666666667, ans=0.125 2023-10-10 21:56:03,256 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=494083.3333333333, ans=0.0 2023-10-10 21:56:09,215 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.43 vs. 
limit=15.0 2023-10-10 21:56:10,223 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=494130.0, ans=0.0 2023-10-10 21:56:10,266 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=494130.0, ans=0.125 2023-10-10 21:56:14,128 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 21:56:21,475 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=494176.6666666667, ans=0.125 2023-10-10 21:56:24,365 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.25 vs. limit=15.0 2023-10-10 21:56:25,823 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=7.02 vs. limit=15.0 2023-10-10 21:56:26,118 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.254e+02 1.723e+02 1.942e+02 2.179e+02 3.073e+02, threshold=3.883e+02, percent-clipped=0.0 2023-10-10 21:56:38,198 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.83 vs. limit=15.0 2023-10-10 21:57:06,241 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=494316.6666666667, ans=0.125 2023-10-10 21:58:02,882 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.73 vs. limit=15.0 2023-10-10 21:58:22,403 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.349e+02 1.608e+02 1.751e+02 1.898e+02 2.611e+02, threshold=3.502e+02, percent-clipped=0.0 2023-10-10 21:58:27,305 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=494643.3333333333, ans=0.0 2023-10-10 21:58:27,632 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.35 vs. limit=10.0 2023-10-10 21:58:39,199 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.88 vs. limit=15.0 2023-10-10 21:58:52,656 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=494736.6666666667, ans=0.125 2023-10-10 21:58:57,480 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.74 vs. 
limit=10.0 2023-10-10 21:59:18,790 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=494876.6666666667, ans=0.0 2023-10-10 21:59:40,547 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=494923.3333333333, ans=0.0 2023-10-10 21:59:53,458 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=495016.6666666667, ans=0.125 2023-10-10 21:59:56,309 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=495016.6666666667, ans=0.125 2023-10-10 22:00:02,729 INFO [train.py:1031] (3/4) Epoch 8, batch 10500, loss[loss=0.1813, simple_loss=0.2779, pruned_loss=0.04235, over 16902.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2995, pruned_loss=0.06287, over 32636733.87 frames. ], batch size: 104, lr: 4.49e-03, grad_scale: 16.0 2023-10-10 22:00:06,318 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=495063.3333333333, ans=0.125 2023-10-10 22:00:17,883 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.294e+02 1.669e+02 1.876e+02 2.117e+02 3.190e+02, threshold=3.751e+02, percent-clipped=0.0 2023-10-10 22:00:18,328 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=495110.0, ans=0.1 2023-10-10 22:00:21,055 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=495110.0, ans=0.0 2023-10-10 22:00:21,128 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=495110.0, ans=0.0 2023-10-10 22:00:23,892 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=495156.6666666667, ans=0.1 2023-10-10 22:00:28,366 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=495156.6666666667, ans=0.125 2023-10-10 22:00:30,857 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=495156.6666666667, ans=0.125 2023-10-10 22:00:31,633 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=495156.6666666667, ans=0.125 2023-10-10 22:00:33,381 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=495156.6666666667, ans=0.035 2023-10-10 22:01:11,163 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=495343.3333333333, ans=0.0 2023-10-10 22:01:31,210 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=495390.0, ans=0.125 2023-10-10 22:01:43,326 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=495436.6666666667, ans=0.125 2023-10-10 22:01:44,830 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=495436.6666666667, ans=0.125 2023-10-10 22:02:02,199 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, 
batch_count=495530.0, ans=0.0 2023-10-10 22:02:15,758 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=495576.6666666667, ans=0.125 2023-10-10 22:02:20,344 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.383e+02 1.661e+02 1.844e+02 2.014e+02 2.750e+02, threshold=3.689e+02, percent-clipped=0.0 2023-10-10 22:02:22,781 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.64 vs. limit=15.0 2023-10-10 22:02:31,405 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=1.786e-02 2023-10-10 22:02:45,116 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=495670.0, ans=0.125 2023-10-10 22:02:49,496 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=495716.6666666667, ans=0.07 2023-10-10 22:02:49,825 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.67 vs. limit=10.0 2023-10-10 22:02:53,164 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=495716.6666666667, ans=0.0 2023-10-10 22:02:53,215 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 22:03:06,265 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=495763.3333333333, ans=0.125 2023-10-10 22:03:20,323 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=495856.6666666667, ans=0.2 2023-10-10 22:03:29,441 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=495856.6666666667, ans=0.1 2023-10-10 22:03:43,577 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=495903.3333333333, ans=0.125 2023-10-10 22:03:55,865 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.72 vs. limit=12.0 2023-10-10 22:04:07,670 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=495996.6666666667, ans=10.0 2023-10-10 22:04:14,750 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.414e+02 1.588e+02 1.733e+02 2.027e+02 2.793e+02, threshold=3.466e+02, percent-clipped=0.0 2023-10-10 22:04:22,038 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.62 vs. limit=22.5 2023-10-10 22:04:25,172 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=496090.0, ans=0.125 2023-10-10 22:05:04,278 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.37 vs. 
limit=15.0 2023-10-10 22:05:06,331 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=496276.6666666667, ans=0.09899494936611666 2023-10-10 22:05:19,436 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=496323.3333333333, ans=0.0 2023-10-10 22:05:24,748 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=496323.3333333333, ans=0.2 2023-10-10 22:05:44,438 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=496416.6666666667, ans=0.025 2023-10-10 22:05:54,543 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=496463.3333333333, ans=0.1 2023-10-10 22:06:02,271 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=496510.0, ans=0.95 2023-10-10 22:06:06,502 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.478e+02 1.889e+02 2.123e+02 2.345e+02 3.364e+02, threshold=4.245e+02, percent-clipped=0.0 2023-10-10 22:06:06,769 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=496510.0, ans=0.125 2023-10-10 22:06:07,984 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 22:06:32,450 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=496603.3333333333, ans=0.2 2023-10-10 22:06:34,469 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=496650.0, ans=0.0 2023-10-10 22:06:48,263 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=496696.6666666667, ans=0.0 2023-10-10 22:06:52,944 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=496696.6666666667, ans=0.125 2023-10-10 22:06:54,710 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=496743.3333333333, ans=0.1 2023-10-10 22:07:46,510 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.32 vs. limit=15.0 2023-10-10 22:07:58,465 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.315e+02 1.646e+02 1.893e+02 2.254e+02 3.198e+02, threshold=3.786e+02, percent-clipped=0.0 2023-10-10 22:07:59,851 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=496976.6666666667, ans=0.0 2023-10-10 22:08:05,153 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=497023.3333333333, ans=0.2 2023-10-10 22:08:06,171 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=497023.3333333333, ans=0.125 2023-10-10 22:08:16,609 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.95 vs. 
limit=6.0 2023-10-10 22:08:36,744 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.77 vs. limit=15.0 2023-10-10 22:08:48,119 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.39 vs. limit=15.0 2023-10-10 22:08:58,964 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.53 vs. limit=6.0 2023-10-10 22:09:01,022 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=497256.6666666667, ans=0.125 2023-10-10 22:09:11,018 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=497256.6666666667, ans=0.05 2023-10-10 22:09:37,032 INFO [train.py:1031] (3/4) Epoch 8, batch 11000, loss[loss=0.2069, simple_loss=0.2923, pruned_loss=0.06074, over 16606.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2994, pruned_loss=0.06275, over 32684023.86 frames. ], batch size: 66, lr: 4.48e-03, grad_scale: 32.0 2023-10-10 22:09:45,680 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.16 vs. limit=6.0 2023-10-10 22:09:46,156 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=497396.6666666667, ans=0.0 2023-10-10 22:09:50,452 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=497443.3333333333, ans=0.1 2023-10-10 22:09:52,469 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.414e+02 1.938e+02 2.212e+02 2.515e+02 3.317e+02, threshold=4.424e+02, percent-clipped=0.0 2023-10-10 22:09:58,129 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=497490.0, ans=0.2 2023-10-10 22:10:08,194 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=497536.6666666667, ans=0.125 2023-10-10 22:10:22,345 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=497583.3333333333, ans=0.125 2023-10-10 22:10:22,478 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=497583.3333333333, ans=0.1 2023-10-10 22:10:39,654 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=497630.0, ans=0.125 2023-10-10 22:10:53,722 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=497676.6666666667, ans=0.1 2023-10-10 22:10:57,772 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=497723.3333333333, ans=0.0 2023-10-10 22:11:17,223 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=497770.0, ans=0.125 2023-10-10 22:11:38,881 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=497863.3333333333, ans=0.125 2023-10-10 
22:11:53,612 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.374e+02 1.734e+02 1.925e+02 2.226e+02 3.161e+02, threshold=3.850e+02, percent-clipped=0.0 2023-10-10 22:11:54,909 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=497910.0, ans=0.125 2023-10-10 22:12:32,123 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=498050.0, ans=0.125 2023-10-10 22:12:33,865 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=498050.0, ans=0.125 2023-10-10 22:13:08,742 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.56 vs. limit=15.0 2023-10-10 22:13:30,868 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=498330.0, ans=0.125 2023-10-10 22:13:41,901 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=498376.6666666667, ans=0.2 2023-10-10 22:13:43,561 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=498376.6666666667, ans=0.125 2023-10-10 22:13:43,573 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=498376.6666666667, ans=0.1 2023-10-10 22:13:46,398 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.241e+02 1.580e+02 1.786e+02 2.051e+02 2.993e+02, threshold=3.573e+02, percent-clipped=0.0 2023-10-10 22:13:58,060 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.56 vs. limit=15.0 2023-10-10 22:13:59,618 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.50 vs. limit=12.0 2023-10-10 22:14:04,085 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=498470.0, ans=0.125 2023-10-10 22:14:05,567 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=498470.0, ans=0.0 2023-10-10 22:14:17,387 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=498516.6666666667, ans=0.0 2023-10-10 22:14:25,438 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.15 vs. limit=22.5 2023-10-10 22:14:36,327 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.43 vs. 
limit=22.5 2023-10-10 22:14:38,620 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=498563.3333333333, ans=0.0 2023-10-10 22:14:39,403 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=498563.3333333333, ans=0.1 2023-10-10 22:14:48,931 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=498610.0, ans=0.125 2023-10-10 22:15:07,512 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=498703.3333333333, ans=0.1 2023-10-10 22:15:25,468 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=498750.0, ans=0.2 2023-10-10 22:15:42,241 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=498796.6666666667, ans=0.125 2023-10-10 22:15:49,491 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.493e+02 1.780e+02 2.123e+02 2.745e+02 4.022e+02, threshold=4.246e+02, percent-clipped=3.0 2023-10-10 22:15:54,357 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=498890.0, ans=0.0 2023-10-10 22:15:57,773 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=498890.0, ans=0.125 2023-10-10 22:15:59,538 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.80 vs. limit=15.0 2023-10-10 22:16:34,724 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=499030.0, ans=0.1 2023-10-10 22:16:40,227 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=499076.6666666667, ans=0.09899494936611666 2023-10-10 22:16:40,244 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=499076.6666666667, ans=0.0 2023-10-10 22:16:51,063 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=499076.6666666667, ans=0.125 2023-10-10 22:16:56,808 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=499123.3333333333, ans=0.125 2023-10-10 22:16:57,914 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.30 vs. 
limit=15.0 2023-10-10 22:17:13,689 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=499216.6666666667, ans=0.0 2023-10-10 22:17:16,348 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=499216.6666666667, ans=0.2 2023-10-10 22:17:26,864 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=499263.3333333333, ans=0.09899494936611666 2023-10-10 22:17:31,116 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=499263.3333333333, ans=0.0 2023-10-10 22:17:41,135 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.353e+02 1.708e+02 1.901e+02 2.151e+02 3.057e+02, threshold=3.803e+02, percent-clipped=0.0 2023-10-10 22:17:42,452 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=499310.0, ans=0.2 2023-10-10 22:18:19,200 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=499450.0, ans=0.1 2023-10-10 22:18:32,481 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=499496.6666666667, ans=0.0 2023-10-10 22:18:33,781 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.59 vs. limit=6.0 2023-10-10 22:18:42,807 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=499543.3333333333, ans=0.125 2023-10-10 22:18:44,558 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=499543.3333333333, ans=0.0 2023-10-10 22:18:51,988 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.58 vs. limit=15.0 2023-10-10 22:18:57,701 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=499590.0, ans=0.0 2023-10-10 22:19:09,606 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=499636.6666666667, ans=0.125 2023-10-10 22:19:10,669 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=499636.6666666667, ans=0.125 2023-10-10 22:19:25,335 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=499683.3333333333, ans=0.125 2023-10-10 22:19:26,930 INFO [train.py:1031] (3/4) Epoch 8, batch 11500, loss[loss=0.2284, simple_loss=0.3217, pruned_loss=0.0675, over 16937.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2991, pruned_loss=0.06269, over 32699580.84 frames. ], batch size: 123, lr: 4.47e-03, grad_scale: 32.0 2023-10-10 22:19:43,578 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.440e+02 1.764e+02 1.968e+02 2.369e+02 3.431e+02, threshold=3.936e+02, percent-clipped=0.0 2023-10-10 22:19:44,943 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.43 vs. 
limit=22.5 2023-10-10 22:19:53,279 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=499823.3333333333, ans=0.1 2023-10-10 22:19:57,702 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=499823.3333333333, ans=0.2 2023-10-10 22:20:35,387 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=499963.3333333333, ans=0.0 2023-10-10 22:20:48,284 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.08 vs. limit=15.0 2023-10-10 22:21:26,568 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.27 vs. limit=22.5 2023-10-10 22:21:41,864 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.318e+02 1.624e+02 1.765e+02 1.938e+02 2.401e+02, threshold=3.531e+02, percent-clipped=0.0 2023-10-10 22:21:49,739 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=500290.0, ans=0.125 2023-10-10 22:21:53,413 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=500290.0, ans=0.0 2023-10-10 22:22:04,981 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=500336.6666666667, ans=0.125 2023-10-10 22:22:08,557 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=500383.3333333333, ans=0.1 2023-10-10 22:22:09,641 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=500383.3333333333, ans=0.1 2023-10-10 22:22:18,302 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=500383.3333333333, ans=0.125 2023-10-10 22:22:36,963 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=500476.6666666667, ans=0.2 2023-10-10 22:22:55,749 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=500570.0, ans=0.1 2023-10-10 22:23:03,638 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=500616.6666666667, ans=10.0 2023-10-10 22:23:12,881 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=500663.3333333333, ans=0.0 2023-10-10 22:23:19,202 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=500663.3333333333, ans=0.0 2023-10-10 22:23:27,255 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=500710.0, ans=0.125 2023-10-10 22:23:29,990 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.320e+02 1.709e+02 1.881e+02 2.039e+02 2.801e+02, threshold=3.762e+02, percent-clipped=0.0 2023-10-10 22:23:56,317 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 22:24:10,304 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=500850.0, ans=0.125 2023-10-10 22:24:10,316 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=500850.0, ans=0.125 2023-10-10 22:24:11,966 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=500850.0, ans=0.125 2023-10-10 22:25:12,952 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=501083.3333333333, ans=0.2 2023-10-10 22:25:18,385 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=501130.0, ans=0.1 2023-10-10 22:25:24,962 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.64 vs. limit=6.0 2023-10-10 22:25:27,926 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.21 vs. limit=12.0 2023-10-10 22:25:32,364 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=501176.6666666667, ans=0.0 2023-10-10 22:25:37,787 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.414e+02 1.775e+02 2.121e+02 2.570e+02 3.401e+02, threshold=4.242e+02, percent-clipped=0.0 2023-10-10 22:26:01,226 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=501270.0, ans=0.2 2023-10-10 22:26:10,545 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=501316.6666666667, ans=0.0 2023-10-10 22:26:33,866 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=501410.0, ans=0.1 2023-10-10 22:26:39,036 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.57 vs. limit=15.0 2023-10-10 22:26:41,560 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=501456.6666666667, ans=0.0 2023-10-10 22:27:06,933 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=501550.0, ans=0.0 2023-10-10 22:27:08,936 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=501550.0, ans=0.1 2023-10-10 22:27:32,828 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=501643.3333333333, ans=0.0 2023-10-10 22:27:34,282 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.410e+02 1.695e+02 1.850e+02 2.006e+02 3.432e+02, threshold=3.701e+02, percent-clipped=0.0 2023-10-10 22:27:37,595 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.12 vs. 
limit=10.0 2023-10-10 22:28:04,417 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=501783.3333333333, ans=0.0 2023-10-10 22:28:20,540 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 22:29:11,916 INFO [train.py:1031] (3/4) Epoch 8, batch 12000, loss[loss=0.2039, simple_loss=0.294, pruned_loss=0.0569, over 16880.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2993, pruned_loss=0.06253, over 32729566.38 frames. ], batch size: 130, lr: 4.46e-03, grad_scale: 32.0 2023-10-10 22:29:21,712 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=502063.3333333333, ans=0.2 2023-10-10 22:29:29,062 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.348e+02 1.648e+02 1.857e+02 2.122e+02 3.242e+02, threshold=3.715e+02, percent-clipped=0.0 2023-10-10 22:29:33,513 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=502156.6666666667, ans=0.04949747468305833 2023-10-10 22:29:37,676 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=502156.6666666667, ans=0.95 2023-10-10 22:29:37,999 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.34 vs. limit=6.0 2023-10-10 22:29:46,767 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.45 vs. limit=12.0 2023-10-10 22:29:48,839 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=502203.3333333333, ans=0.2 2023-10-10 22:30:30,995 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=502390.0, ans=22.5 2023-10-10 22:30:35,413 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=502390.0, ans=0.1 2023-10-10 22:31:00,502 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.48 vs. 
limit=15.0 2023-10-10 22:31:01,655 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=502483.3333333333, ans=0.0 2023-10-10 22:31:07,202 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=502530.0, ans=0.125 2023-10-10 22:31:19,731 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.333e+02 1.652e+02 1.829e+02 2.185e+02 3.535e+02, threshold=3.659e+02, percent-clipped=0.0 2023-10-10 22:31:23,894 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=502623.3333333333, ans=0.035 2023-10-10 22:31:28,450 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=502623.3333333333, ans=0.125 2023-10-10 22:31:37,599 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=2.825e-02 2023-10-10 22:31:37,822 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.64 vs. limit=15.0 2023-10-10 22:32:15,619 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=502810.0, ans=0.1 2023-10-10 22:32:21,005 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.53 vs. limit=10.0 2023-10-10 22:32:52,239 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.70 vs. limit=12.0 2023-10-10 22:32:57,944 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=502996.6666666667, ans=0.125 2023-10-10 22:33:08,173 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.398e+02 1.769e+02 1.987e+02 2.395e+02 3.208e+02, threshold=3.974e+02, percent-clipped=0.0 2023-10-10 22:33:26,709 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.57 vs. limit=10.0 2023-10-10 22:33:30,222 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn2.whiten.whitening_limit, batch_count=503136.6666666667, ans=22.5 2023-10-10 22:33:36,501 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=503183.3333333333, ans=0.125 2023-10-10 22:33:45,546 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=503230.0, ans=0.125 2023-10-10 22:34:00,963 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=503276.6666666667, ans=0.125 2023-10-10 22:34:28,571 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=503370.0, ans=0.0 2023-10-10 22:34:37,972 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=503416.6666666667, ans=0.2 2023-10-10 22:34:38,131 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.89 vs. 
limit=15.0 2023-10-10 22:34:40,897 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=503463.3333333333, ans=0.0 2023-10-10 22:34:52,186 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.64 vs. limit=12.0 2023-10-10 22:34:58,510 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.32 vs. limit=15.0 2023-10-10 22:34:58,725 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.476e+02 1.734e+02 1.985e+02 2.261e+02 4.632e+02, threshold=3.971e+02, percent-clipped=1.0 2023-10-10 22:35:12,878 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=503603.3333333333, ans=0.2 2023-10-10 22:35:19,630 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 22:35:29,475 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.40 vs. limit=22.5 2023-10-10 22:36:05,502 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=503790.0, ans=0.125 2023-10-10 22:36:18,148 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 22:36:38,049 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=503930.0, ans=0.125 2023-10-10 22:36:53,955 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.412e+02 1.766e+02 1.945e+02 2.216e+02 2.957e+02, threshold=3.889e+02, percent-clipped=0.0 2023-10-10 22:37:05,330 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=504023.3333333333, ans=0.125 2023-10-10 22:37:18,448 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=504070.0, ans=0.2 2023-10-10 22:37:33,050 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=504163.3333333333, ans=0.0 2023-10-10 22:37:57,609 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.90 vs. limit=10.0 2023-10-10 22:38:09,195 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=504303.3333333333, ans=0.1 2023-10-10 22:38:16,291 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=504303.3333333333, ans=0.0 2023-10-10 22:38:16,388 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=504303.3333333333, ans=0.2 2023-10-10 22:38:20,281 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=504350.0, ans=0.125 2023-10-10 22:38:21,644 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=504350.0, ans=0.0 2023-10-10 22:38:21,927 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.79 vs. 
limit=12.0 2023-10-10 22:38:30,908 INFO [train.py:1031] (3/4) Epoch 8, batch 12500, loss[loss=0.2017, simple_loss=0.297, pruned_loss=0.05322, over 16840.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2988, pruned_loss=0.06253, over 32735760.89 frames. ], batch size: 175, lr: 4.45e-03, grad_scale: 32.0 2023-10-10 22:38:36,705 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=504396.6666666667, ans=0.0 2023-10-10 22:38:48,670 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.381e+02 1.630e+02 1.848e+02 2.192e+02 3.388e+02, threshold=3.696e+02, percent-clipped=0.0 2023-10-10 22:39:03,990 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=504536.6666666667, ans=0.1 2023-10-10 22:39:10,125 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=504536.6666666667, ans=0.0 2023-10-10 22:39:12,158 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.61 vs. limit=15.0 2023-10-10 22:39:13,190 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.79 vs. limit=15.0 2023-10-10 22:39:13,836 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=504583.3333333333, ans=0.0 2023-10-10 22:39:22,781 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 22:39:24,631 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=504630.0, ans=0.125 2023-10-10 22:39:29,219 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.60 vs. limit=12.0 2023-10-10 22:39:33,733 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=504630.0, ans=0.035 2023-10-10 22:39:47,543 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=3.88 vs. 
limit=12.0 2023-10-10 22:40:25,113 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=504863.3333333333, ans=0.2 2023-10-10 22:40:28,939 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=504863.3333333333, ans=0.125 2023-10-10 22:40:35,001 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=504910.0, ans=0.0 2023-10-10 22:40:38,324 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.346e+02 1.706e+02 1.964e+02 2.198e+02 3.199e+02, threshold=3.928e+02, percent-clipped=0.0 2023-10-10 22:40:41,140 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=504910.0, ans=0.125 2023-10-10 22:40:57,387 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=505003.3333333333, ans=15.0 2023-10-10 22:41:00,869 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=505003.3333333333, ans=0.0 2023-10-10 22:41:21,704 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.46 vs. limit=22.5 2023-10-10 22:41:24,873 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=505096.6666666667, ans=0.2 2023-10-10 22:41:30,573 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.82 vs. limit=10.0 2023-10-10 22:42:03,027 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=505283.3333333333, ans=0.95 2023-10-10 22:42:10,093 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=505283.3333333333, ans=10.0 2023-10-10 22:42:19,364 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=505330.0, ans=0.0 2023-10-10 22:42:29,418 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.273e+02 1.647e+02 1.780e+02 2.047e+02 2.520e+02, threshold=3.560e+02, percent-clipped=0.0 2023-10-10 22:42:29,725 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=505376.6666666667, ans=0.1 2023-10-10 22:42:30,678 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=505376.6666666667, ans=0.125 2023-10-10 22:43:02,287 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=505516.6666666667, ans=0.125 2023-10-10 22:43:03,524 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=505516.6666666667, ans=0.125 2023-10-10 22:43:03,730 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.73 vs. 
limit=15.0 2023-10-10 22:43:25,375 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=505610.0, ans=0.1 2023-10-10 22:43:28,308 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=505656.6666666667, ans=0.1 2023-10-10 22:43:30,146 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=505656.6666666667, ans=0.125 2023-10-10 22:43:30,481 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.23 vs. limit=15.0 2023-10-10 22:43:37,755 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 22:43:40,022 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.48 vs. limit=15.0 2023-10-10 22:44:00,525 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=505796.6666666667, ans=0.125 2023-10-10 22:44:00,627 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=505796.6666666667, ans=0.07 2023-10-10 22:44:03,030 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=505796.6666666667, ans=0.125 2023-10-10 22:44:17,044 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.420e+02 1.731e+02 1.943e+02 2.160e+02 2.946e+02, threshold=3.886e+02, percent-clipped=0.0 2023-10-10 22:44:17,405 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=505843.3333333333, ans=0.2 2023-10-10 22:44:26,903 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=505890.0, ans=0.125 2023-10-10 22:44:29,739 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=505890.0, ans=0.0 2023-10-10 22:44:55,407 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=506030.0, ans=0.035 2023-10-10 22:45:05,738 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=506030.0, ans=0.1 2023-10-10 22:45:09,933 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=506076.6666666667, ans=0.0 2023-10-10 22:45:13,362 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=506076.6666666667, ans=0.0 2023-10-10 22:45:15,481 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.95 vs. 
limit=15.0 2023-10-10 22:45:17,202 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=506123.3333333333, ans=0.0 2023-10-10 22:45:33,787 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=506170.0, ans=0.1 2023-10-10 22:45:55,187 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_abs, batch_count=506263.3333333333, ans=0.5 2023-10-10 22:46:12,698 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.403e+02 1.695e+02 1.869e+02 2.112e+02 2.731e+02, threshold=3.738e+02, percent-clipped=0.0 2023-10-10 22:46:12,944 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=506310.0, ans=0.2 2023-10-10 22:46:13,007 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=506310.0, ans=0.0 2023-10-10 22:46:18,743 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=506356.6666666667, ans=0.125 2023-10-10 22:46:27,512 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=506403.3333333333, ans=0.1 2023-10-10 22:46:51,031 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=506496.6666666667, ans=0.125 2023-10-10 22:47:11,851 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=506590.0, ans=0.0 2023-10-10 22:47:17,413 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=506590.0, ans=0.125 2023-10-10 22:47:41,454 INFO [train.py:1031] (3/4) Epoch 8, batch 13000, loss[loss=0.1882, simple_loss=0.285, pruned_loss=0.04568, over 16864.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2996, pruned_loss=0.06272, over 32763793.37 frames. ], batch size: 98, lr: 4.44e-03, grad_scale: 32.0 2023-10-10 22:47:50,592 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=506776.6666666667, ans=0.2 2023-10-10 22:47:59,992 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.276e+02 1.651e+02 1.807e+02 2.072e+02 3.078e+02, threshold=3.614e+02, percent-clipped=0.0 2023-10-10 22:48:04,496 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 22:48:04,564 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=506823.3333333333, ans=0.2 2023-10-10 22:48:12,643 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=506823.3333333333, ans=0.1 2023-10-10 22:48:16,040 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=506823.3333333333, ans=0.125 2023-10-10 22:48:28,796 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.27 vs. 
limit=15.0 2023-10-10 22:48:32,586 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=506916.6666666667, ans=0.125 2023-10-10 22:48:35,188 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=506916.6666666667, ans=0.125 2023-10-10 22:49:09,484 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=507056.6666666667, ans=0.0 2023-10-10 22:49:16,551 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=507056.6666666667, ans=0.125 2023-10-10 22:49:17,722 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=507056.6666666667, ans=0.0 2023-10-10 22:49:21,311 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=507103.3333333333, ans=0.0 2023-10-10 22:49:28,666 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=507103.3333333333, ans=0.1 2023-10-10 22:49:47,836 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=507196.6666666667, ans=0.0 2023-10-10 22:49:54,675 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=6.08 vs. limit=15.0 2023-10-10 22:49:54,684 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.99 vs. limit=15.0 2023-10-10 22:50:01,702 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=507243.3333333333, ans=0.0 2023-10-10 22:50:02,308 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.347e+02 1.687e+02 1.929e+02 2.219e+02 3.912e+02, threshold=3.857e+02, percent-clipped=1.0 2023-10-10 22:50:10,764 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=507290.0, ans=0.125 2023-10-10 22:50:53,799 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=507476.6666666667, ans=10.0 2023-10-10 22:50:59,719 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=507476.6666666667, ans=0.05 2023-10-10 22:51:01,153 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=507476.6666666667, ans=0.2 2023-10-10 22:51:18,633 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=507570.0, ans=0.125 2023-10-10 22:51:23,470 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=507570.0, ans=0.125 2023-10-10 22:51:37,805 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=507616.6666666667, ans=0.0 2023-10-10 22:51:49,128 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=507663.3333333333, ans=0.125 2023-10-10 22:51:56,197 INFO [scaling.py:199] (3/4) 
2023-10-10 22:51:59,687 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.312e+02 1.675e+02 1.899e+02 2.135e+02 3.177e+02, threshold=3.797e+02, percent-clipped=0.0
2023-10-10 22:52:03,401 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.12 vs. limit=15.0
2023-10-10 22:52:18,046 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=507803.3333333333, ans=0.2
2023-10-10 22:52:25,308 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=507850.0, ans=0.125
2023-10-10 22:53:26,629 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=508083.3333333333, ans=0.035
2023-10-10 22:53:30,504 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=508083.3333333333, ans=0.125
2023-10-10 22:53:44,165 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.64 vs. limit=12.0
2023-10-10 22:53:52,165 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.386e+02 1.695e+02 1.901e+02 2.207e+02 3.570e+02, threshold=3.803e+02, percent-clipped=0.0
2023-10-10 22:53:54,109 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=508176.6666666667, ans=0.125
2023-10-10 22:54:17,343 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=508316.6666666667, ans=0.125
2023-10-10 22:54:18,255 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=508316.6666666667, ans=0.125
2023-10-10 22:54:18,276 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=508316.6666666667, ans=0.125
2023-10-10 22:54:38,873 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=508410.0, ans=0.0
2023-10-10 22:55:01,355 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=508503.3333333333, ans=0.2
2023-10-10 22:55:07,707 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.85 vs. limit=15.0
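
The [optim.py:471] lines above report ScaledAdam-style gradient clipping: quartiles of recent gradient norms, a clipping threshold, and the fraction of batches clipped. Throughout this log the threshold lands at about Clipping_scale times the median quartile (e.g. 2.0 x 1.899e+02 = 3.798e+02 against threshold=3.797e+02 above), so a plausible sketch keeps a window of recent gradient norms and clips against clipping_scale times their median. Treat the helper below as an illustration consistent with the logged numbers, not icefall's exact optimizer code.

    # Sketch of threshold computation consistent with the optim.py lines:
    # threshold ~= clipping_scale * median of recent gradient norms.
    from collections import deque
    import statistics

    class GradNormClipper:
        """Clip gradients against clipping_scale * median recent grad norm."""
        def __init__(self, clipping_scale=2.0, window=128):
            self.clipping_scale = clipping_scale
            self.norms = deque(maxlen=window)
            self.num_batches = 0
            self.num_clipped = 0

        def step(self, grad_norm: float) -> float:
            """Return the factor (<= 1.0) to scale this batch's gradients by."""
            self.norms.append(grad_norm)
            threshold = self.clipping_scale * statistics.median(self.norms)
            self.num_batches += 1
            if grad_norm > threshold:
                self.num_clipped += 1
                return threshold / grad_norm
            return 1.0

    clipper = GradNormClipper(clipping_scale=2.0)
    for g in [131.2, 167.5, 189.9, 213.5, 317.7, 390.0]:
        clipper.step(g)
    # percent-clipped over the logging interval; 0.0 here, and only the
    # occasional outlier batch yields percent-clipped=1.0 as in the log
    print(100.0 * clipper.num_clipped / clipper.num_batches)
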
2023-10-10 22:55:08,442 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=508503.3333333333, ans=0.2
2023-10-10 22:55:32,640 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=508596.6666666667, ans=0.0
2023-10-10 22:55:39,774 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=508643.3333333333, ans=0.125
2023-10-10 22:55:40,766 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=508643.3333333333, ans=0.125
2023-10-10 22:55:44,939 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.413e+02 1.709e+02 1.911e+02 2.190e+02 3.083e+02, threshold=3.821e+02, percent-clipped=0.0
2023-10-10 22:56:48,240 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=508923.3333333333, ans=0.1
2023-10-10 22:56:56,911 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=508970.0, ans=10.0
2023-10-10 22:56:59,814 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=508970.0, ans=0.07
2023-10-10 22:57:09,619 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=509016.6666666667, ans=0.1
2023-10-10 22:57:16,929 INFO [train.py:1031] (3/4) Epoch 8, batch 13500, loss[loss=0.206, simple_loss=0.2978, pruned_loss=0.05707, over 16816.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2989, pruned_loss=0.06244, over 32796046.71 frames. ], batch size: 175, lr: 4.43e-03, grad_scale: 32.0
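
The train.py:1031 lines decompose the pruned-transducer objective into simple_loss and pruned_loss. The totals in this log are consistent with loss = 0.5 * simple_loss + pruned_loss (0.5 x 0.2978 + 0.05707 = 0.206 in the entry above), i.e. a simple-loss scale of 0.5. The snippet below only verifies that arithmetic; it is a reading of the logged numbers, not the training code itself.

    # Sanity-check of the loss decomposition reported by train.py, assuming
    # total = simple_loss_scale * simple_loss + pruned_loss with
    # simple_loss_scale = 0.5, as the logged numbers suggest.
    def combined_loss(simple_loss: float, pruned_loss: float,
                      simple_loss_scale: float = 0.5) -> float:
        return simple_loss_scale * simple_loss + pruned_loss

    print(combined_loss(0.2978, 0.05707))  # ~0.206, matching loss[...] above
    print(combined_loss(0.2989, 0.06244))  # ~0.2119, matching tot_loss[...]
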
2023-10-10 22:57:29,521 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=509110.0, ans=0.125
2023-10-10 22:57:36,396 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.364e+02 1.597e+02 1.795e+02 2.069e+02 3.001e+02, threshold=3.590e+02, percent-clipped=0.0
2023-10-10 22:57:41,019 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=509156.6666666667, ans=0.2
2023-10-10 22:58:12,665 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=509250.0, ans=0.1
2023-10-10 22:58:18,526 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=509296.6666666667, ans=0.2
2023-10-10 22:58:45,803 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=509390.0, ans=0.1
2023-10-10 22:59:09,461 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=509530.0, ans=0.125
2023-10-10 22:59:26,398 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.454e+02 1.703e+02 1.895e+02 2.157e+02 2.727e+02, threshold=3.789e+02, percent-clipped=0.0
2023-10-10 22:59:26,627 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=509576.6666666667, ans=0.1
2023-10-10 22:59:59,653 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=509763.3333333333, ans=0.0
2023-10-10 23:00:33,951 INFO [train.py:1031] (3/4) Epoch 9, batch 0, loss[loss=0.1793, simple_loss=0.2681, pruned_loss=0.0453, over 16726.00 frames. ], tot_loss[loss=0.1793, simple_loss=0.2681, pruned_loss=0.0453, over 16726.00 frames. ], batch size: 241, lr: 4.15e-03, grad_scale: 32.0
2023-10-10 23:00:33,953 INFO [train.py:1054] (3/4) Computing validation loss
2023-10-10 23:00:42,272 INFO [train.py:1063] (3/4) Epoch 9, validation: loss=0.2237, simple_loss=0.3102, pruned_loss=0.06853, over 1020973.00 frames.
2023-10-10 23:00:42,273 INFO [train.py:1064] (3/4) Maximum memory allocated so far is 16891MB
2023-10-10 23:00:43,273 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.79 vs. limit=6.0
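
The Whitening lines above compare a per-module statistic against a scheduled limit. As far as I can tell from icefall's scaling.py, the metric measures how far the covariance of a module's output is from a multiple of the identity: the ratio of the mean squared eigenvalue to the squared mean eigenvalue, which is 1.0 for perfectly whitened features, with a penalty applied only while the metric exceeds the limit. A PyTorch sketch of that statistic (an interpretation, not the verbatim icefall code):

    # Sketch of the whitening metric implied by the "metric=X vs. limit=Y"
    # lines: ratio of mean squared eigenvalue of the feature covariance to
    # the squared mean eigenvalue (1.0 == perfectly whitened). Assumption
    # based on a reading of icefall's scaling.py, not a verbatim copy.
    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> torch.Tensor:
        # x: (num_frames, num_channels); channels are split into groups
        n, c = x.shape
        x = x.reshape(n, num_groups, c // num_groups).transpose(0, 1)
        x = x - x.mean(dim=1, keepdim=True)
        cov = torch.matmul(x.transpose(1, 2), x) / n  # (num_groups, d, d)
        d = cov.shape[-1]
        mean_eig = cov.diagonal(dim1=1, dim2=2).sum(-1) / d    # trace(cov)/d
        mean_sq_eig = (cov * cov).sum(dim=(1, 2)) / d          # trace(cov^2)/d
        return (mean_sq_eig / mean_eig ** 2).mean()

    x = torch.randn(1000, 128)                 # already nearly white
    print(whitening_metric(x, num_groups=4))   # close to 1.0, well under limit=6.0
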
2023-10-10 23:00:45,860 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=509786.6666666667, ans=0.2
2023-10-10 23:01:14,582 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=509880.0, ans=0.0
2023-10-10 23:01:16,421 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=509926.6666666667, ans=0.05
2023-10-10 23:01:38,699 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=509973.3333333333, ans=0.05
2023-10-10 23:01:42,681 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-10 23:01:53,752 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=510066.6666666667, ans=0.05
2023-10-10 23:01:55,617 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.258e+02 1.657e+02 1.836e+02 2.096e+02 3.061e+02, threshold=3.671e+02, percent-clipped=0.0
2023-10-10 23:01:56,340 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=510066.6666666667, ans=0.125
2023-10-10 23:02:05,573 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.11 vs. limit=15.0
2023-10-10 23:02:09,783 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=510113.3333333333, ans=0.125
2023-10-10 23:02:11,901 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.61 vs. limit=15.0
2023-10-10 23:02:15,294 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=510160.0, ans=0.1
2023-10-10 23:02:19,839 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=510160.0, ans=0.1
2023-10-10 23:02:30,341 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.15 vs. limit=22.5
2023-10-10 23:02:32,116 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=510206.6666666667, ans=0.0
2023-10-10 23:02:32,989 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=510206.6666666667, ans=0.0
2023-10-10 23:02:46,965 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=510300.0, ans=0.0
2023-10-10 23:02:47,106 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.76 vs. limit=12.0
2023-10-10 23:03:11,676 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=510393.3333333333, ans=0.1
2023-10-10 23:03:13,307 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.54 vs. limit=15.0
2023-10-10 23:03:25,033 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.01 vs. limit=15.0
2023-10-10 23:03:26,843 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=510440.0, ans=0.125
2023-10-10 23:03:41,789 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=510533.3333333333, ans=0.07
2023-10-10 23:03:45,550 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.347e+02 1.731e+02 2.020e+02 2.304e+02 2.929e+02, threshold=4.040e+02, percent-clipped=0.0
2023-10-10 23:04:07,901 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.66 vs. limit=6.0
2023-10-10 23:04:39,174 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.41 vs. limit=15.0
2023-10-10 23:04:43,288 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=510766.6666666667, ans=0.125
2023-10-10 23:04:45,402 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=510766.6666666667, ans=0.0
2023-10-10 23:04:49,647 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=510813.3333333333, ans=0.125
2023-10-10 23:04:59,439 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=510813.3333333333, ans=0.07
2023-10-10 23:05:12,894 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=510906.6666666667, ans=0.0
2023-10-10 23:05:18,594 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=510906.6666666667, ans=0.0
2023-10-10 23:05:20,010 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=510906.6666666667, ans=0.125
2023-10-10 23:05:43,133 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.392e+02 1.650e+02 1.775e+02 2.030e+02 2.890e+02, threshold=3.550e+02, percent-clipped=0.0
2023-10-10 23:05:46,320 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2023-10-10 23:05:49,115 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=511000.0, ans=0.125
2023-10-10 23:05:51,516 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=511046.6666666667, ans=0.125
2023-10-10 23:06:21,583 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=511140.0, ans=0.125
2023-10-10 23:06:25,368 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=511186.6666666667, ans=0.1
2023-10-10 23:06:32,865 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=511186.6666666667, ans=0.125
2023-10-10 23:06:42,771 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=511233.3333333333, ans=0.125
2023-10-10 23:06:58,754 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=511326.6666666667, ans=0.125
2023-10-10 23:07:02,110 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=511326.6666666667, ans=0.5
2023-10-10 23:07:04,471 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=511326.6666666667, ans=0.1
2023-10-10 23:07:12,768 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=511373.3333333333, ans=0.125
2023-10-10 23:07:19,424 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=511420.0, ans=0.0
2023-10-10 23:07:21,297 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=511420.0, ans=0.125
2023-10-10 23:07:21,300 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=511420.0, ans=0.125
2023-10-10 23:07:24,759 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=511420.0, ans=0.125
2023-10-10 23:07:26,977 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.40 vs. limit=12.0
2023-10-10 23:07:31,630 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=511466.6666666667, ans=0.0
2023-10-10 23:07:33,260 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.384e+02 1.664e+02 1.856e+02 2.171e+02 3.433e+02, threshold=3.712e+02, percent-clipped=0.0
2023-10-10 23:07:41,543 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=511513.3333333333, ans=0.125
2023-10-10 23:07:56,963 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.55 vs. limit=15.0
2023-10-10 23:08:01,361 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.05 vs. limit=22.5
2023-10-10 23:08:14,209 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=511653.3333333333, ans=0.1
2023-10-10 23:08:47,653 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=511793.3333333333, ans=0.125
2023-10-10 23:09:22,677 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2023-10-10 23:09:24,247 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.364e+02 1.654e+02 1.828e+02 1.989e+02 2.734e+02, threshold=3.656e+02, percent-clipped=0.0
2023-10-10 23:09:28,118 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.02 vs. limit=15.0
2023-10-10 23:09:34,366 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.max_abs, batch_count=511980.0, ans=10.0
2023-10-10 23:09:38,684 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=511980.0, ans=0.04949747468305833
2023-10-10 23:09:39,792 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=511980.0, ans=0.09899494936611666
2023-10-10 23:10:02,378 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=512073.3333333333, ans=0.0
2023-10-10 23:10:06,664 INFO [train.py:1031] (3/4) Epoch 9, batch 500, loss[loss=0.2171, simple_loss=0.2742, pruned_loss=0.08001, over 12500.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2985, pruned_loss=0.06218, over 7287633.96 frames. ], batch size: 440, lr: 4.14e-03, grad_scale: 32.0
2023-10-10 23:10:21,624 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=512166.6666666667, ans=0.5
2023-10-10 23:10:24,626 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=512166.6666666667, ans=0.125
2023-10-10 23:10:38,697 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=512213.3333333333, ans=0.125
2023-10-10 23:11:10,056 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=512353.3333333333, ans=0.2
2023-10-10 23:11:11,045 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=512353.3333333333, ans=0.2
2023-10-10 23:11:16,206 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.373e+02 1.668e+02 1.855e+02 2.269e+02 3.835e+02, threshold=3.711e+02, percent-clipped=1.0
2023-10-10 23:11:22,543 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=512400.0, ans=0.0
2023-10-10 23:11:24,428 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=512446.6666666667, ans=0.1
2023-10-10 23:11:56,838 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=512586.6666666667, ans=0.2
2023-10-10 23:12:01,178 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=512586.6666666667, ans=0.2
2023-10-10 23:12:07,392 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=512633.3333333333, ans=0.125
2023-10-10 23:12:49,092 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=512820.0, ans=0.1
2023-10-10 23:12:54,432 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=512820.0, ans=0.1
2023-10-10 23:12:55,254 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=512820.0, ans=0.125
2023-10-10 23:12:56,834 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=512820.0, ans=0.0
2023-10-10 23:13:02,926 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.209e+02 1.685e+02 1.884e+02 2.103e+02 2.806e+02, threshold=3.768e+02, percent-clipped=0.0
2023-10-10 23:13:05,688 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=512866.6666666667, ans=0.125
2023-10-10 23:13:11,605 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.33 vs. limit=15.0
2023-10-10 23:13:18,321 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=512913.3333333333, ans=0.125
2023-10-10 23:13:55,386 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=513100.0, ans=0.2
2023-10-10 23:14:28,695 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00
2023-10-10 23:14:30,144 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=513240.0, ans=0.2
2023-10-10 23:14:37,651 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=513240.0, ans=0.125
2023-10-10 23:14:54,548 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.426e+02 1.779e+02 2.127e+02 2.346e+02 3.752e+02, threshold=4.255e+02, percent-clipped=0.0
2023-10-10 23:14:59,005 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=513333.3333333333, ans=0.0
2023-10-10 23:15:00,062 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00
2023-10-10 23:15:02,069 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.64 vs. limit=15.0
2023-10-10 23:15:08,399 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=513380.0, ans=0.125
2023-10-10 23:15:26,411 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_na.min_abs, batch_count=513473.3333333333, ans=0.02
2023-10-10 23:15:32,421 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=513520.0, ans=0.0
2023-10-10 23:15:33,454 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=513520.0, ans=0.07
2023-10-10 23:16:35,561 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=513753.3333333333, ans=0.0
2023-10-10 23:16:41,266 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.33 vs. limit=22.5
2023-10-10 23:16:53,165 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.428e+02 1.736e+02 1.976e+02 2.360e+02 3.214e+02, threshold=3.952e+02, percent-clipped=0.0
2023-10-10 23:17:03,109 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-10 23:17:08,162 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=513846.6666666667, ans=0.125
2023-10-10 23:17:20,109 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.42 vs. limit=15.0
2023-10-10 23:17:33,436 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=513986.6666666667, ans=0.0
2023-10-10 23:17:37,649 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=513986.6666666667, ans=0.125
2023-10-10 23:17:58,742 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=514080.0, ans=0.0
2023-10-10 23:18:12,173 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.90 vs. limit=10.0
2023-10-10 23:18:22,441 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.45 vs. limit=15.0
2023-10-10 23:18:33,188 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=514220.0, ans=0.95
2023-10-10 23:18:44,675 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.334e+02 1.626e+02 1.743e+02 1.973e+02 2.695e+02, threshold=3.486e+02, percent-clipped=0.0
2023-10-10 23:19:14,718 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.66 vs. limit=6.0
2023-10-10 23:19:16,636 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=514406.6666666667, ans=0.125
2023-10-10 23:19:16,663 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=514406.6666666667, ans=0.0
2023-10-10 23:19:23,015 INFO [train.py:1031] (3/4) Epoch 9, batch 1000, loss[loss=0.1992, simple_loss=0.2939, pruned_loss=0.0522, over 16933.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2995, pruned_loss=0.06299, over 12929286.07 frames. ], batch size: 123, lr: 4.13e-03, grad_scale: 32.0
2023-10-10 23:19:26,669 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=514453.3333333333, ans=0.0
2023-10-10 23:19:33,426 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.76 vs. limit=10.0
2023-10-10 23:19:46,667 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=514546.6666666667, ans=0.0
2023-10-10 23:20:29,555 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.389e+02 1.675e+02 1.911e+02 2.143e+02 3.098e+02, threshold=3.821e+02, percent-clipped=0.0
2023-10-10 23:20:51,981 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.33 vs. limit=15.0
2023-10-10 23:21:07,323 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=514873.3333333333, ans=0.1
2023-10-10 23:21:20,518 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=514920.0, ans=0.07
2023-10-10 23:21:29,024 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=514966.6666666667, ans=0.125
2023-10-10 23:21:48,514 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2023-10-10 23:21:52,257 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=515060.0, ans=0.125
2023-10-10 23:22:03,532 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=515106.6666666667, ans=0.015
2023-10-10 23:22:03,644 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=515106.6666666667, ans=0.0
2023-10-10 23:22:17,152 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.85 vs. limit=10.0
2023-10-10 23:22:29,714 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.351e+02 1.649e+02 1.859e+02 2.076e+02 2.784e+02, threshold=3.719e+02, percent-clipped=0.0
2023-10-10 23:22:43,572 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.62 vs. limit=10.0
2023-10-10 23:22:53,598 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=515293.3333333333, ans=0.125
2023-10-10 23:22:57,532 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=515293.3333333333, ans=0.0
2023-10-10 23:23:04,013 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=515340.0, ans=0.125
2023-10-10 23:23:24,958 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=515433.3333333333, ans=0.0
2023-10-10 23:23:33,249 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=515480.0, ans=0.125
2023-10-10 23:23:45,800 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=515526.6666666667, ans=0.125
2023-10-10 23:23:46,110 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.58 vs. limit=15.0
2023-10-10 23:24:01,218 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=515573.3333333333, ans=0.09899494936611666
2023-10-10 23:24:16,641 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=515620.0, ans=0.125
2023-10-10 23:24:24,487 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.417e+02 1.675e+02 1.838e+02 2.119e+02 2.965e+02, threshold=3.676e+02, percent-clipped=0.0
2023-10-10 23:24:37,592 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=515713.3333333333, ans=0.125
2023-10-10 23:24:50,276 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=515760.0, ans=0.0
2023-10-10 23:24:53,600 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=515806.6666666667, ans=0.2
2023-10-10 23:24:59,364 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.13 vs. limit=15.0
2023-10-10 23:25:05,243 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=515853.3333333333, ans=0.0
2023-10-10 23:25:11,973 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.44 vs. limit=22.5
2023-10-10 23:25:14,988 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.99 vs. limit=22.5
2023-10-10 23:25:40,763 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.14 vs. limit=6.0
2023-10-10 23:25:45,830 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=515993.3333333333, ans=0.125
2023-10-10 23:26:05,054 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=516086.6666666667, ans=0.1
2023-10-10 23:26:07,709 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.75 vs. limit=22.5
2023-10-10 23:26:14,381 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.256e+02 1.665e+02 1.833e+02 2.003e+02 2.892e+02, threshold=3.667e+02, percent-clipped=0.0
2023-10-10 23:26:14,905 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=516133.3333333333, ans=0.125
2023-10-10 23:26:20,824 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=516180.0, ans=0.125
2023-10-10 23:26:29,259 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=516180.0, ans=0.125
2023-10-10 23:26:52,343 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=516320.0, ans=0.125
2023-10-10 23:26:52,620 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.82 vs. limit=12.0
2023-10-10 23:27:08,465 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=516366.6666666667, ans=0.07
2023-10-10 23:27:09,464 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=516366.6666666667, ans=0.1
2023-10-10 23:27:18,526 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=516413.3333333333, ans=0.0
2023-10-10 23:27:19,336 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=516413.3333333333, ans=0.125
2023-10-10 23:27:21,164 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=516413.3333333333, ans=0.09899494936611666
2023-10-10 23:27:33,816 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=516460.0, ans=0.0
2023-10-10 23:27:54,129 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=516553.3333333333, ans=0.125
2023-10-10 23:27:55,376 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=516553.3333333333, ans=0.1
2023-10-10 23:28:00,945 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=516553.3333333333, ans=0.125
2023-10-10 23:28:08,825 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.324e+02 1.593e+02 1.810e+02 1.947e+02 2.398e+02, threshold=3.620e+02, percent-clipped=0.0
2023-10-10 23:28:11,939 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.45 vs. limit=15.0
2023-10-10 23:28:13,278 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=516600.0, ans=0.0
2023-10-10 23:28:16,498 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.04 vs. limit=15.0
2023-10-10 23:28:49,635 INFO [train.py:1031] (3/4) Epoch 9, batch 1500, loss[loss=0.2143, simple_loss=0.3012, pruned_loss=0.06371, over 16677.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2973, pruned_loss=0.06187, over 17328606.48 frames. ], batch size: 202, lr: 4.12e-03, grad_scale: 32.0
2023-10-10 23:28:57,621 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.00 vs. limit=15.0
2023-10-10 23:29:02,019 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=516833.3333333333, ans=0.125
2023-10-10 23:29:06,108 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=516833.3333333333, ans=0.0
2023-10-10 23:29:09,713 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=516833.3333333333, ans=0.0
2023-10-10 23:29:10,607 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=516833.3333333333, ans=0.125
2023-10-10 23:29:22,394 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=8.03 vs. limit=12.0
2023-10-10 23:29:59,129 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=517020.0, ans=0.1
2023-10-10 23:30:02,150 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=517066.6666666667, ans=0.125
2023-10-10 23:30:04,620 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.412e+02 1.725e+02 1.886e+02 2.081e+02 2.812e+02, threshold=3.773e+02, percent-clipped=0.0
2023-10-10 23:30:06,856 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=517066.6666666667, ans=0.125
2023-10-10 23:30:10,768 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=517113.3333333333, ans=0.0
2023-10-10 23:30:13,023 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.05 vs. limit=15.0
2023-10-10 23:30:29,272 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=517160.0, ans=0.125
2023-10-10 23:30:30,500 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.82 vs. limit=15.0
2023-10-10 23:30:44,606 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=517206.6666666667, ans=0.2
2023-10-10 23:30:45,522 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=517253.3333333333, ans=0.0
2023-10-10 23:31:00,380 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.82 vs. limit=15.0
2023-10-10 23:31:10,703 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.70 vs. limit=22.5
2023-10-10 23:31:18,322 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.39 vs. limit=6.0
2023-10-10 23:31:22,276 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=517393.3333333333, ans=0.1
2023-10-10 23:32:03,023 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=517533.3333333333, ans=0.125
2023-10-10 23:32:05,367 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.408e+02 1.647e+02 1.801e+02 2.106e+02 2.923e+02, threshold=3.602e+02, percent-clipped=0.0
2023-10-10 23:32:28,660 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.70 vs. limit=15.0
2023-10-10 23:32:29,335 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=517626.6666666667, ans=0.125
2023-10-10 23:32:54,608 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=517766.6666666667, ans=0.125
2023-10-10 23:33:09,735 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.38 vs. limit=15.0
2023-10-10 23:33:17,117 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.19 vs. limit=15.0
2023-10-10 23:33:34,019 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=517906.6666666667, ans=0.125
2023-10-10 23:33:46,707 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=517953.3333333333, ans=0.125
2023-10-10 23:33:47,041 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.99 vs. limit=22.5
2023-10-10 23:33:52,482 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.441e+02 1.742e+02 1.920e+02 2.128e+02 3.345e+02, threshold=3.840e+02, percent-clipped=0.0
2023-10-10 23:34:02,645 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=518046.6666666667, ans=0.0
2023-10-10 23:34:09,718 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=518046.6666666667, ans=0.0
2023-10-10 23:34:12,653 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=518093.3333333333, ans=0.0
2023-10-10 23:34:14,353 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=518093.3333333333, ans=10.0
2023-10-10 23:34:22,551 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=518093.3333333333, ans=0.125
2023-10-10 23:34:23,802 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=518093.3333333333, ans=0.125
2023-10-10 23:34:43,034 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=518186.6666666667, ans=0.07
2023-10-10 23:34:47,605 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=518233.3333333333, ans=0.2
2023-10-10 23:34:53,103 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=5.01 vs. limit=12.0
2023-10-10 23:34:53,844 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=518233.3333333333, ans=0.0
2023-10-10 23:34:59,258 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=518280.0, ans=0.2
2023-10-10 23:35:10,669 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=518326.6666666667, ans=0.125
2023-10-10 23:35:11,593 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.31 vs. limit=15.0
2023-10-10 23:35:27,014 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=518373.3333333333, ans=0.125
2023-10-10 23:35:46,849 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.349e+02 1.702e+02 1.886e+02 2.106e+02 2.813e+02, threshold=3.771e+02, percent-clipped=0.0
2023-10-10 23:35:50,713 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=518466.6666666667, ans=0.125
2023-10-10 23:35:54,023 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.39 vs. limit=12.0
2023-10-10 23:36:07,818 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=518560.0, ans=0.0
2023-10-10 23:36:07,827 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=518560.0, ans=0.125
2023-10-10 23:36:08,065 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.33 vs. limit=15.0
2023-10-10 23:36:11,475 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=518560.0, ans=0.0
2023-10-10 23:36:17,873 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=518606.6666666667, ans=0.0
2023-10-10 23:36:23,551 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=518606.6666666667, ans=0.125
2023-10-10 23:36:41,269 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=518700.0, ans=0.0
2023-10-10 23:36:43,011 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=518700.0, ans=0.125
2023-10-10 23:36:55,476 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.97 vs. limit=6.0
2023-10-10 23:37:04,485 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=518793.3333333333, ans=15.0
2023-10-10 23:37:23,884 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.36 vs. limit=15.0
2023-10-10 23:37:26,409 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=518840.0, ans=0.125
2023-10-10 23:37:35,203 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=518886.6666666667, ans=0.1
2023-10-10 23:37:49,187 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.410e+02 1.716e+02 1.927e+02 2.183e+02 2.849e+02, threshold=3.853e+02, percent-clipped=0.0
2023-10-10 23:38:06,813 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=518980.0, ans=0.1
2023-10-10 23:38:13,096 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=519026.6666666667, ans=0.125
2023-10-10 23:38:33,648 INFO [train.py:1031] (3/4) Epoch 9, batch 2000, loss[loss=0.2106, simple_loss=0.3088, pruned_loss=0.05624, over 16898.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2979, pruned_loss=0.06186, over 20744028.78 frames. ], batch size: 165, lr: 4.11e-03, grad_scale: 32.0
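
The tot_loss[...] entries are reported "over N frames", and N climbs through the epoch (from 16726.00 frames at batch 0 to 20744028.78 by batch 2000 above), so tot_loss reads as a frame-weighted aggregate of recent batch losses rather than a plain mean of the last few batches. Below is a minimal sketch of such an aggregate; the decay factor is an assumption, and train.py's actual bookkeeping may differ.

    # Frame-weighted running aggregate, one plausible reading of the
    # tot_loss[... over N frames] entries above; a sketch, not train.py's code.
    class RunningLoss:
        def __init__(self, decay: float = 0.999):
            self.decay = decay      # older statistics are slowly forgotten
            self.loss_sum = 0.0
            self.frames = 0.0

        def update(self, batch_loss: float, batch_frames: int) -> None:
            self.loss_sum = self.decay * self.loss_sum + batch_loss * batch_frames
            self.frames = self.decay * self.frames + batch_frames

        @property
        def tot_loss(self) -> float:
            return self.loss_sum / max(self.frames, 1.0)
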
2023-10-10 23:38:36,697 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=519120.0, ans=0.0
2023-10-10 23:38:36,788 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=519120.0, ans=0.2
2023-10-10 23:39:22,355 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=519260.0, ans=0.125
2023-10-10 23:39:34,916 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=519306.6666666667, ans=0.1
2023-10-10 23:39:43,598 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.71 vs. limit=15.0
2023-10-10 23:39:47,263 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten.whitening_limit, batch_count=519353.3333333333, ans=15.0
2023-10-10 23:39:55,658 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.338e+02 1.608e+02 1.829e+02 2.171e+02 3.295e+02, threshold=3.657e+02, percent-clipped=0.0
2023-10-10 23:40:27,733 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=519540.0, ans=0.1
2023-10-10 23:40:27,766 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=519540.0, ans=0.125
2023-10-10 23:40:33,359 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=519540.0, ans=0.09899494936611666
2023-10-10 23:40:38,280 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=519540.0, ans=0.1
2023-10-10 23:41:11,192 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-10 23:41:32,690 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=519726.6666666667, ans=0.125
2023-10-10 23:42:16,457 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.307e+02 1.680e+02 1.845e+02 2.077e+02 3.271e+02, threshold=3.689e+02, percent-clipped=0.0
2023-10-10 23:42:30,306 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.39 vs. limit=6.0
2023-10-10 23:43:01,842 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=520053.3333333333, ans=0.1
2023-10-10 23:43:07,785 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=520053.3333333333, ans=0.05
2023-10-10 23:43:14,069 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=520100.0, ans=15.0
2023-10-10 23:43:25,216 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.03 vs. limit=15.0
2023-10-10 23:43:28,708 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=520146.6666666667, ans=0.0
2023-10-10 23:43:31,036 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=520193.3333333333, ans=0.2
2023-10-10 23:43:42,106 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=520240.0, ans=0.0
2023-10-10 23:43:43,835 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=520240.0, ans=0.0
2023-10-10 23:43:43,905 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=520240.0, ans=0.0
2023-10-10 23:43:47,297 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.65 vs. limit=10.0
2023-10-10 23:43:58,443 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=520286.6666666667, ans=0.2
2023-10-10 23:44:09,231 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.330e+02 1.743e+02 2.004e+02 2.204e+02 2.872e+02, threshold=4.009e+02, percent-clipped=0.0
2023-10-10 23:44:14,685 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=520380.0, ans=0.125
2023-10-10 23:44:21,536 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=520380.0, ans=0.125
2023-10-10 23:44:23,396 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=520380.0, ans=0.2
2023-10-10 23:44:46,567 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.32 vs. limit=15.0
2023-10-10 23:45:07,161 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=520613.3333333333, ans=0.2
2023-10-10 23:45:13,415 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.77 vs. limit=10.0
2023-10-10 23:45:24,003 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=520660.0, ans=0.05
2023-10-10 23:45:34,236 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.29 vs. limit=22.5
2023-10-10 23:45:46,702 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=520753.3333333333, ans=0.1
2023-10-10 23:45:46,803 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=520753.3333333333, ans=0.0
2023-10-10 23:45:58,889 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.382e+02 1.703e+02 1.875e+02 2.055e+02 3.041e+02, threshold=3.750e+02, percent-clipped=0.0
2023-10-10 23:46:12,765 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=520846.6666666667, ans=0.125
2023-10-10 23:46:30,519 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.max_abs, batch_count=520940.0, ans=10.0
2023-10-10 23:46:52,054 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=521033.3333333333, ans=0.125
2023-10-10 23:47:13,570 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=521126.6666666667, ans=0.1
2023-10-10 23:47:15,744 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=521126.6666666667, ans=0.125
2023-10-10 23:47:21,970 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=6.72 vs. limit=15.0
2023-10-10 23:47:30,192 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=521220.0, ans=0.125
2023-10-10 23:47:43,257 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=521266.6666666667, ans=0.2
2023-10-10 23:47:46,188 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.419e+02 1.737e+02 1.986e+02 2.280e+02 3.860e+02, threshold=3.971e+02, percent-clipped=1.0
2023-10-10 23:47:52,069 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=521313.3333333333, ans=0.1
2023-10-10 23:48:00,462 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=521313.3333333333, ans=0.1
2023-10-10 23:48:05,519 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.74 vs. limit=12.0
2023-10-10 23:48:24,383 INFO [train.py:1031] (3/4) Epoch 9, batch 2500, loss[loss=0.1985, simple_loss=0.257, pruned_loss=0.07003, over 12473.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2979, pruned_loss=0.06182, over 23402819.83 frames. ], batch size: 440, lr: 4.10e-03, grad_scale: 32.0
2023-10-10 23:48:39,758 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.84 vs. limit=15.0
2023-10-10 23:48:46,886 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.36 vs. limit=15.0
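
Each train.py progress line above also reports grad_scale: 32.0, which is most plausibly the loss scale of PyTorch's automatic-mixed-precision GradScaler under fp16 training; the scale grows and shrinks over time as overflows occur. A standard usage sketch follows (generic AMP boilerplate with an illustrative toy model, not train.py's actual code):

    # Generic torch.cuda.amp pattern whose loss scale would be logged
    # like "grad_scale: 32.0"; the model/shapes here are illustrative.
    import torch

    model = torch.nn.Linear(80, 500).cuda()
    optimizer = torch.optim.AdamW(model.parameters(), lr=4.10e-03)
    scaler = torch.cuda.amp.GradScaler(init_scale=32.0)

    features = torch.randn(8, 80, device="cuda")
    targets = torch.randint(0, 500, (8,), device="cuda")

    with torch.cuda.amp.autocast():
        loss = torch.nn.functional.cross_entropy(model(features), targets)

    scaler.scale(loss).backward()   # backprop on the scaled loss
    scaler.step(optimizer)          # unscales grads; skips the step on inf/nan
    scaler.update()                 # adjusts the scale for the next batch
    print(scaler.get_scale())       # the value reported as grad_scale
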
limit=15.0 2023-10-10 23:49:33,215 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.515e+02 1.804e+02 2.002e+02 2.243e+02 3.113e+02, threshold=4.004e+02, percent-clipped=0.0 2023-10-10 23:49:40,760 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=521780.0, ans=0.125 2023-10-10 23:49:46,896 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=521780.0, ans=0.0 2023-10-10 23:50:32,977 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=521966.6666666667, ans=0.125 2023-10-10 23:50:40,458 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.18 vs. limit=15.0 2023-10-10 23:50:41,144 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=522013.3333333333, ans=0.125 2023-10-10 23:50:41,183 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=522013.3333333333, ans=0.125 2023-10-10 23:50:48,999 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=522060.0, ans=0.125 2023-10-10 23:51:02,355 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.57 vs. limit=5.0 2023-10-10 23:51:05,258 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=522106.6666666667, ans=0.125 2023-10-10 23:51:24,434 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=522200.0, ans=0.125 2023-10-10 23:51:25,477 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=522200.0, ans=0.0 2023-10-10 23:51:26,080 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.387e+02 1.689e+02 1.947e+02 2.156e+02 3.168e+02, threshold=3.893e+02, percent-clipped=0.0 2023-10-10 23:51:33,332 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=522246.6666666667, ans=0.125 2023-10-10 23:51:49,251 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 23:52:13,882 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=522386.6666666667, ans=0.125 2023-10-10 23:52:32,099 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=522480.0, ans=0.0 2023-10-10 23:52:37,212 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=522480.0, ans=0.125 2023-10-10 23:52:45,864 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.90 vs. 
limit=10.0 2023-10-10 23:52:55,518 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=522573.3333333333, ans=0.125 2023-10-10 23:53:30,439 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.460e+02 1.762e+02 2.055e+02 2.342e+02 3.249e+02, threshold=4.109e+02, percent-clipped=0.0 2023-10-10 23:53:34,861 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=522666.6666666667, ans=0.2 2023-10-10 23:53:53,078 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=522760.0, ans=0.125 2023-10-10 23:54:13,383 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.91 vs. limit=15.0 2023-10-10 23:55:00,222 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.54 vs. limit=6.0 2023-10-10 23:55:11,802 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.88 vs. limit=15.0 2023-10-10 23:55:13,394 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=523086.6666666667, ans=0.2 2023-10-10 23:55:14,350 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=523086.6666666667, ans=0.5 2023-10-10 23:55:32,379 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.316e+02 1.661e+02 1.839e+02 2.137e+02 2.865e+02, threshold=3.678e+02, percent-clipped=0.0 2023-10-10 23:55:38,505 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=523180.0, ans=0.1 2023-10-10 23:55:40,918 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=523180.0, ans=0.125 2023-10-10 23:55:51,203 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=523226.6666666667, ans=0.125 2023-10-10 23:55:51,228 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=523226.6666666667, ans=0.1 2023-10-10 23:56:20,460 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=523320.0, ans=0.125 2023-10-10 23:56:31,164 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-10 23:57:08,652 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=523506.6666666667, ans=0.125 2023-10-10 23:57:12,600 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=523553.3333333333, ans=0.2 2023-10-10 23:57:14,372 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.05 vs. 
limit=22.5 2023-10-10 23:57:26,038 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=523600.0, ans=0.125 2023-10-10 23:57:30,684 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.397e+02 1.743e+02 2.036e+02 2.306e+02 2.961e+02, threshold=4.073e+02, percent-clipped=0.0 2023-10-10 23:57:46,874 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.43 vs. limit=10.0 2023-10-10 23:57:51,292 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=523693.3333333333, ans=0.2 2023-10-10 23:57:58,787 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=523740.0, ans=0.035 2023-10-10 23:58:08,604 INFO [train.py:1031] (3/4) Epoch 9, batch 3000, loss[loss=0.219, simple_loss=0.3025, pruned_loss=0.0678, over 16851.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2969, pruned_loss=0.06153, over 25495705.29 frames. ], batch size: 188, lr: 4.09e-03, grad_scale: 16.0 2023-10-10 23:58:26,611 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=523833.3333333333, ans=0.2 2023-10-10 23:58:46,296 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=523926.6666666667, ans=0.025 2023-10-10 23:58:54,748 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=523973.3333333333, ans=0.0 2023-10-10 23:59:09,122 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=524020.0, ans=0.04949747468305833 2023-10-10 23:59:24,080 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.368e+02 1.692e+02 1.846e+02 2.057e+02 2.838e+02, threshold=3.692e+02, percent-clipped=0.0 2023-10-10 23:59:29,098 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=524113.3333333333, ans=0.0 2023-10-10 23:59:35,560 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=524113.3333333333, ans=0.0 2023-10-10 23:59:48,316 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=524160.0, ans=0.1 2023-10-10 23:59:59,508 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=524206.6666666667, ans=0.1 2023-10-11 00:00:09,541 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=524253.3333333333, ans=0.0 2023-10-11 00:00:09,546 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=524253.3333333333, ans=0.125 2023-10-11 00:00:09,588 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=524253.3333333333, ans=0.0 2023-10-11 00:00:37,267 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=524346.6666666666, ans=0.0 2023-10-11 00:00:46,831 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=524393.3333333334, ans=0.5 2023-10-11 00:00:46,875 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer_ff3.min_abs, batch_count=524393.3333333334, ans=0.2 2023-10-11 00:00:55,253 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.81 vs. limit=22.5 2023-10-11 00:01:05,334 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.98 vs. limit=15.0 2023-10-11 00:01:12,636 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=524533.3333333334, ans=0.125 2023-10-11 00:01:12,760 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.05 vs. limit=15.0 2023-10-11 00:01:18,505 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.387e+02 1.648e+02 1.996e+02 2.530e+02 3.660e+02, threshold=3.992e+02, percent-clipped=0.0 2023-10-11 00:01:19,812 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=524533.3333333334, ans=0.1 2023-10-11 00:01:21,134 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.22 vs. limit=6.0 2023-10-11 00:01:29,893 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.36 vs. limit=22.5 2023-10-11 00:01:31,810 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=524580.0, ans=0.125 2023-10-11 00:01:35,193 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=524626.6666666666, ans=0.125 2023-10-11 00:01:35,346 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=524626.6666666666, ans=0.125 2023-10-11 00:01:39,829 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=524626.6666666666, ans=0.1 2023-10-11 00:01:41,608 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=524626.6666666666, ans=0.125 2023-10-11 00:02:43,714 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=524860.0, ans=0.125 2023-10-11 00:03:01,482 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=524953.3333333334, ans=0.125 2023-10-11 00:03:23,545 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.341e+02 1.688e+02 1.937e+02 2.215e+02 3.236e+02, threshold=3.875e+02, percent-clipped=0.0 2023-10-11 00:03:29,501 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=525046.6666666666, ans=0.0 2023-10-11 00:04:04,366 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=525186.6666666666, ans=0.0 2023-10-11 00:04:05,227 INFO [scaling.py:199] (3/4) 
ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=525186.6666666666, ans=0.125 2023-10-11 00:04:17,251 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.99 vs. limit=22.5 2023-10-11 00:04:18,750 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=525233.3333333334, ans=0.2 2023-10-11 00:04:18,777 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=525233.3333333334, ans=0.2 2023-10-11 00:04:18,787 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=525233.3333333334, ans=0.125 2023-10-11 00:04:40,505 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.76 vs. limit=15.0 2023-10-11 00:05:15,470 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.395e+02 1.686e+02 1.881e+02 2.242e+02 3.098e+02, threshold=3.762e+02, percent-clipped=0.0 2023-10-11 00:05:24,716 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.11 vs. limit=22.5 2023-10-11 00:05:45,720 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=525606.6666666666, ans=0.07 2023-10-11 00:06:05,121 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=525653.3333333334, ans=0.2 2023-10-11 00:06:14,404 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=525700.0, ans=0.125 2023-10-11 00:06:14,502 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=525700.0, ans=0.125 2023-10-11 00:06:30,590 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=525793.3333333334, ans=0.125 2023-10-11 00:06:36,819 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=525793.3333333334, ans=0.125 2023-10-11 00:06:40,787 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.74 vs. limit=15.0 2023-10-11 00:06:46,358 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=525840.0, ans=0.125 2023-10-11 00:06:50,801 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.27 vs. 
limit=6.0 2023-10-11 00:06:51,432 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=525886.6666666666, ans=0.125 2023-10-11 00:06:59,834 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=525933.3333333334, ans=0.125 2023-10-11 00:07:07,663 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.457e+02 1.717e+02 1.863e+02 2.032e+02 2.776e+02, threshold=3.726e+02, percent-clipped=0.0 2023-10-11 00:07:09,899 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=525980.0, ans=0.125 2023-10-11 00:07:18,170 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=525980.0, ans=0.125 2023-10-11 00:07:31,234 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=526026.6666666666, ans=0.125 2023-10-11 00:07:34,493 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=526073.3333333334, ans=0.09899494936611666 2023-10-11 00:07:36,343 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 00:07:39,526 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=526073.3333333334, ans=0.07 2023-10-11 00:07:44,396 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=526120.0, ans=0.1 2023-10-11 00:07:45,739 INFO [train.py:1031] (3/4) Epoch 9, batch 3500, loss[loss=0.1956, simple_loss=0.2797, pruned_loss=0.05571, over 16359.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2971, pruned_loss=0.06174, over 27123625.05 frames. ], batch size: 50, lr: 4.08e-03, grad_scale: 16.0 2023-10-11 00:07:46,054 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=526120.0, ans=0.125 2023-10-11 00:08:30,186 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=526306.6666666666, ans=0.125 2023-10-11 00:09:02,446 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.391e+02 1.776e+02 1.902e+02 2.199e+02 3.450e+02, threshold=3.804e+02, percent-clipped=0.0 2023-10-11 00:09:25,510 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.77 vs. limit=12.0 2023-10-11 00:09:27,246 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.13 vs. 
limit=22.5 2023-10-11 00:10:16,327 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=526680.0, ans=0.1 2023-10-11 00:10:32,175 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=526773.3333333334, ans=0.0 2023-10-11 00:10:41,906 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=526773.3333333334, ans=0.5 2023-10-11 00:10:41,914 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=526773.3333333334, ans=0.0 2023-10-11 00:10:45,441 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=526820.0, ans=0.2 2023-10-11 00:10:51,922 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=526820.0, ans=0.125 2023-10-11 00:11:02,392 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.354e+02 1.679e+02 1.934e+02 2.270e+02 3.371e+02, threshold=3.869e+02, percent-clipped=0.0 2023-10-11 00:11:03,494 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=526866.6666666666, ans=0.125 2023-10-11 00:11:06,294 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=526913.3333333334, ans=0.2 2023-10-11 00:11:06,423 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.46 vs. limit=15.0 2023-10-11 00:11:19,837 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 00:11:56,862 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.44 vs. limit=22.5 2023-10-11 00:12:13,618 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.84 vs. limit=10.0 2023-10-11 00:12:33,592 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=527240.0, ans=0.0 2023-10-11 00:12:44,322 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.61 vs. 
limit=22.5 2023-10-11 00:12:46,186 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=527286.6666666666, ans=0.1 2023-10-11 00:13:04,569 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.327e+02 1.649e+02 1.778e+02 2.034e+02 3.027e+02, threshold=3.556e+02, percent-clipped=0.0 2023-10-11 00:13:27,264 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=527426.6666666666, ans=0.1 2023-10-11 00:13:43,182 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=527520.0, ans=0.0 2023-10-11 00:13:43,287 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=527520.0, ans=0.125 2023-10-11 00:13:47,976 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=527520.0, ans=0.125 2023-10-11 00:13:50,561 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=527520.0, ans=0.0 2023-10-11 00:13:50,709 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.81 vs. limit=12.0 2023-10-11 00:13:51,644 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.71 vs. limit=22.5 2023-10-11 00:14:05,303 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.66 vs. limit=22.5 2023-10-11 00:14:11,068 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=527613.3333333334, ans=0.0 2023-10-11 00:14:34,129 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.15 vs. limit=15.0 2023-10-11 00:14:38,282 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=527706.6666666666, ans=0.125 2023-10-11 00:14:59,495 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.260e+02 1.634e+02 1.812e+02 2.083e+02 2.773e+02, threshold=3.624e+02, percent-clipped=0.0 2023-10-11 00:15:02,480 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=527846.6666666666, ans=0.125 2023-10-11 00:15:22,683 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=527893.3333333334, ans=0.1 2023-10-11 00:15:26,588 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=527940.0, ans=0.0 2023-10-11 00:15:26,877 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.60 vs. 
limit=15.0 2023-10-11 00:15:54,699 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=528033.3333333334, ans=0.0 2023-10-11 00:15:59,502 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=528080.0, ans=0.0 2023-10-11 00:16:24,328 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.39 vs. limit=10.0 2023-10-11 00:16:25,241 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.34 vs. limit=15.0 2023-10-11 00:16:26,875 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=528173.3333333334, ans=0.125 2023-10-11 00:16:30,335 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=528220.0, ans=0.0 2023-10-11 00:16:35,671 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=528220.0, ans=0.125 2023-10-11 00:16:45,096 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=528266.6666666666, ans=0.125 2023-10-11 00:16:50,309 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.186e+02 1.645e+02 1.994e+02 2.219e+02 3.079e+02, threshold=3.988e+02, percent-clipped=0.0 2023-10-11 00:17:04,618 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=528360.0, ans=0.125 2023-10-11 00:17:15,027 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=528406.6666666666, ans=0.125 2023-10-11 00:17:17,771 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.87 vs. limit=15.0 2023-10-11 00:17:20,273 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=528406.6666666666, ans=0.125 2023-10-11 00:17:26,152 INFO [train.py:1031] (3/4) Epoch 9, batch 4000, loss[loss=0.2113, simple_loss=0.2981, pruned_loss=0.06223, over 16586.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2966, pruned_loss=0.06171, over 28368473.48 frames. 
], batch size: 66, lr: 4.07e-03, grad_scale: 32.0 2023-10-11 00:17:32,604 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=528453.3333333334, ans=0.2 2023-10-11 00:17:36,727 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=528453.3333333334, ans=0.125 2023-10-11 00:17:38,109 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=528453.3333333334, ans=0.0 2023-10-11 00:17:43,077 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=528500.0, ans=0.0 2023-10-11 00:17:48,116 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=528500.0, ans=0.125 2023-10-11 00:18:00,243 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=528546.6666666666, ans=0.125 2023-10-11 00:18:00,290 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=528546.6666666666, ans=0.125 2023-10-11 00:18:04,122 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=528593.3333333334, ans=0.125 2023-10-11 00:18:06,401 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.77 vs. limit=15.0 2023-10-11 00:18:10,513 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=528593.3333333334, ans=0.125 2023-10-11 00:18:18,958 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=528640.0, ans=0.125 2023-10-11 00:18:34,226 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=528686.6666666666, ans=0.1 2023-10-11 00:18:39,726 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=528733.3333333334, ans=10.0 2023-10-11 00:18:40,049 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.89 vs. 
limit=15.0 2023-10-11 00:18:44,911 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.243e+02 1.771e+02 1.999e+02 2.294e+02 3.223e+02, threshold=3.998e+02, percent-clipped=0.0 2023-10-11 00:18:51,058 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=528780.0, ans=0.1 2023-10-11 00:18:52,938 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=528780.0, ans=0.125 2023-10-11 00:19:03,495 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=528826.6666666666, ans=0.125 2023-10-11 00:19:09,141 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=528873.3333333334, ans=0.125 2023-10-11 00:19:12,925 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=528873.3333333334, ans=0.125 2023-10-11 00:19:16,123 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=528873.3333333334, ans=0.125 2023-10-11 00:19:18,860 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.38 vs. limit=12.0 2023-10-11 00:19:39,026 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=528966.6666666666, ans=0.0 2023-10-11 00:19:52,224 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=529013.3333333334, ans=0.125 2023-10-11 00:20:09,413 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=529106.6666666666, ans=0.1 2023-10-11 00:20:16,505 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.19 vs. limit=15.0 2023-10-11 00:20:23,833 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=529153.3333333334, ans=0.0 2023-10-11 00:20:28,280 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=529153.3333333334, ans=0.125 2023-10-11 00:20:40,931 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.422e+02 1.725e+02 1.850e+02 2.121e+02 3.418e+02, threshold=3.700e+02, percent-clipped=0.0 2023-10-11 00:20:59,615 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=529246.6666666666, ans=0.2 2023-10-11 00:21:21,905 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=529340.0, ans=0.1 2023-10-11 00:21:39,851 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=529433.3333333334, ans=0.125 2023-10-11 00:21:45,882 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=529433.3333333334, ans=0.125 2023-10-11 00:21:49,242 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.44 vs. 
limit=15.0 2023-10-11 00:21:58,203 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=529480.0, ans=0.0 2023-10-11 00:22:41,652 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.378e+02 1.711e+02 1.939e+02 2.267e+02 3.169e+02, threshold=3.878e+02, percent-clipped=0.0 2023-10-11 00:22:42,291 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.32 vs. limit=6.0 2023-10-11 00:22:43,911 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=529713.3333333334, ans=0.0 2023-10-11 00:22:50,221 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=7.45 vs. limit=15.0 2023-10-11 00:22:50,267 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.90 vs. limit=15.0 2023-10-11 00:22:57,817 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=529760.0, ans=0.1 2023-10-11 00:23:12,999 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.90 vs. limit=22.5 2023-10-11 00:23:19,570 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.42 vs. limit=15.0 2023-10-11 00:23:20,232 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=529853.3333333334, ans=0.125 2023-10-11 00:23:23,551 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=529853.3333333334, ans=0.2 2023-10-11 00:23:34,219 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.65 vs. limit=10.0 2023-10-11 00:23:43,145 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.43 vs. 
limit=15.0 2023-10-11 00:23:54,717 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=529993.3333333334, ans=0.125 2023-10-11 00:24:34,896 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.376e+02 1.713e+02 1.889e+02 2.110e+02 2.902e+02, threshold=3.778e+02, percent-clipped=0.0 2023-10-11 00:24:45,072 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=530180.0, ans=0.0 2023-10-11 00:24:48,671 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=530180.0, ans=0.125 2023-10-11 00:25:23,702 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=530366.6666666666, ans=0.125 2023-10-11 00:25:25,716 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=530366.6666666666, ans=0.1 2023-10-11 00:25:53,999 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=530460.0, ans=0.1 2023-10-11 00:25:57,375 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 00:26:15,249 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=530506.6666666666, ans=0.1 2023-10-11 00:26:22,822 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=530553.3333333334, ans=0.125 2023-10-11 00:26:34,685 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=530600.0, ans=0.0 2023-10-11 00:26:36,948 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.458e+02 1.735e+02 1.973e+02 2.241e+02 3.564e+02, threshold=3.946e+02, percent-clipped=0.0 2023-10-11 00:27:11,150 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.44 vs. limit=22.5 2023-10-11 00:27:15,066 INFO [train.py:1031] (3/4) Epoch 9, batch 4500, loss[loss=0.193, simple_loss=0.2858, pruned_loss=0.05009, over 16835.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.297, pruned_loss=0.06161, over 29353174.16 frames. ], batch size: 146, lr: 4.07e-03, grad_scale: 32.0 2023-10-11 00:27:30,773 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.84 vs. limit=10.0 2023-10-11 00:27:32,838 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 00:27:40,640 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.14 vs. limit=15.0 2023-10-11 00:27:42,478 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.99 vs. 
limit=6.0 2023-10-11 00:27:44,508 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=530880.0, ans=0.125 2023-10-11 00:27:52,975 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=530926.6666666666, ans=0.125 2023-10-11 00:27:53,315 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.68 vs. limit=6.0 2023-10-11 00:27:55,022 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=530926.6666666666, ans=0.125 2023-10-11 00:28:13,763 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=531020.0, ans=0.2 2023-10-11 00:28:18,613 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=531020.0, ans=0.0 2023-10-11 00:28:25,984 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=531066.6666666666, ans=0.04949747468305833 2023-10-11 00:28:26,742 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.336e+02 1.636e+02 1.778e+02 2.064e+02 2.984e+02, threshold=3.556e+02, percent-clipped=0.0 2023-10-11 00:28:54,931 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=531206.6666666666, ans=0.2 2023-10-11 00:28:58,261 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=531206.6666666666, ans=0.1 2023-10-11 00:29:06,038 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.33 vs. limit=22.5 2023-10-11 00:29:36,085 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=531393.3333333334, ans=0.0 2023-10-11 00:29:45,480 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=531440.0, ans=0.0 2023-10-11 00:29:52,543 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=531440.0, ans=0.125 2023-10-11 00:30:07,414 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.00 vs. 
limit=15.0 2023-10-11 00:30:10,441 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=531533.3333333334, ans=0.2 2023-10-11 00:30:13,875 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.471e+02 1.753e+02 1.976e+02 2.319e+02 3.156e+02, threshold=3.952e+02, percent-clipped=0.0 2023-10-11 00:30:14,195 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=531533.3333333334, ans=0.2 2023-10-11 00:30:29,691 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=531626.6666666666, ans=0.125 2023-10-11 00:30:30,621 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=531626.6666666666, ans=0.125 2023-10-11 00:30:37,213 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=531626.6666666666, ans=0.125 2023-10-11 00:31:15,078 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=531813.3333333334, ans=0.1 2023-10-11 00:31:25,671 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 00:31:34,854 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=531906.6666666666, ans=0.0 2023-10-11 00:31:47,919 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=531953.3333333334, ans=10.0 2023-10-11 00:31:57,964 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=532000.0, ans=0.0 2023-10-11 00:32:01,936 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.354e+02 1.670e+02 1.842e+02 2.034e+02 4.194e+02, threshold=3.684e+02, percent-clipped=1.0 2023-10-11 00:32:12,308 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=532046.6666666666, ans=10.0 2023-10-11 00:32:13,393 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 00:32:28,099 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=532140.0, ans=0.2 2023-10-11 00:32:28,988 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=532140.0, ans=0.125 2023-10-11 00:32:32,668 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff3.min_abs, batch_count=532140.0, ans=0.2 2023-10-11 00:32:44,573 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=532186.6666666666, ans=0.125 2023-10-11 00:32:48,115 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.55 vs. limit=10.0 2023-10-11 00:32:57,502 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.84 vs. 
limit=12.0 2023-10-11 00:33:02,792 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=532280.0, ans=0.125 2023-10-11 00:33:02,902 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=532280.0, ans=0.0 2023-10-11 00:33:12,215 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=532326.6666666666, ans=0.0 2023-10-11 00:33:24,952 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=532373.3333333334, ans=0.0 2023-10-11 00:33:39,024 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=532420.0, ans=0.1 2023-10-11 00:33:49,251 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=532466.6666666666, ans=0.125 2023-10-11 00:33:56,003 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.368e+02 1.647e+02 1.780e+02 2.048e+02 3.069e+02, threshold=3.560e+02, percent-clipped=0.0 2023-10-11 00:34:01,592 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=532513.3333333334, ans=0.125 2023-10-11 00:34:03,645 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=532513.3333333334, ans=0.07 2023-10-11 00:34:10,823 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=532560.0, ans=0.0 2023-10-11 00:34:14,538 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=532560.0, ans=0.125 2023-10-11 00:34:17,111 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=532560.0, ans=0.1 2023-10-11 00:34:36,935 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=532653.3333333334, ans=0.125 2023-10-11 00:34:48,716 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=532700.0, ans=0.05 2023-10-11 00:34:50,637 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=532700.0, ans=0.125 2023-10-11 00:35:04,577 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=532746.6666666666, ans=0.0 2023-10-11 00:35:14,397 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.48 vs. 
limit=22.5 2023-10-11 00:35:19,317 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=532793.3333333334, ans=0.125 2023-10-11 00:35:52,295 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.371e+02 1.734e+02 1.925e+02 2.248e+02 3.988e+02, threshold=3.851e+02, percent-clipped=1.0 2023-10-11 00:35:54,896 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=532980.0, ans=0.2 2023-10-11 00:36:06,509 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=533026.6666666666, ans=0.2 2023-10-11 00:36:25,682 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=533073.3333333334, ans=0.125 2023-10-11 00:36:27,546 INFO [train.py:1031] (3/4) Epoch 9, batch 5000, loss[loss=0.2089, simple_loss=0.2934, pruned_loss=0.06222, over 16837.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2967, pruned_loss=0.06178, over 30095041.74 frames. ], batch size: 87, lr: 4.06e-03, grad_scale: 32.0 2023-10-11 00:36:59,220 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=533213.3333333334, ans=0.125 2023-10-11 00:36:59,514 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.42 vs. limit=15.0 2023-10-11 00:37:01,679 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=533260.0, ans=0.2 2023-10-11 00:37:07,467 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=533260.0, ans=0.0 2023-10-11 00:37:08,464 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=533260.0, ans=0.125 2023-10-11 00:37:16,894 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=533306.6666666666, ans=0.125 2023-10-11 00:37:42,010 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=533400.0, ans=0.1 2023-10-11 00:37:43,496 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.463e+02 1.751e+02 2.057e+02 2.441e+02 4.127e+02, threshold=4.115e+02, percent-clipped=2.0 2023-10-11 00:37:55,837 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=533493.3333333334, ans=0.5 2023-10-11 00:37:56,078 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.27 vs. limit=15.0 2023-10-11 00:38:23,962 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=533586.6666666666, ans=0.2 2023-10-11 00:38:28,776 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=533586.6666666666, ans=0.0 2023-10-11 00:38:36,941 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=14.23 vs. 
limit=22.5 2023-10-11 00:38:40,360 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=533633.3333333334, ans=0.125 2023-10-11 00:38:40,629 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.15 vs. limit=15.0 2023-10-11 00:38:43,297 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 00:38:43,399 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=533680.0, ans=0.125 2023-10-11 00:38:46,018 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=533680.0, ans=0.0 2023-10-11 00:38:48,886 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=533680.0, ans=0.125 2023-10-11 00:39:29,794 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=533866.6666666666, ans=0.05 2023-10-11 00:39:31,724 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=533866.6666666666, ans=0.125 2023-10-11 00:39:36,706 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.384e+02 1.690e+02 1.887e+02 2.069e+02 2.882e+02, threshold=3.773e+02, percent-clipped=0.0 2023-10-11 00:40:00,549 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.30 vs. limit=15.0 2023-10-11 00:40:02,946 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 00:40:13,852 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=534053.3333333334, ans=0.125 2023-10-11 00:40:16,673 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=534053.3333333334, ans=0.1 2023-10-11 00:40:19,432 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.min_positive, batch_count=534053.3333333334, ans=0.05 2023-10-11 00:40:21,505 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=534100.0, ans=0.125 2023-10-11 00:40:28,278 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=534100.0, ans=0.2 2023-10-11 00:40:34,151 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=534146.6666666666, ans=0.0 2023-10-11 00:40:37,850 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.11 vs. limit=15.0 2023-10-11 00:41:10,754 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=534286.6666666666, ans=0.1 2023-10-11 00:41:26,586 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.88 vs. 
limit=15.0 2023-10-11 00:41:30,855 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.339e+02 1.671e+02 1.836e+02 2.065e+02 2.968e+02, threshold=3.673e+02, percent-clipped=0.0 2023-10-11 00:41:32,668 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.74 vs. limit=15.0 2023-10-11 00:41:35,458 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=534380.0, ans=0.125 2023-10-11 00:41:39,295 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=534380.0, ans=0.025 2023-10-11 00:42:04,297 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=534520.0, ans=0.0 2023-10-11 00:42:07,428 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.14 vs. limit=15.0 2023-10-11 00:42:27,523 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.68 vs. limit=22.5 2023-10-11 00:42:38,081 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=534660.0, ans=0.125 2023-10-11 00:42:39,666 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.67 vs. limit=6.0 2023-10-11 00:42:55,770 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=534706.6666666666, ans=0.0 2023-10-11 00:43:24,324 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.383e+02 1.584e+02 1.697e+02 1.936e+02 2.768e+02, threshold=3.394e+02, percent-clipped=0.0 2023-10-11 00:43:24,664 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=534846.6666666666, ans=0.125 2023-10-11 00:43:26,770 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.43 vs. 
limit=12.0 2023-10-11 00:43:38,106 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=534893.3333333334, ans=0.2 2023-10-11 00:43:41,940 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=534893.3333333334, ans=0.125 2023-10-11 00:43:44,358 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=534893.3333333334, ans=0.0 2023-10-11 00:44:10,245 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=535033.3333333334, ans=0.0 2023-10-11 00:44:25,803 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=535080.0, ans=0.2 2023-10-11 00:44:48,059 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=535173.3333333334, ans=0.125 2023-10-11 00:44:50,533 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=535173.3333333334, ans=0.125 2023-10-11 00:44:55,788 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.41 vs. limit=15.0 2023-10-11 00:44:59,032 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=535220.0, ans=0.125 2023-10-11 00:45:03,313 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.80 vs. limit=10.0 2023-10-11 00:45:13,454 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=535266.6666666666, ans=0.1 2023-10-11 00:45:14,606 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.357e+02 1.679e+02 1.959e+02 2.239e+02 3.797e+02, threshold=3.917e+02, percent-clipped=2.0 2023-10-11 00:45:21,876 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=535313.3333333334, ans=0.125 2023-10-11 00:45:23,535 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=535313.3333333334, ans=0.125 2023-10-11 00:45:26,462 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=535360.0, ans=0.0 2023-10-11 00:45:38,551 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=535406.6666666666, ans=0.0 2023-10-11 00:45:47,544 INFO [train.py:1031] (3/4) Epoch 9, batch 5500, loss[loss=0.1958, simple_loss=0.2933, pruned_loss=0.04919, over 16818.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2963, pruned_loss=0.06146, over 30690877.88 frames. ], batch size: 175, lr: 4.05e-03, grad_scale: 16.0 2023-10-11 00:45:49,753 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=535453.3333333334, ans=0.1 2023-10-11 00:45:50,921 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.17 vs. 
limit=15.0 2023-10-11 00:45:56,453 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=535453.3333333334, ans=0.2 2023-10-11 00:46:02,565 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=535500.0, ans=0.125 2023-10-11 00:46:04,686 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.81 vs. limit=22.5 2023-10-11 00:46:17,950 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=535546.6666666666, ans=0.0 2023-10-11 00:46:17,962 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=535546.6666666666, ans=0.0 2023-10-11 00:46:20,097 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.21 vs. limit=15.0 2023-10-11 00:46:26,026 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=535593.3333333334, ans=0.95 2023-10-11 00:46:37,556 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=535640.0, ans=0.125 2023-10-11 00:46:38,870 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.62 vs. limit=15.0 2023-10-11 00:46:42,162 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff3.min_abs, batch_count=535686.6666666666, ans=0.2 2023-10-11 00:47:02,382 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=535733.3333333334, ans=0.125 2023-10-11 00:47:03,975 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.341e+02 1.583e+02 1.694e+02 1.878e+02 3.011e+02, threshold=3.388e+02, percent-clipped=0.0 2023-10-11 00:47:09,147 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=535780.0, ans=0.1 2023-10-11 00:47:20,159 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=535826.6666666666, ans=0.2 2023-10-11 00:47:21,526 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=535826.6666666666, ans=0.1 2023-10-11 00:47:50,134 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=535966.6666666666, ans=0.125 2023-10-11 00:48:02,991 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=536013.3333333334, ans=0.0 2023-10-11 00:48:14,851 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=536060.0, ans=0.125 2023-10-11 00:48:38,529 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=536153.3333333334, ans=0.0 2023-10-11 00:48:39,813 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=536153.3333333334, ans=0.125 2023-10-11 00:48:56,347 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm 
quartiles 1.419e+02 1.780e+02 1.998e+02 2.427e+02 3.170e+02, threshold=3.995e+02, percent-clipped=0.0 2023-10-11 00:49:02,560 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=536246.6666666666, ans=0.0 2023-10-11 00:49:09,723 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.09 vs. limit=15.0 2023-10-11 00:49:23,349 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=536340.0, ans=0.125 2023-10-11 00:49:23,387 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 00:49:34,590 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.82 vs. limit=15.0 2023-10-11 00:49:36,351 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=536386.6666666666, ans=0.125 2023-10-11 00:49:47,240 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=536433.3333333334, ans=0.125 2023-10-11 00:50:03,572 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=536480.0, ans=0.5 2023-10-11 00:50:11,804 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=536526.6666666666, ans=0.07 2023-10-11 00:50:17,142 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.65 vs. limit=15.0 2023-10-11 00:50:19,686 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.38 vs. limit=6.0 2023-10-11 00:50:50,404 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.441e+02 1.652e+02 1.816e+02 2.002e+02 2.861e+02, threshold=3.633e+02, percent-clipped=0.0 2023-10-11 00:50:50,712 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=536713.3333333334, ans=0.2 2023-10-11 00:50:50,769 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=536713.3333333334, ans=0.125 2023-10-11 00:51:03,709 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=536760.0, ans=0.0 2023-10-11 00:51:08,680 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.37 vs. limit=22.5 2023-10-11 00:51:18,542 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=536806.6666666666, ans=0.125 2023-10-11 00:51:23,172 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=536806.6666666666, ans=0.1 2023-10-11 00:51:45,238 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.75 vs. 
limit=15.0 2023-10-11 00:51:58,563 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 00:52:00,933 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=536993.3333333334, ans=0.0 2023-10-11 00:52:29,156 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.54 vs. limit=15.0 2023-10-11 00:52:29,788 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=537086.6666666666, ans=0.125 2023-10-11 00:52:47,255 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.352e+02 1.661e+02 1.841e+02 2.063e+02 2.970e+02, threshold=3.683e+02, percent-clipped=0.0 2023-10-11 00:52:55,233 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.47 vs. limit=15.0 2023-10-11 00:53:01,343 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=537226.6666666666, ans=0.1 2023-10-11 00:53:01,375 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=537226.6666666666, ans=0.0 2023-10-11 00:53:02,505 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.02 vs. limit=12.0 2023-10-11 00:53:16,492 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=537273.3333333334, ans=0.2 2023-10-11 00:53:20,892 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=537320.0, ans=0.125 2023-10-11 00:53:26,100 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=537320.0, ans=0.09899494936611666 2023-10-11 00:54:12,589 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=537506.6666666666, ans=0.125 2023-10-11 00:54:13,784 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.13 vs. limit=22.5 2023-10-11 00:54:21,141 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.31 vs. limit=15.0 2023-10-11 00:54:32,254 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.76 vs. limit=22.5 2023-10-11 00:54:39,825 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.398e+02 1.684e+02 1.834e+02 2.014e+02 3.437e+02, threshold=3.669e+02, percent-clipped=0.0 2023-10-11 00:54:55,705 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=537693.3333333334, ans=0.0 2023-10-11 00:54:55,770 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=537693.3333333334, ans=0.2 2023-10-11 00:54:58,279 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.29 vs. 
limit=12.0 2023-10-11 00:55:12,369 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=537740.0, ans=0.2 2023-10-11 00:55:15,377 INFO [train.py:1031] (3/4) Epoch 9, batch 6000, loss[loss=0.2182, simple_loss=0.2788, pruned_loss=0.07883, over 12323.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2967, pruned_loss=0.0617, over 31159613.40 frames. ], batch size: 440, lr: 4.04e-03, grad_scale: 32.0 2023-10-11 00:55:25,168 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.59 vs. limit=6.0 2023-10-11 00:55:25,860 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=537833.3333333334, ans=0.07 2023-10-11 00:56:35,135 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.327e+02 1.745e+02 1.985e+02 2.243e+02 3.608e+02, threshold=3.969e+02, percent-clipped=0.0 2023-10-11 00:56:39,543 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.27 vs. limit=12.0 2023-10-11 00:56:42,787 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=538113.3333333334, ans=0.1 2023-10-11 00:56:43,739 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=538160.0, ans=0.035 2023-10-11 00:56:44,825 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=538160.0, ans=0.0 2023-10-11 00:56:53,867 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.88 vs. limit=15.0 2023-10-11 00:57:31,704 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=538346.6666666666, ans=0.0 2023-10-11 00:57:37,810 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=538393.3333333334, ans=0.1 2023-10-11 00:57:41,191 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.16 vs. limit=15.0 2023-10-11 00:57:52,263 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.39 vs. limit=15.0 2023-10-11 00:58:08,526 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=538486.6666666666, ans=0.125 2023-10-11 00:58:20,278 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=538580.0, ans=0.025 2023-10-11 00:58:20,890 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.411e+02 1.637e+02 1.787e+02 2.010e+02 2.623e+02, threshold=3.574e+02, percent-clipped=0.0 2023-10-11 00:58:23,786 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=2.99 vs. 
limit=12.0 2023-10-11 00:58:38,898 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=538626.6666666666, ans=15.0 2023-10-11 00:59:11,224 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=538766.6666666666, ans=0.0 2023-10-11 00:59:17,424 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=538813.3333333334, ans=0.125 2023-10-11 00:59:24,483 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.85 vs. limit=15.0 2023-10-11 00:59:29,347 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.68 vs. limit=15.0 2023-10-11 00:59:49,456 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=538953.3333333334, ans=0.125 2023-10-11 00:59:53,361 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.85 vs. limit=22.5 2023-10-11 00:59:53,888 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=538953.3333333334, ans=0.125 2023-10-11 00:59:56,015 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten.whitening_limit, batch_count=538953.3333333334, ans=15.0 2023-10-11 01:00:11,778 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=539046.6666666666, ans=0.0 2023-10-11 01:00:12,299 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.472e+02 1.783e+02 1.870e+02 2.083e+02 2.738e+02, threshold=3.739e+02, percent-clipped=0.0 2023-10-11 01:00:30,627 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=539093.3333333334, ans=0.125 2023-10-11 01:00:32,802 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=539093.3333333334, ans=0.1 2023-10-11 01:00:45,122 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=539186.6666666666, ans=0.0 2023-10-11 01:00:46,127 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=539186.6666666666, ans=0.025 2023-10-11 01:00:54,050 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.11 vs. 
limit=15.0 2023-10-11 01:01:14,674 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 01:01:48,838 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=539420.0, ans=0.1 2023-10-11 01:01:48,928 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=539420.0, ans=0.125 2023-10-11 01:01:59,488 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=539466.6666666666, ans=0.125 2023-10-11 01:02:12,030 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.437e+02 1.711e+02 1.934e+02 2.190e+02 3.211e+02, threshold=3.868e+02, percent-clipped=0.0 2023-10-11 01:02:14,432 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=539513.3333333334, ans=0.125 2023-10-11 01:02:46,511 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=539606.6666666666, ans=0.1 2023-10-11 01:03:07,257 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=539700.0, ans=0.2 2023-10-11 01:03:14,340 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.33 vs. limit=15.0 2023-10-11 01:03:22,118 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 01:03:30,962 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=539840.0, ans=0.125 2023-10-11 01:03:35,007 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.03 vs. limit=15.0 2023-10-11 01:03:37,668 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=539840.0, ans=0.2 2023-10-11 01:03:38,758 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.08 vs. limit=10.0 2023-10-11 01:03:42,093 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=539886.6666666666, ans=0.05 2023-10-11 01:03:53,148 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.70 vs. limit=15.0 2023-10-11 01:03:56,440 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=539933.3333333334, ans=0.1 2023-10-11 01:04:03,407 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.293e+02 1.699e+02 1.867e+02 2.102e+02 2.991e+02, threshold=3.734e+02, percent-clipped=0.0 2023-10-11 01:04:08,294 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.15 vs. 
limit=15.0 2023-10-11 01:04:12,472 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=539980.0, ans=0.0 2023-10-11 01:04:20,388 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.36 vs. limit=15.0 2023-10-11 01:04:21,620 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=540026.6666666666, ans=0.2 2023-10-11 01:04:27,492 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=540073.3333333334, ans=0.2 2023-10-11 01:04:29,322 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=540073.3333333334, ans=0.0 2023-10-11 01:04:38,230 INFO [train.py:1031] (3/4) Epoch 9, batch 6500, loss[loss=0.2356, simple_loss=0.3258, pruned_loss=0.07265, over 16867.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2972, pruned_loss=0.06176, over 31536347.73 frames. ], batch size: 165, lr: 4.03e-03, grad_scale: 32.0 2023-10-11 01:04:50,626 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=540166.6666666666, ans=0.0 2023-10-11 01:05:47,643 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=540353.3333333334, ans=0.0 2023-10-11 01:05:49,849 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=540353.3333333334, ans=0.125 2023-10-11 01:05:50,777 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=540353.3333333334, ans=0.125 2023-10-11 01:06:00,659 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=540400.0, ans=0.125 2023-10-11 01:06:02,604 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=540400.0, ans=0.0 2023-10-11 01:06:08,465 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.428e+02 1.718e+02 1.914e+02 2.137e+02 2.962e+02, threshold=3.828e+02, percent-clipped=0.0 2023-10-11 01:06:17,154 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.60 vs. limit=22.5 2023-10-11 01:07:08,975 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 01:07:25,197 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=540773.3333333334, ans=0.125 2023-10-11 01:07:26,916 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=540773.3333333334, ans=0.95 2023-10-11 01:07:31,447 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.18 vs. limit=6.0 2023-10-11 01:07:45,255 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff3.min_abs, batch_count=540820.0, ans=0.2 2023-10-11 01:07:50,796 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.14 vs. 
limit=22.5 2023-10-11 01:07:59,528 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.384e+02 1.691e+02 1.919e+02 2.307e+02 3.643e+02, threshold=3.837e+02, percent-clipped=0.0 2023-10-11 01:08:05,600 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=540913.3333333334, ans=0.125 2023-10-11 01:08:05,996 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.06 vs. limit=15.0 2023-10-11 01:08:27,073 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=541006.6666666666, ans=0.0 2023-10-11 01:09:38,917 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=541286.6666666666, ans=0.125 2023-10-11 01:09:41,765 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=541333.3333333334, ans=0.09899494936611666 2023-10-11 01:09:51,849 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.371e+02 1.618e+02 1.768e+02 1.941e+02 2.761e+02, threshold=3.535e+02, percent-clipped=0.0 2023-10-11 01:10:27,884 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=541520.0, ans=0.0 2023-10-11 01:10:53,484 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 01:10:59,559 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=541566.6666666666, ans=0.125 2023-10-11 01:11:00,807 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=541566.6666666666, ans=0.125 2023-10-11 01:11:10,514 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=541613.3333333334, ans=0.025 2023-10-11 01:11:14,335 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=541660.0, ans=0.0 2023-10-11 01:11:50,425 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=541800.0, ans=0.125 2023-10-11 01:11:54,621 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.46 vs. 
limit=15.0 2023-10-11 01:12:02,035 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=541846.6666666666, ans=0.0 2023-10-11 01:12:02,774 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.360e+02 1.682e+02 1.850e+02 2.528e+02 4.430e+02, threshold=3.701e+02, percent-clipped=5.0 2023-10-11 01:12:10,091 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=541846.6666666666, ans=0.125 2023-10-11 01:12:19,530 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=541893.3333333334, ans=0.125 2023-10-11 01:12:24,537 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=541940.0, ans=0.0 2023-10-11 01:12:34,905 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=541986.6666666666, ans=0.0 2023-10-11 01:12:44,556 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.whiten.whitening_limit, batch_count=541986.6666666666, ans=15.0 2023-10-11 01:12:50,983 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=542033.3333333334, ans=0.0 2023-10-11 01:12:51,029 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=542033.3333333334, ans=0.125 2023-10-11 01:12:58,389 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=542080.0, ans=0.125 2023-10-11 01:13:06,038 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=542080.0, ans=0.1 2023-10-11 01:13:41,255 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=542266.6666666666, ans=0.0 2023-10-11 01:13:47,938 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=542266.6666666666, ans=0.125 2023-10-11 01:13:51,102 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.89 vs. limit=15.0 2023-10-11 01:13:54,597 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.369e+02 1.631e+02 1.824e+02 1.992e+02 2.818e+02, threshold=3.648e+02, percent-clipped=0.0 2023-10-11 01:14:16,701 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.42 vs. limit=6.0 2023-10-11 01:14:17,912 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=542406.6666666666, ans=0.125 2023-10-11 01:14:25,695 INFO [train.py:1031] (3/4) Epoch 9, batch 7000, loss[loss=0.2542, simple_loss=0.3213, pruned_loss=0.09354, over 16040.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2976, pruned_loss=0.0616, over 31845194.49 frames. 
], batch size: 297, lr: 4.02e-03, grad_scale: 32.0 2023-10-11 01:14:30,523 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=542453.3333333334, ans=0.125 2023-10-11 01:14:53,689 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=542546.6666666666, ans=0.0 2023-10-11 01:15:06,836 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=542593.3333333334, ans=0.0 2023-10-11 01:15:08,681 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=542593.3333333334, ans=0.125 2023-10-11 01:15:30,681 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.52 vs. limit=6.0 2023-10-11 01:15:42,296 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=542733.3333333334, ans=0.125 2023-10-11 01:15:50,748 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.443e+02 1.742e+02 1.936e+02 2.147e+02 2.999e+02, threshold=3.873e+02, percent-clipped=0.0 2023-10-11 01:15:53,346 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=542780.0, ans=0.0 2023-10-11 01:16:00,107 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=542780.0, ans=0.1 2023-10-11 01:16:02,564 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=542826.6666666666, ans=0.125 2023-10-11 01:16:06,847 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=542826.6666666666, ans=0.125 2023-10-11 01:16:17,355 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=542873.3333333334, ans=0.125 2023-10-11 01:16:29,628 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=542920.0, ans=0.1 2023-10-11 01:16:50,224 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=5.98 vs. limit=15.0 2023-10-11 01:17:01,207 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=543060.0, ans=0.0 2023-10-11 01:17:12,069 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.64 vs. 
limit=6.0 2023-10-11 01:17:26,041 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=543153.3333333334, ans=0.125 2023-10-11 01:17:35,341 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=543200.0, ans=0.0 2023-10-11 01:17:45,058 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.375e+02 1.725e+02 1.902e+02 2.077e+02 3.151e+02, threshold=3.803e+02, percent-clipped=0.0 2023-10-11 01:17:49,436 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=543246.6666666666, ans=0.1 2023-10-11 01:18:15,491 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=543386.6666666666, ans=0.125 2023-10-11 01:18:32,992 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=543386.6666666666, ans=0.0 2023-10-11 01:18:38,605 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=543433.3333333334, ans=0.0 2023-10-11 01:18:46,896 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=543433.3333333334, ans=0.0 2023-10-11 01:18:48,418 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=543433.3333333334, ans=10.0 2023-10-11 01:19:55,803 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.375e+02 1.672e+02 1.862e+02 2.136e+02 3.132e+02, threshold=3.723e+02, percent-clipped=0.0 2023-10-11 01:20:25,428 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.24 vs. limit=15.0 2023-10-11 01:20:50,398 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=543946.6666666666, ans=0.2 2023-10-11 01:20:54,250 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=543946.6666666666, ans=0.0 2023-10-11 01:21:08,769 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.04 vs. limit=22.5 2023-10-11 01:21:14,812 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.40 vs. limit=22.5 2023-10-11 01:21:28,234 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=544086.6666666666, ans=0.0 2023-10-11 01:21:29,286 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=544086.6666666666, ans=0.125 2023-10-11 01:21:43,216 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=544133.3333333334, ans=0.125 2023-10-11 01:21:49,918 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.214e+02 1.640e+02 1.778e+02 2.042e+02 2.909e+02, threshold=3.557e+02, percent-clipped=0.0 2023-10-11 01:22:07,405 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.46 vs. 
limit=6.0 2023-10-11 01:22:17,777 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=544273.3333333334, ans=0.0 2023-10-11 01:22:24,623 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=544320.0, ans=0.125 2023-10-11 01:22:37,826 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.00 vs. limit=15.0 2023-10-11 01:22:41,799 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=544366.6666666666, ans=0.2 2023-10-11 01:22:44,510 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=544413.3333333334, ans=0.1 2023-10-11 01:22:51,650 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=544413.3333333334, ans=0.0 2023-10-11 01:22:58,249 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=544460.0, ans=0.0 2023-10-11 01:23:03,474 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=544460.0, ans=0.125 2023-10-11 01:23:22,862 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.63 vs. limit=22.5 2023-10-11 01:23:23,708 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=544553.3333333334, ans=0.1 2023-10-11 01:23:40,288 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.422e+02 1.790e+02 1.905e+02 2.110e+02 4.515e+02, threshold=3.809e+02, percent-clipped=1.0 2023-10-11 01:24:13,461 INFO [train.py:1031] (3/4) Epoch 9, batch 7500, loss[loss=0.2156, simple_loss=0.3034, pruned_loss=0.06391, over 16888.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2975, pruned_loss=0.06172, over 32039182.17 frames. ], batch size: 155, lr: 4.01e-03, grad_scale: 32.0 2023-10-11 01:24:21,967 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=544786.6666666666, ans=0.125 2023-10-11 01:24:26,452 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=544833.3333333334, ans=0.1 2023-10-11 01:24:27,486 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=544833.3333333334, ans=0.125 2023-10-11 01:25:02,656 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=544973.3333333334, ans=0.125 2023-10-11 01:25:10,391 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=10.41 vs. limit=22.5 2023-10-11 01:25:15,251 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.11 vs. 
limit=10.0 2023-10-11 01:25:18,561 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=545020.0, ans=0.125 2023-10-11 01:25:24,623 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.61 vs. limit=15.0 2023-10-11 01:25:33,444 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.416e+02 1.735e+02 1.937e+02 2.293e+02 3.427e+02, threshold=3.873e+02, percent-clipped=0.0 2023-10-11 01:25:39,869 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=545113.3333333334, ans=0.1 2023-10-11 01:25:48,272 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=545160.0, ans=0.1 2023-10-11 01:25:48,521 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.36 vs. limit=22.5 2023-10-11 01:25:54,695 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=545206.6666666666, ans=0.1 2023-10-11 01:25:54,740 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=545206.6666666666, ans=0.125 2023-10-11 01:26:17,790 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=545300.0, ans=0.1 2023-10-11 01:26:34,137 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=545346.6666666666, ans=0.125 2023-10-11 01:26:43,242 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=545393.3333333334, ans=0.125 2023-10-11 01:26:44,972 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.73 vs. limit=15.0 2023-10-11 01:26:49,750 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=545393.3333333334, ans=0.5 2023-10-11 01:27:09,647 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.59 vs. 
limit=22.5 2023-10-11 01:27:14,235 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=545486.6666666666, ans=0.09899494936611666 2023-10-11 01:27:23,444 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=545533.3333333334, ans=0.125 2023-10-11 01:27:24,508 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=545533.3333333334, ans=0.125 2023-10-11 01:27:30,062 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=545533.3333333334, ans=0.0 2023-10-11 01:27:36,541 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.412e+02 1.646e+02 1.821e+02 2.108e+02 2.790e+02, threshold=3.641e+02, percent-clipped=0.0 2023-10-11 01:27:36,800 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=545580.0, ans=0.2 2023-10-11 01:27:50,444 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.94 vs. limit=10.0 2023-10-11 01:27:55,049 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=3.98 vs. limit=10.0 2023-10-11 01:28:14,212 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=545720.0, ans=0.1 2023-10-11 01:28:25,173 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=545766.6666666666, ans=0.2 2023-10-11 01:28:26,008 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=545766.6666666666, ans=0.0 2023-10-11 01:28:33,079 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.81 vs. limit=15.0 2023-10-11 01:28:43,212 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.51 vs. limit=15.0 2023-10-11 01:28:50,930 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=545906.6666666666, ans=0.95 2023-10-11 01:29:19,751 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=546000.0, ans=0.125 2023-10-11 01:29:25,993 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.351e+02 1.680e+02 1.830e+02 2.088e+02 3.460e+02, threshold=3.661e+02, percent-clipped=0.0 2023-10-11 01:29:41,574 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=546093.3333333334, ans=0.0 2023-10-11 01:29:55,708 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=546140.0, ans=0.0 2023-10-11 01:30:25,619 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=546280.0, ans=0.125 2023-10-11 01:31:07,972 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.41 vs. 
limit=15.0 2023-10-11 01:31:16,314 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.12 vs. limit=15.0 2023-10-11 01:31:20,942 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.437e+02 1.733e+02 1.916e+02 2.079e+02 2.951e+02, threshold=3.832e+02, percent-clipped=0.0 2023-10-11 01:31:54,890 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=546653.3333333334, ans=0.0 2023-10-11 01:31:58,019 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=546653.3333333334, ans=0.0 2023-10-11 01:32:41,616 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.01 vs. limit=15.0 2023-10-11 01:32:46,577 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=546840.0, ans=0.0 2023-10-11 01:32:54,606 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=546886.6666666666, ans=0.1 2023-10-11 01:33:05,504 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.00 vs. limit=10.0 2023-10-11 01:33:15,677 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 01:33:17,722 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.326e+02 1.671e+02 1.915e+02 2.095e+02 2.809e+02, threshold=3.830e+02, percent-clipped=0.0 2023-10-11 01:33:34,889 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=547026.6666666666, ans=0.1 2023-10-11 01:33:50,010 INFO [train.py:1031] (3/4) Epoch 9, batch 8000, loss[loss=0.1852, simple_loss=0.2776, pruned_loss=0.04643, over 16847.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2969, pruned_loss=0.06117, over 32219221.30 frames. ], batch size: 155, lr: 4.00e-03, grad_scale: 32.0 2023-10-11 01:33:50,331 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=547120.0, ans=0.2 2023-10-11 01:33:50,335 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=547120.0, ans=0.1 2023-10-11 01:34:07,266 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=547166.6666666666, ans=0.0 2023-10-11 01:34:41,800 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=547353.3333333334, ans=0.2 2023-10-11 01:34:53,944 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=547400.0, ans=0.2 2023-10-11 01:34:56,181 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.14 vs. 
limit=15.0 2023-10-11 01:35:03,964 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=547400.0, ans=0.1 2023-10-11 01:35:06,643 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.270e+02 1.632e+02 1.817e+02 2.096e+02 2.954e+02, threshold=3.634e+02, percent-clipped=0.0 2023-10-11 01:35:12,223 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.86 vs. limit=15.0 2023-10-11 01:35:18,119 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=547493.3333333334, ans=0.1 2023-10-11 01:35:18,153 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=547493.3333333334, ans=0.05 2023-10-11 01:35:26,051 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=547493.3333333334, ans=0.125 2023-10-11 01:35:28,986 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=547540.0, ans=0.125 2023-10-11 01:35:38,201 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=547540.0, ans=0.0 2023-10-11 01:35:47,859 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=547586.6666666666, ans=0.2 2023-10-11 01:36:33,756 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=547773.3333333334, ans=0.125 2023-10-11 01:36:33,766 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=547773.3333333334, ans=0.1 2023-10-11 01:36:45,464 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=547820.0, ans=0.0 2023-10-11 01:37:13,485 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.420e+02 1.697e+02 1.834e+02 2.076e+02 3.082e+02, threshold=3.668e+02, percent-clipped=0.0 2023-10-11 01:37:59,577 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=548053.3333333334, ans=0.125 2023-10-11 01:38:20,414 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=548146.6666666666, ans=0.2 2023-10-11 01:38:32,687 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=548193.3333333334, ans=0.0 2023-10-11 01:38:46,233 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=548240.0, ans=0.0 2023-10-11 01:38:58,281 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=548286.6666666666, ans=0.125 2023-10-11 01:39:01,774 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.43 vs. 
limit=6.0 2023-10-11 01:39:06,659 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=548333.3333333334, ans=0.125 2023-10-11 01:39:06,769 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=548333.3333333334, ans=0.0 2023-10-11 01:39:12,379 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.389e+02 1.774e+02 1.966e+02 2.237e+02 3.257e+02, threshold=3.932e+02, percent-clipped=0.0 2023-10-11 01:39:24,162 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=548426.6666666666, ans=0.1 2023-10-11 01:39:28,809 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=548426.6666666666, ans=0.125 2023-10-11 01:39:39,042 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.82 vs. limit=10.0 2023-10-11 01:39:44,270 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=548473.3333333334, ans=0.125 2023-10-11 01:40:01,855 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=548566.6666666666, ans=0.0 2023-10-11 01:40:06,818 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=548566.6666666666, ans=0.2 2023-10-11 01:40:17,044 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=548613.3333333334, ans=0.125 2023-10-11 01:40:17,989 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=548660.0, ans=0.0 2023-10-11 01:40:18,023 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=548660.0, ans=0.125 2023-10-11 01:40:32,221 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=548706.6666666666, ans=0.1 2023-10-11 01:40:34,910 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=548706.6666666666, ans=0.07 2023-10-11 01:40:35,179 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.96 vs. limit=15.0 2023-10-11 01:40:39,501 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.66 vs. limit=15.0 2023-10-11 01:40:52,304 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.31 vs. 
limit=15.0 2023-10-11 01:40:55,870 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=548800.0, ans=0.2 2023-10-11 01:40:58,663 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=548800.0, ans=0.125 2023-10-11 01:41:04,338 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.459e+02 1.690e+02 1.883e+02 2.031e+02 2.895e+02, threshold=3.766e+02, percent-clipped=0.0 2023-10-11 01:41:18,654 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=548893.3333333334, ans=0.125 2023-10-11 01:41:37,676 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=548986.6666666666, ans=0.125 2023-10-11 01:41:55,364 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=549033.3333333334, ans=0.125 2023-10-11 01:42:44,518 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=549220.0, ans=0.05 2023-10-11 01:42:47,783 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=549266.6666666666, ans=0.125 2023-10-11 01:42:58,939 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.368e+02 1.780e+02 1.969e+02 2.317e+02 3.856e+02, threshold=3.938e+02, percent-clipped=1.0 2023-10-11 01:43:21,440 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=549360.0, ans=0.125 2023-10-11 01:43:27,554 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=549406.6666666666, ans=0.09899494936611666 2023-10-11 01:43:33,339 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=549406.6666666666, ans=0.2 2023-10-11 01:43:36,456 INFO [train.py:1031] (3/4) Epoch 9, batch 8500, loss[loss=0.2005, simple_loss=0.2966, pruned_loss=0.05218, over 15889.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2972, pruned_loss=0.06118, over 32346028.52 frames. ], batch size: 43, lr: 4.00e-03, grad_scale: 64.0 2023-10-11 01:43:49,326 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=549500.0, ans=0.0 2023-10-11 01:43:52,971 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=549500.0, ans=0.125 2023-10-11 01:44:29,222 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=549640.0, ans=0.125 2023-10-11 01:44:35,351 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.04 vs. limit=15.0 2023-10-11 01:44:36,253 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=549686.6666666666, ans=0.125 2023-10-11 01:44:47,689 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.95 vs. 
limit=15.0 2023-10-11 01:44:59,198 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.465e+02 1.746e+02 1.893e+02 2.167e+02 3.169e+02, threshold=3.787e+02, percent-clipped=0.0 2023-10-11 01:45:37,423 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=549920.0, ans=0.125 2023-10-11 01:46:00,360 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=549966.6666666666, ans=0.125 2023-10-11 01:46:16,020 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 01:46:25,755 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=12.32 vs. limit=22.5 2023-10-11 01:46:30,390 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=550106.6666666666, ans=0.0 2023-10-11 01:46:33,922 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=550106.6666666666, ans=0.125 2023-10-11 01:46:34,939 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=550153.3333333334, ans=0.0 2023-10-11 01:46:41,390 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=550153.3333333334, ans=0.1 2023-10-11 01:46:51,201 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.66 vs. limit=15.0 2023-10-11 01:47:02,673 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.345e+02 1.646e+02 1.777e+02 1.978e+02 2.733e+02, threshold=3.554e+02, percent-clipped=0.0 2023-10-11 01:47:15,127 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=550293.3333333334, ans=0.125 2023-10-11 01:47:27,212 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.59 vs. limit=10.0 2023-10-11 01:47:30,002 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=550340.0, ans=0.125 2023-10-11 01:47:32,100 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=550340.0, ans=0.0 2023-10-11 01:47:33,123 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=550340.0, ans=0.1 2023-10-11 01:47:51,514 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=550433.3333333334, ans=0.125 2023-10-11 01:47:59,725 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.28 vs. 
limit=10.0 2023-10-11 01:48:12,153 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=550480.0, ans=0.125 2023-10-11 01:48:19,572 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=550526.6666666666, ans=0.0 2023-10-11 01:48:30,812 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=550573.3333333334, ans=0.05 2023-10-11 01:49:06,226 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=550666.6666666666, ans=0.2 2023-10-11 01:49:10,127 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.370e+02 1.592e+02 1.763e+02 1.986e+02 2.897e+02, threshold=3.527e+02, percent-clipped=0.0 2023-10-11 01:49:15,384 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=550713.3333333334, ans=0.0 2023-10-11 01:49:19,600 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=550760.0, ans=0.0 2023-10-11 01:49:21,475 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=550760.0, ans=0.2 2023-10-11 01:49:53,546 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=550900.0, ans=0.125 2023-10-11 01:50:06,446 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=550946.6666666666, ans=0.125 2023-10-11 01:50:14,708 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.73 vs. limit=6.0 2023-10-11 01:50:18,735 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.18 vs. limit=15.0 2023-10-11 01:50:56,028 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=551180.0, ans=0.0 2023-10-11 01:50:57,509 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.390e+02 1.805e+02 2.003e+02 2.407e+02 3.310e+02, threshold=4.006e+02, percent-clipped=0.0 2023-10-11 01:51:12,593 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=551226.6666666666, ans=0.0 2023-10-11 01:51:19,362 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=551273.3333333334, ans=0.125 2023-10-11 01:51:27,233 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=551273.3333333334, ans=0.125 2023-10-11 01:51:36,132 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.51 vs. limit=15.0 2023-10-11 01:51:51,700 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=551413.3333333334, ans=0.0 2023-10-11 01:52:19,321 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.89 vs. 
limit=15.0 2023-10-11 01:52:24,440 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=551553.3333333334, ans=0.0 2023-10-11 01:52:47,676 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.344e+02 1.702e+02 1.922e+02 2.157e+02 2.775e+02, threshold=3.844e+02, percent-clipped=0.0 2023-10-11 01:53:01,278 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=551693.3333333334, ans=0.0 2023-10-11 01:53:02,094 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=551693.3333333334, ans=0.125 2023-10-11 01:53:16,451 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=551740.0, ans=0.07 2023-10-11 01:53:19,785 INFO [train.py:1031] (3/4) Epoch 9, batch 9000, loss[loss=0.2275, simple_loss=0.3164, pruned_loss=0.06929, over 16854.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2964, pruned_loss=0.06073, over 32467778.73 frames. ], batch size: 155, lr: 3.99e-03, grad_scale: 64.0 2023-10-11 01:54:08,966 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.04 vs. limit=15.0 2023-10-11 01:54:14,138 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=552020.0, ans=0.1 2023-10-11 01:54:17,996 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=552020.0, ans=0.0 2023-10-11 01:54:25,445 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=552066.6666666666, ans=0.1 2023-10-11 01:54:37,378 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.374e+02 1.649e+02 1.808e+02 1.972e+02 2.800e+02, threshold=3.615e+02, percent-clipped=0.0 2023-10-11 01:54:37,590 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=552113.3333333334, ans=0.0 2023-10-11 01:54:49,555 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=7.06 vs. limit=15.0 2023-10-11 01:55:42,000 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=552393.3333333334, ans=0.1 2023-10-11 01:55:51,926 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=552440.0, ans=0.125 2023-10-11 01:55:54,901 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.69 vs. 
limit=6.0 2023-10-11 01:56:03,654 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=552486.6666666666, ans=0.125 2023-10-11 01:56:24,926 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.427e+02 1.812e+02 1.991e+02 2.323e+02 3.468e+02, threshold=3.981e+02, percent-clipped=0.0 2023-10-11 01:56:50,230 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=552673.3333333334, ans=0.125 2023-10-11 01:57:00,901 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=552720.0, ans=0.125 2023-10-11 01:57:19,735 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=552813.3333333334, ans=0.0 2023-10-11 01:57:34,261 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=552860.0, ans=0.0 2023-10-11 01:57:36,970 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=552860.0, ans=0.125 2023-10-11 01:57:50,388 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 01:57:52,293 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=552953.3333333334, ans=0.125 2023-10-11 01:58:09,543 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 01:58:13,104 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.309e+02 1.716e+02 1.880e+02 2.089e+02 2.681e+02, threshold=3.760e+02, percent-clipped=0.0 2023-10-11 01:58:22,607 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=553093.3333333334, ans=0.125 2023-10-11 01:58:35,486 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=553140.0, ans=0.125 2023-10-11 01:58:36,489 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 01:58:39,315 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=553140.0, ans=0.125 2023-10-11 01:58:42,228 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=553186.6666666666, ans=0.125 2023-10-11 01:58:55,249 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=553233.3333333334, ans=0.125 2023-10-11 01:59:30,140 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=553373.3333333334, ans=0.125 2023-10-11 01:59:35,255 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=553373.3333333334, ans=0.125 2023-10-11 01:59:36,366 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=553373.3333333334, ans=0.125 2023-10-11 01:59:46,814 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, 
num_channels=384, metric=2.49 vs. limit=15.0 2023-10-11 01:59:49,136 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=553420.0, ans=0.0 2023-10-11 02:00:11,396 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=553513.3333333334, ans=0.125 2023-10-11 02:00:12,070 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.463e+02 1.782e+02 1.935e+02 2.195e+02 3.246e+02, threshold=3.870e+02, percent-clipped=0.0 2023-10-11 02:00:17,231 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=553513.3333333334, ans=0.0 2023-10-11 02:01:40,279 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=553840.0, ans=0.025 2023-10-11 02:02:11,555 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.389e+02 1.713e+02 1.908e+02 2.250e+02 3.753e+02, threshold=3.816e+02, percent-clipped=0.0 2023-10-11 02:02:24,718 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=554026.6666666666, ans=0.125 2023-10-11 02:02:25,747 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=554026.6666666666, ans=0.125 2023-10-11 02:02:31,706 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=554026.6666666666, ans=0.2 2023-10-11 02:02:32,721 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=554073.3333333334, ans=0.0 2023-10-11 02:02:39,735 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=554073.3333333334, ans=0.0 2023-10-11 02:02:43,895 INFO [train.py:1031] (3/4) Epoch 9, batch 9500, loss[loss=0.2108, simple_loss=0.2959, pruned_loss=0.06289, over 16506.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2969, pruned_loss=0.06103, over 32523956.53 frames. 
], batch size: 56, lr: 3.98e-03, grad_scale: 16.0 2023-10-11 02:02:57,854 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=554166.6666666666, ans=10.0 2023-10-11 02:03:10,551 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=554213.3333333334, ans=0.07 2023-10-11 02:03:15,969 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=554260.0, ans=0.2 2023-10-11 02:03:23,453 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=554260.0, ans=0.1 2023-10-11 02:03:23,497 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=554260.0, ans=0.125 2023-10-11 02:03:43,272 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=554353.3333333334, ans=0.125 2023-10-11 02:03:52,660 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=554400.0, ans=0.125 2023-10-11 02:03:56,619 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=554400.0, ans=0.1 2023-10-11 02:04:03,652 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.407e+02 1.642e+02 1.783e+02 1.987e+02 2.928e+02, threshold=3.566e+02, percent-clipped=0.0 2023-10-11 02:04:04,024 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=554446.6666666666, ans=0.125 2023-10-11 02:04:07,092 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=554446.6666666666, ans=0.0 2023-10-11 02:04:08,345 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.35 vs. limit=15.0 2023-10-11 02:04:12,687 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=554493.3333333334, ans=0.125 2023-10-11 02:04:29,094 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=554540.0, ans=0.05 2023-10-11 02:04:41,675 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=554586.6666666666, ans=0.2 2023-10-11 02:04:42,684 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 02:04:48,430 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=554633.3333333334, ans=0.0 2023-10-11 02:05:01,232 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.11 vs. limit=22.5 2023-10-11 02:05:08,407 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.85 vs. 
limit=15.0 2023-10-11 02:05:09,073 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=554726.6666666666, ans=0.125 2023-10-11 02:05:19,286 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=554773.3333333334, ans=0.1 2023-10-11 02:05:24,632 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=554773.3333333334, ans=0.0 2023-10-11 02:05:42,533 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.14 vs. limit=15.0 2023-10-11 02:05:54,918 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=554913.3333333334, ans=0.125 2023-10-11 02:05:57,831 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.438e+02 1.693e+02 1.868e+02 2.087e+02 2.830e+02, threshold=3.736e+02, percent-clipped=0.0 2023-10-11 02:06:09,427 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 02:06:10,903 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=554960.0, ans=0.125 2023-10-11 02:06:25,280 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=7.33 vs. limit=15.0 2023-10-11 02:06:26,000 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=555053.3333333334, ans=0.125 2023-10-11 02:06:29,910 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=555053.3333333334, ans=0.125 2023-10-11 02:06:35,320 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=555053.3333333334, ans=0.0 2023-10-11 02:06:37,712 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=555100.0, ans=0.0 2023-10-11 02:06:40,351 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.31 vs. limit=15.0 2023-10-11 02:06:45,670 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=555100.0, ans=0.125 2023-10-11 02:06:54,984 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=555146.6666666666, ans=0.1 2023-10-11 02:07:09,608 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.12 vs. limit=15.0 2023-10-11 02:07:20,667 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=555240.0, ans=0.125 2023-10-11 02:07:31,149 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.42 vs. 
limit=15.0 2023-10-11 02:07:41,146 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=555333.3333333334, ans=0.0 2023-10-11 02:07:47,430 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.435e+02 1.652e+02 1.883e+02 2.236e+02 3.802e+02, threshold=3.766e+02, percent-clipped=1.0 2023-10-11 02:08:02,288 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.49 vs. limit=6.0 2023-10-11 02:08:19,777 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.43 vs. limit=15.0 2023-10-11 02:08:42,164 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=555613.3333333334, ans=0.125 2023-10-11 02:09:00,391 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=555660.0, ans=0.0 2023-10-11 02:09:32,615 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.99 vs. limit=15.0 2023-10-11 02:09:40,329 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=555846.6666666666, ans=0.0 2023-10-11 02:09:41,900 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.412e+02 1.700e+02 1.941e+02 2.223e+02 3.027e+02, threshold=3.882e+02, percent-clipped=0.0 2023-10-11 02:09:44,456 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=555846.6666666666, ans=0.125 2023-10-11 02:09:49,368 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=555893.3333333334, ans=0.125 2023-10-11 02:10:04,799 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=555940.0, ans=0.0 2023-10-11 02:10:15,708 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=555986.6666666666, ans=0.125 2023-10-11 02:10:22,779 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=556033.3333333334, ans=0.05 2023-10-11 02:10:23,694 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=556033.3333333334, ans=0.125 2023-10-11 02:10:24,756 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=556033.3333333334, ans=0.0 2023-10-11 02:10:38,249 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=556080.0, ans=0.125 2023-10-11 02:10:44,326 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=556126.6666666666, ans=0.2 2023-10-11 02:10:51,427 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=556126.6666666666, ans=0.0 2023-10-11 02:10:55,064 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=556173.3333333334, ans=0.125 2023-10-11 02:10:56,114 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=556173.3333333334, ans=0.125 2023-10-11 02:11:04,158 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=556220.0, ans=0.0 2023-10-11 02:11:09,723 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=556220.0, ans=0.2 2023-10-11 02:11:14,157 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 02:11:30,545 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.363e+02 1.718e+02 1.934e+02 2.146e+02 3.258e+02, threshold=3.867e+02, percent-clipped=0.0 2023-10-11 02:11:38,913 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=556360.0, ans=0.07 2023-10-11 02:11:40,038 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.63 vs. limit=15.0 2023-10-11 02:11:53,704 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=556406.6666666666, ans=0.125 2023-10-11 02:11:58,935 INFO [train.py:1031] (3/4) Epoch 9, batch 10000, loss[loss=0.2251, simple_loss=0.3105, pruned_loss=0.06984, over 16817.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.296, pruned_loss=0.06062, over 32594735.89 frames. ], batch size: 188, lr: 3.97e-03, grad_scale: 32.0 2023-10-11 02:12:07,909 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=556500.0, ans=0.0 2023-10-11 02:12:47,442 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=556640.0, ans=0.125 2023-10-11 02:13:16,730 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=556780.0, ans=0.0 2023-10-11 02:13:17,314 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.426e+02 1.710e+02 1.863e+02 2.164e+02 3.346e+02, threshold=3.727e+02, percent-clipped=0.0 2023-10-11 02:13:26,146 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=556826.6666666666, ans=0.125 2023-10-11 02:13:33,927 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=556826.6666666666, ans=0.04949747468305833 2023-10-11 02:13:34,965 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=556826.6666666666, ans=0.1 2023-10-11 02:13:35,001 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=556826.6666666666, ans=0.2 2023-10-11 02:13:43,142 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=556873.3333333334, ans=0.0 2023-10-11 02:13:44,223 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=556873.3333333334, ans=0.1 2023-10-11 02:13:46,660 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=556873.3333333334, ans=0.0 2023-10-11 02:13:58,243 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=556920.0, ans=0.1 2023-10-11 02:14:06,895 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 02:14:07,775 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=556966.6666666666, ans=0.0 2023-10-11 02:14:15,299 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.25 vs. limit=15.0 2023-10-11 02:14:16,002 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.24 vs. limit=15.0 2023-10-11 02:14:52,898 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.61 vs. limit=15.0 2023-10-11 02:14:54,834 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=557200.0, ans=0.125 2023-10-11 02:15:09,815 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.342e+02 1.824e+02 2.054e+02 2.470e+02 3.937e+02, threshold=4.107e+02, percent-clipped=1.0 2023-10-11 02:15:23,237 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=557293.3333333334, ans=0.1 2023-10-11 02:15:23,242 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=557293.3333333334, ans=0.0 2023-10-11 02:15:31,747 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=557340.0, ans=0.125 2023-10-11 02:15:34,692 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=557340.0, ans=0.125 2023-10-11 02:15:45,790 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=557386.6666666666, ans=0.125 2023-10-11 02:16:11,243 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=557480.0, ans=0.1 2023-10-11 02:16:23,491 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=557526.6666666666, ans=0.0 2023-10-11 02:16:44,615 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 02:16:45,903 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.59 vs. limit=22.5 2023-10-11 02:16:51,725 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=557666.6666666666, ans=0.09899494936611666 2023-10-11 02:17:04,611 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.386e+02 1.705e+02 1.914e+02 2.207e+02 3.319e+02, threshold=3.827e+02, percent-clipped=0.0 2023-10-11 02:17:24,440 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=557760.0, ans=0.125 2023-10-11 02:17:26,559 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=3.89 vs. 
limit=12.0 2023-10-11 02:17:27,143 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=557806.6666666666, ans=0.125 2023-10-11 02:17:37,602 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=557853.3333333334, ans=0.1 2023-10-11 02:17:45,899 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=557853.3333333334, ans=0.125 2023-10-11 02:18:10,106 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=557946.6666666666, ans=0.05 2023-10-11 02:18:21,574 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0 2023-10-11 02:18:36,835 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=558086.6666666666, ans=0.0 2023-10-11 02:18:54,779 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.60 vs. limit=15.0 2023-10-11 02:18:58,691 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=558133.3333333334, ans=0.07 2023-10-11 02:19:05,754 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.331e+02 1.677e+02 1.837e+02 2.135e+02 2.957e+02, threshold=3.673e+02, percent-clipped=0.0 2023-10-11 02:19:44,359 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=558320.0, ans=0.04949747468305833 2023-10-11 02:20:00,845 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=558413.3333333334, ans=0.1 2023-10-11 02:20:15,772 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.011e-02 2023-10-11 02:20:28,060 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=558506.6666666666, ans=0.2 2023-10-11 02:20:43,274 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=558553.3333333334, ans=0.125 2023-10-11 02:20:52,826 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=558600.0, ans=0.125 2023-10-11 02:20:54,443 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=558600.0, ans=0.0 2023-10-11 02:20:59,277 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=558646.6666666666, ans=0.125 2023-10-11 02:21:01,889 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.375e+02 1.692e+02 1.848e+02 2.088e+02 3.397e+02, threshold=3.697e+02, percent-clipped=0.0 2023-10-11 02:21:24,202 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.98 vs. limit=15.0 2023-10-11 02:21:28,455 INFO [train.py:1031] (3/4) Epoch 9, batch 10500, loss[loss=0.2418, simple_loss=0.3186, pruned_loss=0.08249, over 16486.00 frames. 
], tot_loss[loss=0.2088, simple_loss=0.2964, pruned_loss=0.06063, over 32650094.94 frames. ], batch size: 266, lr: 3.96e-03, grad_scale: 32.0 2023-10-11 02:21:37,431 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=558786.6666666666, ans=0.125 2023-10-11 02:21:47,030 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=558833.3333333334, ans=0.125 2023-10-11 02:22:05,945 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=558926.6666666666, ans=0.125 2023-10-11 02:22:06,003 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=558926.6666666666, ans=0.125 2023-10-11 02:22:09,452 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=558926.6666666666, ans=0.125 2023-10-11 02:22:15,945 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=558973.3333333334, ans=0.125 2023-10-11 02:22:19,225 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=558973.3333333334, ans=0.0 2023-10-11 02:22:25,116 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=559020.0, ans=0.2 2023-10-11 02:22:35,205 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=559066.6666666666, ans=0.0 2023-10-11 02:22:47,242 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=559113.3333333334, ans=0.125 2023-10-11 02:22:51,596 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=559113.3333333334, ans=0.125 2023-10-11 02:22:52,952 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=559113.3333333334, ans=0.0 2023-10-11 02:22:53,594 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.328e+02 1.706e+02 1.870e+02 2.179e+02 2.943e+02, threshold=3.741e+02, percent-clipped=0.0 2023-10-11 02:22:58,051 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=559113.3333333334, ans=0.0 2023-10-11 02:23:06,306 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=559160.0, ans=0.125 2023-10-11 02:23:08,849 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 02:23:17,146 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 02:23:19,950 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.75 vs. limit=22.5 2023-10-11 02:23:32,563 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=559253.3333333334, ans=0.125 2023-10-11 02:23:35,138 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.44 vs. 
limit=15.0 2023-10-11 02:23:51,796 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=559346.6666666666, ans=0.0 2023-10-11 02:24:11,784 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=559440.0, ans=0.0 2023-10-11 02:24:18,523 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=559440.0, ans=0.0 2023-10-11 02:24:22,088 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 02:24:41,886 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=559533.3333333334, ans=0.125 2023-10-11 02:24:44,645 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=559580.0, ans=0.0 2023-10-11 02:24:48,049 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.352e+02 1.701e+02 1.882e+02 2.092e+02 3.275e+02, threshold=3.765e+02, percent-clipped=0.0 2023-10-11 02:24:52,220 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=559580.0, ans=0.05 2023-10-11 02:24:52,268 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=559580.0, ans=0.0 2023-10-11 02:25:01,609 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=559626.6666666666, ans=0.125 2023-10-11 02:25:08,414 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=559673.3333333334, ans=0.0 2023-10-11 02:25:10,237 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=559673.3333333334, ans=0.0 2023-10-11 02:25:15,640 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.84 vs. limit=15.0 2023-10-11 02:25:43,473 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=559813.3333333334, ans=0.0 2023-10-11 02:25:49,163 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=559813.3333333334, ans=0.0 2023-10-11 02:26:12,996 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=559906.6666666666, ans=0.125 2023-10-11 02:26:13,872 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=559906.6666666666, ans=0.04949747468305833 2023-10-11 02:26:26,008 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=560000.0, ans=0.125 2023-10-11 02:26:26,017 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=560000.0, ans=0.0 2023-10-11 02:26:29,946 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.27 vs. 
limit=15.0 2023-10-11 02:26:40,916 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=560000.0, ans=0.0 2023-10-11 02:26:44,162 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.whiten.whitening_limit, batch_count=560046.6666666666, ans=15.0 2023-10-11 02:26:46,103 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.235e+02 1.759e+02 2.001e+02 2.384e+02 4.309e+02, threshold=4.001e+02, percent-clipped=4.0 2023-10-11 02:27:21,441 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=560186.6666666666, ans=0.125 2023-10-11 02:27:24,659 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.46 vs. limit=15.0 2023-10-11 02:27:48,170 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.06 vs. limit=12.0 2023-10-11 02:27:56,906 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.46 vs. limit=15.0 2023-10-11 02:28:01,346 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.61 vs. limit=15.0 2023-10-11 02:28:40,289 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.346e+02 1.644e+02 1.831e+02 2.037e+02 3.139e+02, threshold=3.662e+02, percent-clipped=0.0 2023-10-11 02:28:50,337 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=560560.0, ans=0.125 2023-10-11 02:29:20,996 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=560700.0, ans=0.0 2023-10-11 02:29:23,373 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=560700.0, ans=0.035 2023-10-11 02:29:54,960 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff3.min_abs, batch_count=560840.0, ans=0.2 2023-10-11 02:29:59,569 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=560840.0, ans=0.125 2023-10-11 02:30:20,509 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.40 vs. limit=15.0 2023-10-11 02:30:29,528 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.351e+02 1.628e+02 1.861e+02 2.137e+02 3.350e+02, threshold=3.723e+02, percent-clipped=0.0 2023-10-11 02:30:29,891 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=560980.0, ans=0.0 2023-10-11 02:30:58,632 INFO [train.py:1031] (3/4) Epoch 9, batch 11000, loss[loss=0.2138, simple_loss=0.3099, pruned_loss=0.05886, over 16686.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2963, pruned_loss=0.06062, over 32651658.76 frames. ], batch size: 202, lr: 3.95e-03, grad_scale: 32.0 2023-10-11 02:30:59,245 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.03 vs. 
limit=15.0 2023-10-11 02:31:07,764 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=561120.0, ans=0.125 2023-10-11 02:31:15,148 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=561166.6666666666, ans=0.1 2023-10-11 02:31:24,292 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=561213.3333333334, ans=0.125 2023-10-11 02:31:25,714 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=561213.3333333334, ans=0.125 2023-10-11 02:31:32,203 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=4.97 vs. limit=15.0 2023-10-11 02:31:38,896 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=561260.0, ans=0.0 2023-10-11 02:31:44,502 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.72 vs. limit=6.0 2023-10-11 02:31:53,966 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=561353.3333333334, ans=0.0 2023-10-11 02:31:59,727 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=561353.3333333334, ans=0.125 2023-10-11 02:32:10,017 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=561400.0, ans=0.0 2023-10-11 02:32:19,426 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=561446.6666666666, ans=0.125 2023-10-11 02:32:24,498 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.368e+02 1.842e+02 2.021e+02 2.298e+02 3.997e+02, threshold=4.043e+02, percent-clipped=1.0 2023-10-11 02:32:24,780 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=561446.6666666666, ans=0.1 2023-10-11 02:32:34,528 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=561493.3333333334, ans=0.07 2023-10-11 02:32:35,502 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=561493.3333333334, ans=0.0 2023-10-11 02:32:46,184 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=561540.0, ans=0.125 2023-10-11 02:32:55,597 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=561586.6666666666, ans=0.0 2023-10-11 02:33:13,760 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=561633.3333333334, ans=0.0 2023-10-11 02:33:16,420 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=561633.3333333334, ans=0.125 2023-10-11 02:33:22,512 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=561680.0, ans=0.1 2023-10-11 02:33:35,671 INFO [scaling.py:979] (3/4) Whitening: 
name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.08 vs. limit=15.0 2023-10-11 02:33:40,670 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=561726.6666666666, ans=0.0 2023-10-11 02:33:47,761 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=561773.3333333334, ans=0.2 2023-10-11 02:34:01,857 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.87 vs. limit=15.0 2023-10-11 02:34:21,862 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=561913.3333333334, ans=0.1 2023-10-11 02:34:25,924 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.365e+02 1.688e+02 1.877e+02 2.286e+02 3.190e+02, threshold=3.755e+02, percent-clipped=0.0 2023-10-11 02:34:35,931 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=561960.0, ans=0.1 2023-10-11 02:34:42,993 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=562006.6666666666, ans=0.125 2023-10-11 02:34:50,996 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=562006.6666666666, ans=0.0 2023-10-11 02:35:02,085 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.27 vs. limit=15.0 2023-10-11 02:35:34,239 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=562193.3333333334, ans=0.04949747468305833 2023-10-11 02:35:39,255 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=562240.0, ans=0.125 2023-10-11 02:35:39,290 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=562240.0, ans=0.2 2023-10-11 02:35:42,832 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=562240.0, ans=0.2 2023-10-11 02:35:49,386 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=562286.6666666666, ans=0.125 2023-10-11 02:36:17,877 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.448e+02 1.633e+02 1.767e+02 1.977e+02 2.630e+02, threshold=3.535e+02, percent-clipped=0.0 2023-10-11 02:36:28,780 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-11 02:36:57,399 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=562520.0, ans=0.125 2023-10-11 02:36:57,442 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=562520.0, ans=0.125 2023-10-11 02:37:03,339 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=562566.6666666666, ans=0.05 2023-10-11 02:37:45,233 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, 
batch_count=562706.6666666666, ans=0.1 2023-10-11 02:37:59,051 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=562800.0, ans=0.125 2023-10-11 02:38:06,743 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=562800.0, ans=0.0 2023-10-11 02:38:12,796 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.319e+02 1.660e+02 1.855e+02 2.054e+02 3.168e+02, threshold=3.710e+02, percent-clipped=0.0 2023-10-11 02:38:28,077 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=562940.0, ans=0.0 2023-10-11 02:38:31,604 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 02:38:37,004 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=562940.0, ans=0.0 2023-10-11 02:38:56,327 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.41 vs. limit=15.0 2023-10-11 02:39:52,909 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=563266.6666666666, ans=0.0 2023-10-11 02:39:54,739 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=563266.6666666666, ans=0.0 2023-10-11 02:39:54,903 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=563266.6666666666, ans=0.0 2023-10-11 02:40:03,799 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=563313.3333333334, ans=0.1 2023-10-11 02:40:04,446 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.466e+02 1.783e+02 1.986e+02 2.246e+02 3.484e+02, threshold=3.971e+02, percent-clipped=0.0 2023-10-11 02:40:05,693 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=563313.3333333334, ans=0.04949747468305833 2023-10-11 02:40:17,771 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=563360.0, ans=0.125 2023-10-11 02:40:32,604 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.85 vs. limit=15.0 2023-10-11 02:40:33,074 INFO [train.py:1031] (3/4) Epoch 9, batch 11500, loss[loss=0.1915, simple_loss=0.2887, pruned_loss=0.04711, over 16861.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.2961, pruned_loss=0.06061, over 32679747.40 frames. ], batch size: 98, lr: 3.95e-03, grad_scale: 32.0 2023-10-11 02:40:34,636 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.27 vs. 
2023-10-11 02:40:37,219 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=563453.3333333334, ans=0.1 2023-10-11 02:41:16,034 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=563640.0, ans=0.0 2023-10-11 02:41:26,148 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=563640.0, ans=0.0 2023-10-11 02:41:56,775 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.444e+02 1.711e+02 1.934e+02 2.141e+02 2.814e+02, threshold=3.869e+02, percent-clipped=0.0 2023-10-11 02:41:58,343 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.48 vs. limit=15.0 2023-10-11 02:42:28,990 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 02:42:29,392 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.43 vs. limit=6.0 2023-10-11 02:42:42,670 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=563966.6666666666, ans=0.0 2023-10-11 02:42:43,849 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=1.97 vs. limit=15.0 2023-10-11 02:43:08,598 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=564060.0, ans=0.125 2023-10-11 02:43:23,363 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=564106.6666666666, ans=0.125 2023-10-11 02:43:44,674 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=564200.0, ans=0.125 2023-10-11 02:43:51,182 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.414e+02 1.630e+02 1.759e+02 2.031e+02 2.945e+02, threshold=3.519e+02, percent-clipped=0.0 2023-10-11 02:43:52,423 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=564246.6666666666, ans=0.2 2023-10-11 02:44:07,056 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=564340.0, ans=0.1 2023-10-11 02:44:14,598 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=564340.0, ans=0.1 2023-10-11 02:44:16,423 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=564340.0, ans=0.1 2023-10-11 02:44:21,126 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=564386.6666666666, ans=0.0 2023-10-11 02:45:18,625 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=564620.0, ans=0.125 2023-10-11 02:45:54,328 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.255e+02 1.641e+02 1.849e+02 2.148e+02 2.920e+02, threshold=3.698e+02, percent-clipped=0.0 2023-10-11 02:45:54,602 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=564713.3333333334, ans=0.09899494936611666
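The [optim.py:471] records above print five quantiles (min, 25%, median, 75%, max) of recently observed gradient norms together with a clipping threshold; in every record here the threshold equals Clipping_scale times the median, e.g. 2.0 * 1.849e+02 = 3.698e+02. A hedged sketch of that bookkeeping follows, with illustrative names rather than icefall's actual optim.py internals:

```python
# Sketch (assumed, not icefall's optim.py): summarize a window of recent
# gradient norms as quantiles and derive the clipping threshold as
# clipping_scale * median, which matches every record in this log.
import torch

def clipping_report(recent_grad_norms: torch.Tensor, clipping_scale: float = 2.0):
    q = torch.quantile(recent_grad_norms,
                       torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    threshold = clipping_scale * q[2]               # scale times the median
    pct = 100.0 * (recent_grad_norms > threshold).float().mean()
    return q, threshold, pct

norms = torch.tensor([125.5, 164.1, 184.9, 214.8, 292.0])  # values from above
q, thr, pct = clipping_report(norms)
print(f"grad-norm quartiles {q.tolist()}, threshold={thr:.4g}, "
      f"percent-clipped={pct:.1f}")                # threshold=369.8
```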
2023-10-11 02:45:54,660 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_na.min_abs, batch_count=564713.3333333334, ans=0.02 2023-10-11 02:45:58,350 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.19 vs. limit=15.0 2023-10-11 02:46:00,850 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-11 02:46:06,828 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=564760.0, ans=0.0 2023-10-11 02:46:08,718 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=564760.0, ans=0.0 2023-10-11 02:46:13,837 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=564806.6666666666, ans=0.2 2023-10-11 02:46:18,869 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=564806.6666666666, ans=0.2 2023-10-11 02:46:28,662 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=564853.3333333334, ans=0.125 2023-10-11 02:47:07,580 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.29 vs. limit=15.0 2023-10-11 02:47:27,274 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=565086.6666666666, ans=0.125 2023-10-11 02:47:52,982 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.419e+02 1.667e+02 1.867e+02 2.069e+02 2.411e+02, threshold=3.733e+02, percent-clipped=0.0 2023-10-11 02:47:57,974 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=565180.0, ans=0.0 2023-10-11 02:48:23,754 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=565273.3333333334, ans=0.125 2023-10-11 02:48:24,837 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff2.min_abs, batch_count=565273.3333333334, ans=0.1 2023-10-11 02:48:31,702 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=565320.0, ans=0.0 2023-10-11 02:48:41,248 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.40 vs. limit=15.0
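The [scaling.py:979] Whitening records compare a per-module statistic against a limit. One plausible formulation consistent with these numbers (an assumption, not necessarily scaling.py's exact code) is d * trace(C^2) / trace(C)^2 for the per-group channel covariance C over d channels: it is 1.0 when C is a multiple of the identity (fully "white" activations) and approaches d when one direction dominates, which fits the large early metrics in this log decaying toward their limits.

```python
# Assumed whitening metric: 1.0 for perfectly white features, up to d
# (channels per group) when the covariance is rank-1. Names are ours.
import torch

def whitening_metric(x: torch.Tensor, num_groups: int) -> torch.Tensor:
    """x: (num_frames, num_channels), channels split into num_groups groups."""
    n, c = x.shape
    d = c // num_groups
    xg = x.reshape(n, num_groups, d).transpose(0, 1)       # (groups, n, d)
    xg = xg - xg.mean(dim=1, keepdim=True)
    cov = xg.transpose(1, 2) @ xg / n                      # (groups, d, d)
    tr = cov.diagonal(dim1=1, dim2=2).sum(dim=1)           # trace(C)
    tr2 = (cov @ cov).diagonal(dim1=1, dim2=2).sum(dim=1)  # trace(C^2)
    return (d * tr2 / tr.clamp(min=1e-20) ** 2).mean()

x = torch.randn(1000, 384)            # roughly white input
print(float(whitening_metric(x, 1)))  # ~1.0, well under e.g. limit=15.0
```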
2023-10-11 02:48:43,607 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=565366.6666666666, ans=0.125 2023-10-11 02:48:51,120 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=565413.3333333334, ans=0.2 2023-10-11 02:48:55,734 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=565413.3333333334, ans=0.125 2023-10-11 02:48:59,510 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=565460.0, ans=0.125 2023-10-11 02:48:59,588 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-11 02:49:10,771 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.42 vs. limit=15.0 2023-10-11 02:49:53,156 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.410e+02 1.671e+02 1.825e+02 2.037e+02 2.675e+02, threshold=3.650e+02, percent-clipped=0.0 2023-10-11 02:50:06,682 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=565693.3333333334, ans=0.1 2023-10-11 02:50:09,666 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=565740.0, ans=0.2 2023-10-11 02:50:22,702 INFO [train.py:1031] (3/4) Epoch 9, batch 12000, loss[loss=0.2369, simple_loss=0.3182, pruned_loss=0.07785, over 16839.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2961, pruned_loss=0.06022, over 32724799.36 frames. ], batch size: 155, lr: 3.94e-03, grad_scale: 32.0 2023-10-11 02:50:39,969 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=565833.3333333334, ans=0.0 2023-10-11 02:50:40,270 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.71 vs. limit=10.0 2023-10-11 02:50:49,476 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=565880.0, ans=0.0 2023-10-11 02:51:08,860 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=565973.3333333334, ans=0.125 2023-10-11 02:51:16,997 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=565973.3333333334, ans=0.0 2023-10-11 02:51:22,020 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=6.67 vs. limit=15.0
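The ubiquitous [scaling.py:199] ScheduledFloat records print the current value (ans) of a scheduled hyperparameter (dropout probabilities, skip rates, balancer bounds) as a function of batch_count. A minimal sketch of such a schedule, assuming piecewise-linear interpolation between (batch_count, value) breakpoints; the breakpoints below are invented for illustration:

```python
# Minimal piecewise-linear float schedule keyed on batch_count; only the
# mechanism is implied by the log, the breakpoints here are made up.
from bisect import bisect_right

class PiecewiseLinearFloat:
    def __init__(self, *points: tuple[float, float]):
        self.points = sorted(points)            # (batch_count, value) pairs

    def value(self, batch_count: float) -> float:
        xs = [x for x, _ in self.points]
        i = bisect_right(xs, batch_count)
        if i == 0:
            return self.points[0][1]            # before the first breakpoint
        if i == len(self.points):
            return self.points[-1][1]           # after the last breakpoint
        (x0, y0), (x1, y1) = self.points[i - 1], self.points[i]
        return y0 + (batch_count - x0) / (x1 - x0) * (y1 - y0)

# e.g. a skip rate decaying from 0.5 to 0.0 over the first 4000 batches:
skip = PiecewiseLinearFloat((0.0, 0.5), (4000.0, 0.0))
print(skip.value(565973.0))                     # long past the ramp -> 0.0
```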
2023-10-11 02:51:25,832 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=566020.0, ans=0.1 2023-10-11 02:51:46,174 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=566113.3333333334, ans=0.125 2023-10-11 02:51:49,649 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.390e+02 1.662e+02 1.863e+02 2.149e+02 3.184e+02, threshold=3.726e+02, percent-clipped=0.0 2023-10-11 02:51:53,972 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=10.40 vs. limit=22.5 2023-10-11 02:51:57,849 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 02:52:06,967 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.17 vs. limit=15.0 2023-10-11 02:52:08,524 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=566206.6666666666, ans=0.0 2023-10-11 02:52:15,344 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=566206.6666666666, ans=0.2 2023-10-11 02:52:28,776 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=566300.0, ans=0.0 2023-10-11 02:52:32,505 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=566300.0, ans=0.0 2023-10-11 02:52:40,197 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.39 vs. limit=15.0
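The [scaling.py:1069] WithLoss records report an auxiliary penalty attached to the self-attention weights; loss-sum=0.000e+00, as in every such record here, means the penalty is currently inactive. A hedged sketch of one way such a wrapper could work (an assumed mechanism, not scaling.py's actual implementation):

```python
# Assumed mechanism: a wrapper that passes activations through unchanged
# but accumulates a penalty on them, which the trainer logs as loss-sum.
import torch
from torch import nn

class WithAuxLoss(nn.Module):
    def __init__(self, penalty_fn):
        super().__init__()
        self.penalty_fn = penalty_fn
        self.loss_sum = torch.zeros(())          # the figure the log prints

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        self.loss_sum = self.penalty_fn(x)       # side-channel auxiliary loss
        return x                                 # activations are untouched

# e.g. penalize attention weights only where they exceed 0.99:
attn = WithAuxLoss(lambda w: torch.relu(w - 0.99).sum())
w = torch.softmax(torch.randn(4, 10, 10), dim=-1)
_ = attn(w)
print(f"loss-sum={float(attn.loss_sum):.3e}")    # typically 0.000e+00
```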
2023-10-11 02:52:53,529 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=566393.3333333334, ans=0.0 2023-10-11 02:52:55,506 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=566393.3333333334, ans=0.2 2023-10-11 02:52:57,453 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=566393.3333333334, ans=0.125 2023-10-11 02:53:13,438 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=566486.6666666666, ans=0.125 2023-10-11 02:53:15,316 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=566486.6666666666, ans=0.125 2023-10-11 02:53:36,178 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.329e+02 1.622e+02 1.807e+02 2.087e+02 2.790e+02, threshold=3.613e+02, percent-clipped=0.0 2023-10-11 02:53:44,246 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=566626.6666666666, ans=0.125 2023-10-11 02:53:45,317 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=566626.6666666666, ans=0.05 2023-10-11 02:53:49,262 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=566626.6666666666, ans=0.1 2023-10-11 02:54:16,322 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=566766.6666666666, ans=0.025 2023-10-11 02:54:27,793 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=566766.6666666666, ans=0.1 2023-10-11 02:54:37,422 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=566813.3333333334, ans=0.0 2023-10-11 02:54:39,250 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=566860.0, ans=0.125 2023-10-11 02:54:41,117 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=566860.0, ans=0.125 2023-10-11 02:55:04,726 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=566953.3333333334, ans=0.0 2023-10-11 02:55:12,447 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.24 vs. limit=15.0 2023-10-11 02:55:23,657 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=567046.6666666666, ans=0.0 2023-10-11 02:55:29,117 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.407e+02 1.705e+02 1.875e+02 2.192e+02 3.255e+02, threshold=3.750e+02, percent-clipped=0.0 2023-10-11 02:55:47,001 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.68 vs.
limit=6.0 2023-10-11 02:56:03,348 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=567186.6666666666, ans=0.125 2023-10-11 02:56:06,493 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=567186.6666666666, ans=0.1 2023-10-11 02:56:22,017 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=567280.0, ans=0.125 2023-10-11 02:56:22,124 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_na.min_abs, batch_count=567280.0, ans=0.02 2023-10-11 02:56:25,527 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=567280.0, ans=0.07 2023-10-11 02:56:27,162 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=567280.0, ans=0.125 2023-10-11 02:56:27,165 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=567280.0, ans=10.0 2023-10-11 02:56:42,025 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=567326.6666666666, ans=0.0 2023-10-11 02:56:43,119 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=567373.3333333334, ans=0.125 2023-10-11 02:57:19,894 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=567466.6666666666, ans=0.0 2023-10-11 02:57:29,818 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.328e+02 1.718e+02 1.933e+02 2.276e+02 3.451e+02, threshold=3.865e+02, percent-clipped=0.0 2023-10-11 02:58:17,312 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=567700.0, ans=0.125 2023-10-11 02:58:24,957 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=567746.6666666666, ans=0.125 2023-10-11 02:58:50,910 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=567840.0, ans=0.5 2023-10-11 02:58:51,938 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=567840.0, ans=0.125 2023-10-11 02:59:25,274 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.426e+02 1.736e+02 1.910e+02 2.290e+02 4.074e+02, threshold=3.820e+02, percent-clipped=1.0 2023-10-11 02:59:52,166 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=7.58 vs. limit=12.0 2023-10-11 02:59:54,651 INFO [train.py:1031] (3/4) Epoch 9, batch 12500, loss[loss=0.2091, simple_loss=0.3008, pruned_loss=0.0587, over 16896.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2957, pruned_loss=0.06022, over 32725336.11 frames. ], batch size: 93, lr: 3.93e-03, grad_scale: 32.0 2023-10-11 03:00:02,495 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.92 vs. 
limit=12.0 2023-10-11 03:00:02,953 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=568120.0, ans=0.125 2023-10-11 03:00:14,017 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=568166.6666666666, ans=0.0 2023-10-11 03:00:16,421 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=568166.6666666666, ans=0.2 2023-10-11 03:00:34,578 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.60 vs. limit=15.0 2023-10-11 03:00:41,619 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=568306.6666666666, ans=0.125 2023-10-11 03:00:59,066 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=568400.0, ans=0.2 2023-10-11 03:01:00,255 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=568400.0, ans=0.125 2023-10-11 03:01:01,536 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.30 vs. limit=15.0 2023-10-11 03:01:06,190 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=568400.0, ans=0.125 2023-10-11 03:01:10,050 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.14 vs. limit=15.0 2023-10-11 03:01:11,746 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-11 03:01:15,081 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.391e+02 1.718e+02 1.944e+02 2.290e+02 3.137e+02, threshold=3.889e+02, percent-clipped=0.0 2023-10-11 03:01:26,685 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.00 vs. limit=12.0 2023-10-11 03:01:28,900 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=568493.3333333334, ans=0.0 2023-10-11 03:01:35,440 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=568540.0, ans=0.1 2023-10-11 03:01:36,259 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=568540.0, ans=0.2 2023-10-11 03:01:36,395 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=568540.0, ans=0.125 2023-10-11 03:01:39,429 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.80 vs. limit=15.0 2023-10-11 03:01:43,629 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=568586.6666666666, ans=0.0 2023-10-11 03:02:08,789 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.76 vs. 
limit=10.0 2023-10-11 03:02:13,976 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=568680.0, ans=0.2 2023-10-11 03:02:15,157 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=568680.0, ans=0.0 2023-10-11 03:02:20,479 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=568726.6666666666, ans=0.125 2023-10-11 03:02:22,455 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.13 vs. limit=6.0 2023-10-11 03:02:26,099 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.16 vs. limit=22.5 2023-10-11 03:02:31,608 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=568773.3333333334, ans=0.125 2023-10-11 03:02:45,975 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=568820.0, ans=0.2 2023-10-11 03:02:51,562 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=568866.6666666666, ans=0.0 2023-10-11 03:03:08,646 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.279e+02 1.590e+02 1.783e+02 2.035e+02 3.237e+02, threshold=3.565e+02, percent-clipped=0.0 2023-10-11 03:03:14,844 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=568960.0, ans=0.1 2023-10-11 03:03:44,366 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=569053.3333333334, ans=0.125 2023-10-11 03:03:57,536 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=569100.0, ans=0.0 2023-10-11 03:04:03,010 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=569146.6666666666, ans=0.0 2023-10-11 03:04:04,918 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=569146.6666666666, ans=0.125 2023-10-11 03:04:08,904 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 03:04:26,941 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.64 vs. limit=6.0 2023-10-11 03:04:35,738 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.31 vs. limit=15.0 2023-10-11 03:04:38,206 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.85 vs. 
limit=22.5 2023-10-11 03:04:55,774 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=569380.0, ans=0.015 2023-10-11 03:05:01,840 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.400e+02 1.690e+02 1.800e+02 2.047e+02 3.021e+02, threshold=3.599e+02, percent-clipped=0.0 2023-10-11 03:05:09,461 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.79 vs. limit=8.0 2023-10-11 03:05:24,278 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=569473.3333333334, ans=0.125 2023-10-11 03:05:32,147 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=569520.0, ans=0.0 2023-10-11 03:05:45,468 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=569566.6666666666, ans=0.2 2023-10-11 03:05:50,295 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=569613.3333333334, ans=0.0 2023-10-11 03:05:56,612 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.53 vs. limit=15.0 2023-10-11 03:06:21,189 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.06 vs. limit=15.0 2023-10-11 03:06:35,196 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=569753.3333333334, ans=0.125 2023-10-11 03:06:37,456 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=569800.0, ans=0.125 2023-10-11 03:06:42,322 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=569800.0, ans=0.125 2023-10-11 03:06:43,460 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=569800.0, ans=0.125 2023-10-11 03:06:55,758 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.407e+02 1.736e+02 1.913e+02 2.154e+02 3.080e+02, threshold=3.826e+02, percent-clipped=0.0 2023-10-11 03:07:04,993 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=569893.3333333334, ans=0.1 2023-10-11 03:07:05,784 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=569893.3333333334, ans=0.0 2023-10-11 03:07:06,624 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=569893.3333333334, ans=0.0 2023-10-11 03:07:11,540 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=569940.0, ans=0.0 2023-10-11 03:07:34,049 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=570033.3333333334, ans=0.125 2023-10-11 03:07:34,177 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=570033.3333333334, ans=0.025 2023-10-11 03:07:35,344 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, 
batch_count=570033.3333333334, ans=6.0 2023-10-11 03:07:36,829 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=570033.3333333334, ans=0.0 2023-10-11 03:07:45,693 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=10.40 vs. limit=22.5 2023-10-11 03:08:06,942 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=570173.3333333334, ans=0.0 2023-10-11 03:08:14,860 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=570220.0, ans=0.1 2023-10-11 03:08:37,313 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.13 vs. limit=6.0 2023-10-11 03:08:38,041 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=570313.3333333334, ans=0.07 2023-10-11 03:08:43,217 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.390e+02 1.622e+02 1.783e+02 1.931e+02 2.500e+02, threshold=3.567e+02, percent-clipped=0.0 2023-10-11 03:08:48,400 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 03:08:52,200 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=570360.0, ans=0.125 2023-10-11 03:09:00,545 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 03:09:09,306 INFO [train.py:1031] (3/4) Epoch 9, batch 13000, loss[loss=0.2089, simple_loss=0.2918, pruned_loss=0.06303, over 16905.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2967, pruned_loss=0.06047, over 32765831.45 frames. ], batch size: 138, lr: 3.92e-03, grad_scale: 32.0 2023-10-11 03:09:17,609 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=570453.3333333334, ans=0.0 2023-10-11 03:09:18,469 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=570500.0, ans=0.04949747468305833 2023-10-11 03:09:27,548 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=570500.0, ans=0.0 2023-10-11 03:09:43,278 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=570546.6666666666, ans=0.125 2023-10-11 03:09:43,514 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.67 vs. limit=15.0 2023-10-11 03:10:06,061 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=570640.0, ans=0.125 2023-10-11 03:10:06,235 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.59 vs. 
limit=10.0 2023-10-11 03:10:11,491 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=570640.0, ans=0.125 2023-10-11 03:10:19,218 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=570686.6666666666, ans=0.2 2023-10-11 03:10:23,662 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=570686.6666666666, ans=0.0 2023-10-11 03:10:24,624 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=570733.3333333334, ans=0.125 2023-10-11 03:10:45,079 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.310e+02 1.699e+02 1.852e+02 2.104e+02 3.732e+02, threshold=3.704e+02, percent-clipped=1.0 2023-10-11 03:10:49,500 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=570826.6666666666, ans=0.125 2023-10-11 03:11:00,569 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=570873.3333333334, ans=0.025 2023-10-11 03:11:21,778 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.65 vs. limit=10.0 2023-10-11 03:11:32,707 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=571013.3333333334, ans=0.125 2023-10-11 03:11:45,981 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=571060.0, ans=0.125 2023-10-11 03:11:52,835 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 03:11:57,159 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=571106.6666666666, ans=0.0 2023-10-11 03:12:05,114 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=571106.6666666666, ans=0.2 2023-10-11 03:12:22,396 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.95 vs. limit=15.0 2023-10-11 03:12:27,048 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.38 vs. limit=15.0 2023-10-11 03:12:38,685 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.341e+02 1.698e+02 1.954e+02 2.242e+02 3.172e+02, threshold=3.908e+02, percent-clipped=0.0 2023-10-11 03:12:59,873 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=571340.0, ans=0.2 2023-10-11 03:13:00,907 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=571340.0, ans=0.09899494936611666 2023-10-11 03:13:06,403 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.84 vs. 
limit=15.0 2023-10-11 03:13:09,397 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=6.89 vs. limit=15.0 2023-10-11 03:13:30,285 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.65 vs. limit=15.0 2023-10-11 03:13:34,281 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.81 vs. limit=22.5 2023-10-11 03:13:41,814 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=571526.6666666666, ans=0.125 2023-10-11 03:13:46,368 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=571526.6666666666, ans=0.125 2023-10-11 03:13:51,209 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.88 vs. limit=22.5 2023-10-11 03:14:10,945 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=571620.0, ans=0.2 2023-10-11 03:14:13,015 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=571666.6666666666, ans=0.125 2023-10-11 03:14:23,855 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=571666.6666666666, ans=0.1 2023-10-11 03:14:30,309 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=571713.3333333334, ans=0.2 2023-10-11 03:14:31,333 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=571713.3333333334, ans=0.0 2023-10-11 03:14:35,318 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.336e+02 1.623e+02 1.776e+02 1.937e+02 2.839e+02, threshold=3.553e+02, percent-clipped=0.0 2023-10-11 03:14:37,883 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=571760.0, ans=0.125 2023-10-11 03:14:45,046 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=571760.0, ans=0.1 2023-10-11 03:14:55,129 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.58 vs. 
limit=15.0 2023-10-11 03:15:14,038 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=571900.0, ans=0.2 2023-10-11 03:15:18,772 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=571900.0, ans=0.0 2023-10-11 03:15:28,618 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=571946.6666666666, ans=0.0 2023-10-11 03:15:30,707 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=571946.6666666666, ans=0.2 2023-10-11 03:15:38,810 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=571993.3333333334, ans=0.0 2023-10-11 03:15:56,047 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=572086.6666666666, ans=0.125 2023-10-11 03:16:05,947 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=572133.3333333334, ans=0.125 2023-10-11 03:16:06,187 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.79 vs. limit=10.0 2023-10-11 03:16:13,917 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=572133.3333333334, ans=0.1 2023-10-11 03:16:21,466 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=572180.0, ans=0.0 2023-10-11 03:16:24,784 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.478e+02 1.698e+02 1.866e+02 2.146e+02 3.402e+02, threshold=3.731e+02, percent-clipped=0.0 2023-10-11 03:16:33,325 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.41 vs. limit=15.0 2023-10-11 03:16:55,863 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=13.73 vs. limit=22.5 2023-10-11 03:16:59,604 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=572320.0, ans=0.025 2023-10-11 03:17:05,608 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=572366.6666666666, ans=0.0 2023-10-11 03:17:17,606 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.79 vs. 
limit=6.0 2023-10-11 03:17:19,337 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_na.min_abs, batch_count=572413.3333333334, ans=0.02 2023-10-11 03:17:34,944 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=572460.0, ans=0.125 2023-10-11 03:17:38,602 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=572506.6666666666, ans=0.0 2023-10-11 03:17:55,600 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=572553.3333333334, ans=0.2 2023-10-11 03:17:57,302 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=572600.0, ans=0.125 2023-10-11 03:18:00,924 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=572600.0, ans=0.125 2023-10-11 03:18:06,017 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 03:18:18,028 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.402e+02 1.622e+02 1.866e+02 2.124e+02 3.273e+02, threshold=3.731e+02, percent-clipped=0.0 2023-10-11 03:18:24,625 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=572693.3333333334, ans=0.125 2023-10-11 03:18:30,241 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=572740.0, ans=0.1 2023-10-11 03:18:32,113 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.71 vs. limit=6.0 2023-10-11 03:18:41,029 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=572786.6666666666, ans=0.125 2023-10-11 03:18:41,627 INFO [train.py:1031] (3/4) Epoch 9, batch 13500, loss[loss=0.1969, simple_loss=0.2885, pruned_loss=0.05265, over 16817.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2959, pruned_loss=0.0602, over 32773338.40 frames. ], batch size: 98, lr: 3.91e-03, grad_scale: 16.0 2023-10-11 03:18:54,449 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=572833.3333333334, ans=0.125 2023-10-11 03:19:00,222 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=572833.3333333334, ans=0.1 2023-10-11 03:19:03,945 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=572880.0, ans=0.125 2023-10-11 03:19:10,773 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.96 vs. limit=12.0 2023-10-11 03:19:14,248 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.17 vs. limit=22.5 2023-10-11 03:19:14,309 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=13.50 vs. 
limit=15.0 2023-10-11 03:19:19,037 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=572926.6666666666, ans=0.1 2023-10-11 03:19:30,242 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=572973.3333333334, ans=0.09899494936611666 2023-10-11 03:19:39,257 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=573020.0, ans=0.125 2023-10-11 03:20:10,543 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.385e+02 1.631e+02 1.771e+02 2.012e+02 3.098e+02, threshold=3.542e+02, percent-clipped=0.0 2023-10-11 03:20:13,715 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.43 vs. limit=15.0 2023-10-11 03:20:17,943 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.82 vs. limit=22.5 2023-10-11 03:20:55,013 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=573346.6666666666, ans=0.0 2023-10-11 03:21:05,270 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=573393.3333333334, ans=0.125 2023-10-11 03:21:14,043 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=573440.0, ans=0.2 2023-10-11 03:21:56,105 INFO [train.py:1031] (3/4) Epoch 10, batch 0, loss[loss=0.2008, simple_loss=0.2833, pruned_loss=0.05915, over 16849.00 frames. ], tot_loss[loss=0.2008, simple_loss=0.2833, pruned_loss=0.05915, over 16849.00 frames. ], batch size: 175, lr: 3.69e-03, grad_scale: 32.0 2023-10-11 03:21:56,106 INFO [train.py:1054] (3/4) Computing validation loss 2023-10-11 03:22:04,444 INFO [train.py:1063] (3/4) Epoch 10, validation: loss=0.221, simple_loss=0.3086, pruned_loss=0.06676, over 1020973.00 frames. 2023-10-11 03:22:04,444 INFO [train.py:1064] (3/4) Maximum memory allocated so far is 16891MB 2023-10-11 03:22:18,681 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=573556.6666666666, ans=0.125 2023-10-11 03:22:25,017 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=573556.6666666666, ans=0.09899494936611666 2023-10-11 03:22:29,525 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.80 vs. limit=8.0
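At the first batch of each epoch ([train.py:1054] and [train.py:1063] above) the trainer pauses to compute a validation loss over the dev set, then reports peak GPU memory ([train.py:1064]). torch.cuda.max_memory_allocated is the standard PyTorch API behind such a figure; the loop below is a hedged sketch with an assumed model/batch interface, not the actual train.py code, and the frame-weighted average mirrors the "over 1020973.00 frames" bookkeeping in the record above.

```python
# Sketch of epoch-boundary validation plus the peak-memory report.
# `model(batch)` returning (loss, num_frames) is an assumed interface.
import torch

@torch.no_grad()
def compute_validation_loss(model, valid_loader, device) -> float:
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    for batch in valid_loader:
        loss, num_frames = model(batch.to(device))
        tot_loss += float(loss) * num_frames     # frame-weighted sum
        tot_frames += num_frames
    model.train()
    return tot_loss / max(tot_frames, 1.0)

def log_peak_memory(device: torch.device) -> None:
    mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
    print(f"Maximum memory allocated so far is {mb}MB")
```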
2023-10-11 03:22:33,744 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.454e+02 1.797e+02 1.975e+02 2.272e+02 3.905e+02, threshold=3.949e+02, percent-clipped=2.0 2023-10-11 03:22:35,174 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=573603.3333333334, ans=0.1 2023-10-11 03:22:36,883 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=573603.3333333334, ans=0.0 2023-10-11 03:22:57,008 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=573696.6666666666, ans=0.0 2023-10-11 03:23:21,237 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=573790.0, ans=0.125 2023-10-11 03:23:26,722 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=573790.0, ans=0.125 2023-10-11 03:23:26,742 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=573790.0, ans=0.125 2023-10-11 03:23:26,750 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=573790.0, ans=0.125 2023-10-11 03:23:33,912 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=573836.6666666666, ans=0.125 2023-10-11 03:23:49,734 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=573883.3333333334, ans=0.1 2023-10-11 03:24:01,311 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=573930.0, ans=0.2 2023-10-11 03:24:05,639 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.41 vs. limit=15.0 2023-10-11 03:24:30,216 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.319e+02 1.587e+02 1.689e+02 1.848e+02 2.696e+02, threshold=3.378e+02, percent-clipped=0.0 2023-10-11 03:24:35,184 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.80 vs. limit=15.0 2023-10-11 03:24:47,993 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=574116.6666666666, ans=0.125 2023-10-11 03:25:03,512 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=574210.0, ans=0.125 2023-10-11 03:25:09,027 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.58 vs. limit=15.0 2023-10-11 03:25:13,959 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=574256.6666666666, ans=0.125 2023-10-11 03:25:14,946 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=574256.6666666666, ans=0.0 2023-10-11 03:25:26,819 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.82 vs.
limit=15.0 2023-10-11 03:25:57,171 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=574443.3333333334, ans=0.07 2023-10-11 03:25:58,049 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=574443.3333333334, ans=0.0 2023-10-11 03:26:02,788 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.60 vs. limit=22.5 2023-10-11 03:26:12,796 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=574490.0, ans=0.2 2023-10-11 03:26:12,845 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=574490.0, ans=0.125 2023-10-11 03:26:20,387 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.336e+02 1.692e+02 1.849e+02 2.100e+02 3.049e+02, threshold=3.698e+02, percent-clipped=0.0 2023-10-11 03:26:21,632 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=574536.6666666666, ans=0.125 2023-10-11 03:26:22,472 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=574536.6666666666, ans=0.1 2023-10-11 03:26:33,976 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.98 vs. limit=15.0 2023-10-11 03:26:59,242 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=574676.6666666666, ans=0.04949747468305833 2023-10-11 03:27:01,081 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=574676.6666666666, ans=0.1 2023-10-11 03:27:04,921 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=574676.6666666666, ans=0.125 2023-10-11 03:27:06,577 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=574723.3333333334, ans=0.1 2023-10-11 03:27:17,263 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=574770.0, ans=0.0 2023-10-11 03:27:25,939 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=574770.0, ans=0.125 2023-10-11 03:27:30,093 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=574816.6666666666, ans=0.1 2023-10-11 03:27:43,957 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=574863.3333333334, ans=0.0 2023-10-11 03:27:45,322 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=574863.3333333334, ans=0.0 2023-10-11 03:27:53,211 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.35 vs. 
limit=15.0 2023-10-11 03:28:07,403 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=574956.6666666666, ans=0.125 2023-10-11 03:28:13,350 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=575003.3333333334, ans=0.0 2023-10-11 03:28:17,391 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.356e+02 1.682e+02 1.839e+02 2.067e+02 3.304e+02, threshold=3.677e+02, percent-clipped=0.0 2023-10-11 03:28:18,123 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.55 vs. limit=15.0 2023-10-11 03:28:20,100 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=575003.3333333334, ans=0.125 2023-10-11 03:28:24,585 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.01 vs. limit=22.5 2023-10-11 03:28:32,223 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=575050.0, ans=0.125 2023-10-11 03:28:53,552 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=575143.3333333334, ans=0.1 2023-10-11 03:29:03,419 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=575190.0, ans=0.125 2023-10-11 03:29:05,353 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=575190.0, ans=0.1 2023-10-11 03:29:08,820 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=575190.0, ans=0.1 2023-10-11 03:29:16,716 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=575236.6666666666, ans=0.0 2023-10-11 03:29:38,563 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=575330.0, ans=0.125 2023-10-11 03:30:00,155 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=575423.3333333334, ans=0.0 2023-10-11 03:30:05,245 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=575470.0, ans=0.125 2023-10-11 03:30:06,079 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=575470.0, ans=0.1 2023-10-11 03:30:06,980 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=575470.0, ans=0.0 2023-10-11 03:30:08,535 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.474e+02 1.803e+02 1.967e+02 2.167e+02 3.200e+02, threshold=3.935e+02, percent-clipped=0.0 2023-10-11 03:30:10,821 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 03:30:36,591 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.35 vs. 
limit=15.0 2023-10-11 03:31:08,892 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=1.97 vs. limit=15.0 2023-10-11 03:31:29,920 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=575796.6666666666, ans=0.2 2023-10-11 03:31:35,843 INFO [train.py:1031] (3/4) Epoch 10, batch 500, loss[loss=0.1943, simple_loss=0.2811, pruned_loss=0.05375, over 16644.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2958, pruned_loss=0.06012, over 7282621.04 frames. ], batch size: 241, lr: 3.68e-03, grad_scale: 32.0 2023-10-11 03:31:52,627 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.12 vs. limit=6.0 2023-10-11 03:32:02,618 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.372e+02 1.738e+02 1.898e+02 2.105e+02 2.822e+02, threshold=3.796e+02, percent-clipped=0.0 2023-10-11 03:32:02,903 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=575936.6666666666, ans=0.04949747468305833 2023-10-11 03:32:07,227 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=6.50 vs. limit=15.0 2023-10-11 03:32:15,675 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.21 vs. limit=22.5 2023-10-11 03:32:21,694 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=576030.0, ans=0.0 2023-10-11 03:32:24,162 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.73 vs. limit=22.5 2023-10-11 03:32:26,513 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=576030.0, ans=0.125 2023-10-11 03:32:27,479 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=576030.0, ans=0.0 2023-10-11 03:32:27,744 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.12 vs. 
limit=15.0 2023-10-11 03:32:46,533 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=576123.3333333334, ans=0.125 2023-10-11 03:32:56,762 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=576170.0, ans=0.125 2023-10-11 03:32:57,749 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=576170.0, ans=0.2 2023-10-11 03:33:02,476 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=576170.0, ans=0.125 2023-10-11 03:33:42,062 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=576356.6666666666, ans=0.125 2023-10-11 03:33:52,954 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.493e+02 1.743e+02 1.919e+02 2.261e+02 3.270e+02, threshold=3.838e+02, percent-clipped=0.0 2023-10-11 03:34:05,692 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=576450.0, ans=0.125 2023-10-11 03:34:10,792 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=576450.0, ans=0.1 2023-10-11 03:34:31,156 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=576543.3333333334, ans=0.0 2023-10-11 03:34:33,356 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=576590.0, ans=0.125 2023-10-11 03:34:55,222 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.93 vs. limit=15.0 2023-10-11 03:34:56,568 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=576683.3333333334, ans=0.125 2023-10-11 03:35:01,309 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=576683.3333333334, ans=0.0 2023-10-11 03:35:24,512 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=576776.6666666666, ans=0.0 2023-10-11 03:35:27,095 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=576823.3333333334, ans=0.0 2023-10-11 03:35:38,433 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.46 vs. 
limit=15.0 2023-10-11 03:35:41,207 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.481e+02 1.830e+02 2.020e+02 2.223e+02 3.472e+02, threshold=4.041e+02, percent-clipped=0.0 2023-10-11 03:35:41,495 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=576870.0, ans=0.2 2023-10-11 03:35:53,066 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=576916.6666666666, ans=0.0 2023-10-11 03:36:00,101 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=576916.6666666666, ans=0.125 2023-10-11 03:36:05,953 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 03:36:15,125 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=577010.0, ans=0.0 2023-10-11 03:36:18,897 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=577010.0, ans=0.125 2023-10-11 03:36:41,443 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=577103.3333333334, ans=0.1 2023-10-11 03:36:47,009 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=8.43 vs. limit=12.0 2023-10-11 03:37:16,616 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=577243.3333333334, ans=0.1 2023-10-11 03:37:36,777 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.387e+02 1.726e+02 1.900e+02 2.168e+02 3.407e+02, threshold=3.799e+02, percent-clipped=0.0 2023-10-11 03:37:47,992 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=577383.3333333334, ans=0.2 2023-10-11 03:38:06,327 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=577430.0, ans=0.2 2023-10-11 03:38:18,017 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=577476.6666666666, ans=15.0 2023-10-11 03:38:49,864 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=577616.6666666666, ans=0.125 2023-10-11 03:38:49,985 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.08 vs. limit=15.0 2023-10-11 03:38:54,638 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=577616.6666666666, ans=0.0 2023-10-11 03:38:58,466 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-11 03:39:05,892 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=577663.3333333334, ans=0.125 2023-10-11 03:39:07,166 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.10 vs. 
limit=6.0 2023-10-11 03:39:07,956 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=577710.0, ans=0.125 2023-10-11 03:39:19,999 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.46 vs. limit=10.0 2023-10-11 03:39:31,657 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=577803.3333333334, ans=0.1 2023-10-11 03:39:32,199 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.413e+02 1.731e+02 1.961e+02 2.357e+02 3.371e+02, threshold=3.922e+02, percent-clipped=0.0 2023-10-11 03:39:36,766 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=577803.3333333334, ans=0.0 2023-10-11 03:39:42,704 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=577850.0, ans=0.0 2023-10-11 03:39:43,955 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.85 vs. limit=10.0 2023-10-11 03:40:20,258 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=577990.0, ans=0.125 2023-10-11 03:40:34,326 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=578036.6666666666, ans=0.1 2023-10-11 03:40:49,028 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.04 vs. limit=15.0 2023-10-11 03:40:59,206 INFO [train.py:1031] (3/4) Epoch 10, batch 1000, loss[loss=0.2093, simple_loss=0.2951, pruned_loss=0.0618, over 16540.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2967, pruned_loss=0.06082, over 12909210.05 frames. ], batch size: 50, lr: 3.68e-03, grad_scale: 32.0 2023-10-11 03:41:02,716 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.76 vs. limit=15.0 2023-10-11 03:41:06,871 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.19 vs. limit=15.0 2023-10-11 03:41:07,840 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.03 vs. limit=15.0 2023-10-11 03:41:11,794 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=578223.3333333334, ans=0.0 2023-10-11 03:41:19,975 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=578270.0, ans=0.1 2023-10-11 03:41:23,516 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.186e+02 1.625e+02 1.755e+02 1.937e+02 2.663e+02, threshold=3.510e+02, percent-clipped=0.0 2023-10-11 03:41:24,156 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.56 vs. 
limit=12.0 2023-10-11 03:41:27,846 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=578270.0, ans=0.125 2023-10-11 03:41:34,254 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff3.min_abs, batch_count=578316.6666666666, ans=0.2 2023-10-11 03:41:39,081 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=578316.6666666666, ans=0.0 2023-10-11 03:41:43,828 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=578363.3333333334, ans=0.0 2023-10-11 03:41:44,739 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=578363.3333333334, ans=0.125 2023-10-11 03:41:44,892 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.15 vs. limit=15.0 2023-10-11 03:42:04,956 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=578456.6666666666, ans=0.07 2023-10-11 03:42:09,794 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=578456.6666666666, ans=0.1 2023-10-11 03:42:17,109 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=578503.3333333334, ans=15.0 2023-10-11 03:42:21,305 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=578503.3333333334, ans=0.125 2023-10-11 03:42:27,028 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=578550.0, ans=0.0 2023-10-11 03:42:34,815 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.20 vs. limit=15.0 2023-10-11 03:42:51,083 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 03:43:00,602 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.35 vs. limit=6.0 2023-10-11 03:43:05,187 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=578690.0, ans=0.0 2023-10-11 03:43:15,012 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.466e+02 1.799e+02 2.083e+02 2.378e+02 3.396e+02, threshold=4.166e+02, percent-clipped=0.0 2023-10-11 03:43:23,597 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=578783.3333333334, ans=0.2 2023-10-11 03:43:26,744 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.57 vs. 
limit=15.0 2023-10-11 03:43:33,799 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=578830.0, ans=0.125 2023-10-11 03:43:49,176 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=578876.6666666666, ans=0.125 2023-10-11 03:44:14,952 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=578970.0, ans=0.125 2023-10-11 03:44:36,275 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=579016.6666666666, ans=0.1 2023-10-11 03:44:39,028 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=579063.3333333334, ans=0.125 2023-10-11 03:45:14,388 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 03:45:16,103 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.252e+02 1.588e+02 1.788e+02 2.035e+02 3.161e+02, threshold=3.576e+02, percent-clipped=0.0 2023-10-11 03:45:21,974 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=579203.3333333334, ans=0.125 2023-10-11 03:45:24,793 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.07 vs. limit=15.0 2023-10-11 03:45:44,784 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=579296.6666666666, ans=0.2 2023-10-11 03:45:59,460 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=579390.0, ans=0.1 2023-10-11 03:46:00,625 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=579390.0, ans=0.0 2023-10-11 03:46:03,630 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=579390.0, ans=0.125 2023-10-11 03:46:18,118 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=579483.3333333334, ans=0.0 2023-10-11 03:46:33,302 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=579530.0, ans=0.125 2023-10-11 03:46:35,680 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 03:46:44,710 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.88 vs. limit=22.5 2023-10-11 03:46:47,390 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=579576.6666666666, ans=0.2 2023-10-11 03:46:55,681 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.46 vs. 
limit=15.0 2023-10-11 03:47:06,698 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.308e+02 1.715e+02 1.931e+02 2.139e+02 3.229e+02, threshold=3.862e+02, percent-clipped=0.0 2023-10-11 03:47:15,836 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=579716.6666666666, ans=0.5 2023-10-11 03:47:26,489 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=579763.3333333334, ans=0.125 2023-10-11 03:47:33,262 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=579810.0, ans=0.125 2023-10-11 03:47:40,355 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=579810.0, ans=0.125 2023-10-11 03:47:48,176 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.77 vs. limit=22.5 2023-10-11 03:47:59,738 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=579903.3333333334, ans=0.125 2023-10-11 03:48:33,689 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=580043.3333333334, ans=0.125 2023-10-11 03:48:55,111 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.445e+02 1.736e+02 1.988e+02 2.305e+02 3.684e+02, threshold=3.976e+02, percent-clipped=0.0 2023-10-11 03:48:59,503 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.67 vs. limit=15.0 2023-10-11 03:49:00,820 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=580136.6666666666, ans=0.1 2023-10-11 03:49:02,564 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=580136.6666666666, ans=0.125 2023-10-11 03:49:18,161 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=580230.0, ans=0.125 2023-10-11 03:49:31,686 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=580276.6666666666, ans=0.0 2023-10-11 03:49:35,796 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.90 vs. limit=12.0 2023-10-11 03:49:54,707 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=580370.0, ans=0.0 2023-10-11 03:50:08,268 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=580416.6666666666, ans=0.07 2023-10-11 03:50:18,169 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.27 vs. limit=15.0 2023-10-11 03:50:27,227 INFO [train.py:1031] (3/4) Epoch 10, batch 1500, loss[loss=0.26, simple_loss=0.3198, pruned_loss=0.1001, over 15629.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2949, pruned_loss=0.05985, over 17327899.08 frames. 
], batch size: 350, lr: 3.67e-03, grad_scale: 32.0 2023-10-11 03:50:36,505 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=580510.0, ans=0.125 2023-10-11 03:50:52,075 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=580603.3333333334, ans=0.125 2023-10-11 03:50:52,321 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.33 vs. limit=15.0 2023-10-11 03:50:55,184 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.317e+02 1.652e+02 1.830e+02 2.080e+02 3.313e+02, threshold=3.659e+02, percent-clipped=0.0 2023-10-11 03:51:12,951 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=580650.0, ans=0.125 2023-10-11 03:51:16,943 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.52 vs. limit=15.0 2023-10-11 03:51:28,050 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=580743.3333333334, ans=0.125 2023-10-11 03:51:34,742 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=580743.3333333334, ans=0.09899494936611666 2023-10-11 03:51:45,038 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=580790.0, ans=0.125 2023-10-11 03:51:57,505 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=580836.6666666666, ans=0.2 2023-10-11 03:51:59,163 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.19 vs. limit=8.0 2023-10-11 03:51:59,996 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=580836.6666666666, ans=0.0 2023-10-11 03:52:03,374 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=580883.3333333334, ans=0.0 2023-10-11 03:52:19,336 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.92 vs. 
limit=15.0 2023-10-11 03:52:26,290 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=580976.6666666666, ans=0.125 2023-10-11 03:52:43,651 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=581023.3333333334, ans=0.0 2023-10-11 03:52:48,808 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.410e+02 1.656e+02 1.897e+02 2.087e+02 2.611e+02, threshold=3.793e+02, percent-clipped=0.0 2023-10-11 03:52:51,621 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=581070.0, ans=0.125 2023-10-11 03:53:04,664 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=581116.6666666666, ans=0.0 2023-10-11 03:53:24,146 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=581163.3333333334, ans=0.1 2023-10-11 03:53:31,318 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=581210.0, ans=0.0 2023-10-11 03:53:46,698 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=581256.6666666666, ans=0.0 2023-10-11 03:53:56,827 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.62 vs. limit=15.0 2023-10-11 03:54:02,403 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=581350.0, ans=0.2 2023-10-11 03:54:04,932 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=581350.0, ans=0.0 2023-10-11 03:54:14,677 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=581396.6666666666, ans=0.1 2023-10-11 03:54:30,867 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=581490.0, ans=0.125 2023-10-11 03:54:33,194 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=581490.0, ans=0.125 2023-10-11 03:54:34,156 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 03:54:41,636 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=581536.6666666666, ans=0.1 2023-10-11 03:54:44,277 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=581536.6666666666, ans=0.5 2023-10-11 03:54:44,880 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.451e+02 1.719e+02 1.933e+02 2.196e+02 3.215e+02, threshold=3.867e+02, percent-clipped=0.0 2023-10-11 03:54:50,558 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=581536.6666666666, ans=0.0 2023-10-11 03:55:13,599 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=581676.6666666666, ans=0.125 2023-10-11 03:55:14,795 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=581676.6666666666, ans=0.1 2023-10-11 03:55:24,298 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.58 vs. limit=22.5 2023-10-11 03:55:28,149 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=581723.3333333334, ans=0.125 2023-10-11 03:56:24,801 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.98 vs. limit=15.0 2023-10-11 03:56:25,997 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.87 vs. limit=22.5 2023-10-11 03:56:33,174 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=4.96 vs. limit=15.0 2023-10-11 03:56:40,099 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.399e+02 1.658e+02 1.809e+02 1.985e+02 2.758e+02, threshold=3.618e+02, percent-clipped=0.0 2023-10-11 03:56:47,154 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=582050.0, ans=0.1 2023-10-11 03:57:05,553 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.81 vs. limit=10.0 2023-10-11 03:57:09,107 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 03:57:28,580 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=582190.0, ans=0.1 2023-10-11 03:57:38,125 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-11 03:57:44,087 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=11.79 vs. limit=22.5 2023-10-11 03:57:48,464 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.68 vs. limit=6.0 2023-10-11 03:57:56,140 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.72 vs. limit=6.0 2023-10-11 03:58:12,487 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=582376.6666666666, ans=0.0 2023-10-11 03:58:18,169 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.66 vs. 
limit=15.0 2023-10-11 03:58:24,628 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer_ff3.min_abs, batch_count=582470.0, ans=0.2 2023-10-11 03:58:26,903 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=582470.0, ans=0.125 2023-10-11 03:58:27,574 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.425e+02 1.702e+02 1.875e+02 2.183e+02 2.801e+02, threshold=3.749e+02, percent-clipped=0.0 2023-10-11 03:58:29,355 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=582470.0, ans=0.125 2023-10-11 03:58:51,976 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=582563.3333333334, ans=0.1 2023-10-11 03:59:10,039 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=582610.0, ans=0.125 2023-10-11 03:59:17,743 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.41 vs. limit=22.5 2023-10-11 03:59:19,004 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=582656.6666666666, ans=0.0 2023-10-11 03:59:46,512 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=582750.0, ans=0.125 2023-10-11 04:00:03,707 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=582796.6666666666, ans=0.125 2023-10-11 04:00:10,828 INFO [train.py:1031] (3/4) Epoch 10, batch 2000, loss[loss=0.1927, simple_loss=0.2947, pruned_loss=0.0453, over 16805.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2956, pruned_loss=0.05982, over 20754757.96 frames. ], batch size: 87, lr: 3.66e-03, grad_scale: 64.0 2023-10-11 04:00:13,867 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.88 vs. limit=22.5 2023-10-11 04:00:23,594 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=582890.0, ans=0.0 2023-10-11 04:00:33,390 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=582890.0, ans=0.2 2023-10-11 04:00:39,163 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=582936.6666666666, ans=0.125 2023-10-11 04:00:39,810 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.337e+02 1.755e+02 1.912e+02 2.307e+02 3.265e+02, threshold=3.824e+02, percent-clipped=0.0 2023-10-11 04:00:46,177 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=582936.6666666666, ans=0.125 2023-10-11 04:01:04,338 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 04:01:08,228 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=583030.0, ans=0.2 2023-10-11 04:01:20,778 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.44 vs. 
limit=15.0 2023-10-11 04:01:32,087 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=583123.3333333334, ans=0.0 2023-10-11 04:01:41,500 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=583170.0, ans=0.125 2023-10-11 04:01:51,325 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 04:02:26,950 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=583310.0, ans=0.125 2023-10-11 04:02:32,695 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=583310.0, ans=0.2 2023-10-11 04:02:49,186 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=583356.6666666666, ans=0.125 2023-10-11 04:03:04,969 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.284e+02 1.588e+02 1.730e+02 1.939e+02 2.885e+02, threshold=3.461e+02, percent-clipped=0.0 2023-10-11 04:03:05,296 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=583403.3333333334, ans=0.1 2023-10-11 04:03:24,578 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=583450.0, ans=0.125 2023-10-11 04:03:29,891 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.13 vs. limit=6.0 2023-10-11 04:04:01,150 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=583636.6666666666, ans=0.0 2023-10-11 04:04:12,864 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=583683.3333333334, ans=0.1 2023-10-11 04:04:23,576 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=583683.3333333334, ans=0.125 2023-10-11 04:04:39,884 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=583776.6666666666, ans=0.125 2023-10-11 04:04:44,600 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=583776.6666666666, ans=0.125 2023-10-11 04:04:53,233 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=583823.3333333334, ans=0.0 2023-10-11 04:04:54,975 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=583823.3333333334, ans=0.1 2023-10-11 04:04:57,981 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=583870.0, ans=0.2 2023-10-11 04:05:01,460 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.339e+02 1.720e+02 1.865e+02 2.049e+02 3.044e+02, threshold=3.729e+02, percent-clipped=0.0 2023-10-11 04:05:05,110 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.53 vs. 
limit=10.0 2023-10-11 04:05:07,659 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=583870.0, ans=0.0 2023-10-11 04:05:11,769 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=3.90 vs. limit=12.0 2023-10-11 04:05:15,301 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=583916.6666666666, ans=0.0 2023-10-11 04:05:22,520 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=583963.3333333334, ans=0.0 2023-10-11 04:05:23,486 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=583963.3333333334, ans=0.125 2023-10-11 04:05:28,474 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=583963.3333333334, ans=0.125 2023-10-11 04:05:29,616 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.86 vs. limit=15.0 2023-10-11 04:05:41,980 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=584056.6666666666, ans=0.125 2023-10-11 04:05:51,956 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=584103.3333333334, ans=0.1 2023-10-11 04:06:14,165 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=584196.6666666666, ans=0.0 2023-10-11 04:06:18,832 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=584196.6666666666, ans=0.125 2023-10-11 04:06:23,631 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=584243.3333333334, ans=0.2 2023-10-11 04:06:34,852 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=584290.0, ans=0.125 2023-10-11 04:06:48,198 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.14 vs. 
limit=15.0 2023-10-11 04:06:48,662 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.462e+02 1.723e+02 2.011e+02 2.239e+02 3.501e+02, threshold=4.023e+02, percent-clipped=0.0 2023-10-11 04:07:19,217 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=584476.6666666666, ans=0.0 2023-10-11 04:07:21,901 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=584476.6666666666, ans=0.125 2023-10-11 04:07:24,419 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=2.556e-03 2023-10-11 04:07:39,529 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=584523.3333333334, ans=0.125 2023-10-11 04:07:58,108 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=584616.6666666666, ans=0.125 2023-10-11 04:08:02,346 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=584663.3333333334, ans=0.125 2023-10-11 04:08:11,317 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=584663.3333333334, ans=0.125 2023-10-11 04:08:12,677 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.86 vs. limit=8.0 2023-10-11 04:08:22,749 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.00 vs. limit=15.0 2023-10-11 04:08:26,253 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=584756.6666666666, ans=0.0 2023-10-11 04:08:37,027 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=584803.3333333334, ans=0.1 2023-10-11 04:08:41,241 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.454e+02 1.704e+02 2.016e+02 2.293e+02 3.044e+02, threshold=4.032e+02, percent-clipped=0.0 2023-10-11 04:08:46,411 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=584850.0, ans=0.0 2023-10-11 04:08:46,768 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.23 vs. limit=22.5 2023-10-11 04:09:08,202 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.98 vs. limit=15.0 2023-10-11 04:09:20,226 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=584990.0, ans=0.1 2023-10-11 04:09:20,230 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=584990.0, ans=0.1 2023-10-11 04:09:26,757 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.15 vs. limit=15.0 2023-10-11 04:09:35,523 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.00 vs. 
limit=22.5 2023-10-11 04:09:42,004 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.86 vs. limit=15.0 2023-10-11 04:09:59,416 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=585130.0, ans=0.125 2023-10-11 04:09:59,512 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=585130.0, ans=0.1 2023-10-11 04:10:02,513 INFO [train.py:1031] (3/4) Epoch 10, batch 2500, loss[loss=0.1836, simple_loss=0.2806, pruned_loss=0.04333, over 16921.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2956, pruned_loss=0.06006, over 23403780.08 frames. ], batch size: 82, lr: 3.65e-03, grad_scale: 32.0 2023-10-11 04:10:22,262 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=585270.0, ans=0.125 2023-10-11 04:10:23,087 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=585270.0, ans=0.125 2023-10-11 04:10:27,857 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.429e+02 1.703e+02 1.941e+02 2.217e+02 3.338e+02, threshold=3.882e+02, percent-clipped=0.0 2023-10-11 04:10:31,186 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 04:10:36,316 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=585316.6666666666, ans=0.0 2023-10-11 04:10:39,107 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=585316.6666666666, ans=0.2 2023-10-11 04:10:41,307 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.33 vs. limit=10.0 2023-10-11 04:10:41,822 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=585316.6666666666, ans=10.0 2023-10-11 04:10:49,544 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.85 vs. limit=15.0 2023-10-11 04:11:00,987 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.39 vs. limit=15.0 2023-10-11 04:11:52,822 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 04:11:54,006 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.83 vs. 
limit=15.0 2023-10-11 04:12:09,985 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=585690.0, ans=0.2 2023-10-11 04:12:16,226 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=585736.6666666666, ans=0.2 2023-10-11 04:12:17,811 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.427e+02 1.775e+02 1.953e+02 2.193e+02 3.268e+02, threshold=3.905e+02, percent-clipped=0.0 2023-10-11 04:12:22,980 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=585736.6666666666, ans=0.2 2023-10-11 04:12:28,394 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=585783.3333333334, ans=0.125 2023-10-11 04:12:32,937 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=585783.3333333334, ans=0.125 2023-10-11 04:12:48,008 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=585876.6666666666, ans=0.2 2023-10-11 04:12:53,574 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=585876.6666666666, ans=0.0 2023-10-11 04:12:54,813 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.05 vs. limit=10.0 2023-10-11 04:12:55,508 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=585923.3333333334, ans=0.125 2023-10-11 04:13:02,203 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=585923.3333333334, ans=0.2 2023-10-11 04:13:09,817 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=585970.0, ans=0.125 2023-10-11 04:13:10,180 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.89 vs. limit=15.0 2023-10-11 04:13:15,558 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=585970.0, ans=0.125 2023-10-11 04:13:23,385 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=586016.6666666666, ans=0.125 2023-10-11 04:13:43,905 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=586110.0, ans=0.0 2023-10-11 04:13:57,616 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.01 vs. limit=15.0 2023-10-11 04:14:01,013 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=586156.6666666666, ans=0.125 2023-10-11 04:14:02,182 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.10 vs. 
limit=15.0 2023-10-11 04:14:06,564 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=586203.3333333334, ans=0.1 2023-10-11 04:14:08,846 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=586203.3333333334, ans=0.0 2023-10-11 04:14:11,279 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.392e+02 1.794e+02 1.980e+02 2.305e+02 3.152e+02, threshold=3.959e+02, percent-clipped=0.0 2023-10-11 04:14:19,716 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=586250.0, ans=0.2 2023-10-11 04:14:29,842 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=586296.6666666666, ans=0.1 2023-10-11 04:14:40,201 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.04 vs. limit=15.0 2023-10-11 04:14:43,987 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=586343.3333333334, ans=0.125 2023-10-11 04:14:46,557 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=586343.3333333334, ans=0.125 2023-10-11 04:14:50,993 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=586343.3333333334, ans=0.125 2023-10-11 04:15:29,242 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=586483.3333333334, ans=0.125 2023-10-11 04:15:40,772 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=586530.0, ans=0.1 2023-10-11 04:16:04,383 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=586670.0, ans=0.0 2023-10-11 04:16:08,961 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.242e+02 1.697e+02 1.881e+02 2.195e+02 4.343e+02, threshold=3.761e+02, percent-clipped=2.0 2023-10-11 04:16:30,406 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.20 vs. limit=15.0 2023-10-11 04:17:14,325 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=586903.3333333334, ans=0.1 2023-10-11 04:17:16,213 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=586903.3333333334, ans=0.1 2023-10-11 04:17:27,124 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=586950.0, ans=0.0 2023-10-11 04:17:33,042 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=586996.6666666666, ans=0.0 2023-10-11 04:17:33,544 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.66 vs. 
limit=22.5 2023-10-11 04:17:39,904 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=586996.6666666666, ans=0.125 2023-10-11 04:17:40,008 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=586996.6666666666, ans=0.1 2023-10-11 04:17:58,654 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=587090.0, ans=0.2 2023-10-11 04:18:02,437 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=587090.0, ans=0.0 2023-10-11 04:18:13,714 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.399e+02 1.671e+02 1.887e+02 2.150e+02 2.980e+02, threshold=3.774e+02, percent-clipped=0.0 2023-10-11 04:18:21,629 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=587183.3333333334, ans=0.1 2023-10-11 04:18:24,768 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=587183.3333333334, ans=0.125 2023-10-11 04:18:25,916 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=587183.3333333334, ans=0.0 2023-10-11 04:18:44,404 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=587276.6666666666, ans=0.125 2023-10-11 04:18:55,026 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.71 vs. limit=15.0 2023-10-11 04:19:05,599 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=587370.0, ans=0.1 2023-10-11 04:19:21,784 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=587416.6666666666, ans=0.125 2023-10-11 04:19:32,265 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=587463.3333333334, ans=0.125 2023-10-11 04:19:34,728 INFO [train.py:1031] (3/4) Epoch 10, batch 3000, loss[loss=0.2069, simple_loss=0.2678, pruned_loss=0.07298, over 12620.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2948, pruned_loss=0.05986, over 25510323.83 frames. 
], batch size: 440, lr: 3.65e-03, grad_scale: 32.0 2023-10-11 04:20:01,325 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.399e+02 1.750e+02 2.010e+02 2.236e+02 3.709e+02, threshold=4.019e+02, percent-clipped=0.0 2023-10-11 04:20:14,275 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=587650.0, ans=0.2 2023-10-11 04:20:17,752 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=587696.6666666666, ans=0.0 2023-10-11 04:20:20,829 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=587696.6666666666, ans=0.125 2023-10-11 04:20:21,791 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=587696.6666666666, ans=0.0 2023-10-11 04:20:26,736 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=587696.6666666666, ans=0.0 2023-10-11 04:20:34,090 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.09 vs. limit=15.0 2023-10-11 04:20:44,636 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.48 vs. limit=6.0 2023-10-11 04:20:48,927 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.52 vs. limit=22.5 2023-10-11 04:21:06,565 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=587883.3333333334, ans=0.2 2023-10-11 04:21:13,453 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=587930.0, ans=0.125 2023-10-11 04:21:27,348 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=587976.6666666666, ans=0.125 2023-10-11 04:21:41,734 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=588023.3333333334, ans=0.1 2023-10-11 04:21:42,752 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=588023.3333333334, ans=0.0 2023-10-11 04:21:51,240 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=588023.3333333334, ans=0.125 2023-10-11 04:21:58,006 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.423e+02 1.718e+02 1.875e+02 2.075e+02 3.232e+02, threshold=3.750e+02, percent-clipped=0.0 2023-10-11 04:22:00,173 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=588070.0, ans=0.125 2023-10-11 04:22:09,979 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=588116.6666666666, ans=15.0 2023-10-11 04:22:10,540 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=588116.6666666666, ans=0.0 2023-10-11 04:22:10,553 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=588116.6666666666, ans=0.125 2023-10-11 04:22:13,299 INFO 
[scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=588163.3333333334, ans=0.04949747468305833 2023-10-11 04:22:23,401 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=588210.0, ans=0.025 2023-10-11 04:22:30,526 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 04:22:30,656 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=588210.0, ans=0.0 2023-10-11 04:23:11,019 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=588396.6666666666, ans=0.95 2023-10-11 04:23:17,695 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=588396.6666666666, ans=0.125 2023-10-11 04:23:31,159 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.98 vs. limit=12.0 2023-10-11 04:23:32,802 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=588490.0, ans=0.0 2023-10-11 04:23:36,274 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.76 vs. limit=15.0 2023-10-11 04:23:38,477 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=588490.0, ans=0.1 2023-10-11 04:23:44,418 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=588536.6666666666, ans=0.0 2023-10-11 04:23:49,675 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.290e+02 1.647e+02 1.817e+02 1.993e+02 2.964e+02, threshold=3.634e+02, percent-clipped=0.0 2023-10-11 04:24:20,902 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=588630.0, ans=0.125 2023-10-11 04:24:31,311 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=588676.6666666666, ans=0.125 2023-10-11 04:25:24,620 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=588910.0, ans=0.0 2023-10-11 04:25:51,532 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.440e+02 1.756e+02 1.920e+02 2.202e+02 2.872e+02, threshold=3.839e+02, percent-clipped=0.0 2023-10-11 04:25:54,498 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.64 vs. limit=15.0 2023-10-11 04:26:03,484 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.44 vs. limit=22.5 2023-10-11 04:26:11,598 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=12.38 vs. 
limit=15.0
2023-10-11 04:26:12,072 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=589096.6666666666, ans=0.125
2023-10-11 04:26:22,531 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=589143.3333333334, ans=0.1
2023-10-11 04:26:30,459 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=589190.0, ans=0.05
2023-10-11 04:26:39,237 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=589190.0, ans=0.125
2023-10-11 04:26:52,560 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=589236.6666666666, ans=0.1
2023-10-11 04:27:00,451 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=589283.3333333334, ans=0.125
2023-10-11 04:27:15,295 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=589330.0, ans=0.1
2023-10-11 04:27:28,938 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=589423.3333333334, ans=0.125
2023-10-11 04:27:47,386 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.460e+02 1.679e+02 1.845e+02 2.040e+02 3.311e+02, threshold=3.690e+02, percent-clipped=0.0
2023-10-11 04:27:50,616 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=589470.0, ans=0.125
2023-10-11 04:28:01,691 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=589516.6666666666, ans=0.0
2023-10-11 04:28:07,677 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=589563.3333333334, ans=0.0
2023-10-11 04:28:15,569 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=589610.0, ans=0.025
2023-10-11 04:28:28,369 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-11 04:28:33,977 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=589656.6666666666, ans=0.125
2023-10-11 04:28:45,517 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.44 vs. limit=15.0
2023-10-11 04:28:58,642 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=589750.0, ans=0.0
2023-10-11 04:28:58,903 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.98 vs. limit=15.0
2023-10-11 04:29:01,400 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=589796.6666666666, ans=0.0
2023-10-11 04:29:12,722 INFO [train.py:1031] (3/4) Epoch 10, batch 3500, loss[loss=0.2151, simple_loss=0.3073, pruned_loss=0.06144, over 16875.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2946, pruned_loss=0.05994, over 27109341.23 frames. ], batch size: 116, lr: 3.64e-03, grad_scale: 32.0
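A quick consistency check on the progress line above: assuming the configured simple_loss_scale of 0.5 and a pruned-loss scale that has ramped to 1.0 by epoch 10, the reported numbers satisfy loss = 0.5 * simple_loss + pruned_loss (0.5 * 0.3073 + 0.06144 = 0.2151 for the current batch, and the same identity holds for the tot_loss fields). A minimal check under that assumed weighting:

# Sanity check of the logged loss decomposition (weighting assumed, see above).
simple_loss_scale = 0.5
for loss, simple, pruned in [
    (0.2151, 0.3073, 0.06144),  # batch 3500, current batch
    (0.2072, 0.2946, 0.05994),  # batch 3500, tot_loss
]:
    assert abs(loss - (simple_loss_scale * simple + pruned)) < 5e-4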
2023-10-11 04:29:22,630 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=589843.3333333334, ans=0.0
2023-10-11 04:29:29,033 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=589890.0, ans=0.125
2023-10-11 04:29:29,363 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.80 vs. limit=15.0
2023-10-11 04:29:40,770 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.395e+02 1.691e+02 1.840e+02 2.001e+02 2.941e+02, threshold=3.681e+02, percent-clipped=0.0
2023-10-11 04:29:46,284 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=589983.3333333334, ans=0.125
2023-10-11 04:30:06,711 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=590030.0, ans=0.125
2023-10-11 04:30:11,614 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=590076.6666666666, ans=0.125
2023-10-11 04:30:14,375 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=590076.6666666666, ans=0.125
2023-10-11 04:30:20,041 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=590123.3333333334, ans=0.5
2023-10-11 04:30:20,279 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.28 vs. limit=10.0
2023-10-11 04:30:45,869 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=590170.0, ans=0.2
2023-10-11 04:30:56,809 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=590216.6666666666, ans=0.0
2023-10-11 04:30:59,161 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten.whitening_limit, batch_count=590216.6666666666, ans=22.5
2023-10-11 04:31:20,637 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_na.min_abs, batch_count=590310.0, ans=0.02
2023-10-11 04:31:22,166 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=590310.0, ans=0.0
2023-10-11 04:31:30,829 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=590356.6666666666, ans=0.05
2023-10-11 04:31:35,931 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.10 vs. limit=12.0
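The optim.py:471 entries summarise the distribution of recently observed gradient norms as five values (min, 25th percentile, median, 75th percentile, max); the clipping threshold is Clipping_scale times the median, e.g. 2.0 * 1.840e+02 = 3.681e+02 in the entry at 04:29:40, and percent-clipped is the share of batches whose norm exceeded it. A minimal sketch of that scheme, for illustration only (the actual implementation is in icefall's optim.py and differs in detail):

from collections import deque
import statistics
import torch

class MedianGradClipper:
    # Clip at clipping_scale * median of recently observed gradient norms.
    def __init__(self, clipping_scale=2.0, history=1000):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=history)
        self.clipped = 0
        self.seen = 0

    def clip_(self, params):
        grads = [p.grad for p in params if p.grad is not None]
        norm = float(torch.stack([g.norm() for g in grads]).norm())
        self.norms.append(norm)
        threshold = self.clipping_scale * statistics.median(self.norms)
        self.seen += 1
        if norm > threshold:
            self.clipped += 1
            for g in grads:
                g.mul_(threshold / norm)
        # returns the logged quantities: threshold and percent-clipped
        return threshold, 100.0 * self.clipped / self.seen

With an instance clipper of this class, statistics.quantiles(clipper.norms, n=4) reproduces the three interior quartiles that the log prints between the min and max.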
2023-10-11 04:31:41,148 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.428e+02 1.736e+02 1.937e+02 2.184e+02 2.613e+02, threshold=3.874e+02, percent-clipped=0.0
2023-10-11 04:31:52,059 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=590450.0, ans=0.0
2023-10-11 04:31:53,601 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=590450.0, ans=0.035
2023-10-11 04:32:02,248 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.33 vs. limit=22.5
2023-10-11 04:32:14,903 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=590543.3333333334, ans=0.0
2023-10-11 04:32:21,706 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=590543.3333333334, ans=0.0
2023-10-11 04:32:27,226 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=590590.0, ans=0.125
2023-10-11 04:32:30,259 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.16 vs. limit=15.0
2023-10-11 04:32:32,062 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.09 vs. limit=15.0
2023-10-11 04:32:34,161 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.15 vs. limit=15.0
2023-10-11 04:32:39,877 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff2.min_abs, batch_count=590636.6666666666, ans=0.1
2023-10-11 04:32:45,311 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=590636.6666666666, ans=0.0
2023-10-11 04:32:48,002 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_abs, batch_count=590683.3333333334, ans=0.5
2023-10-11 04:32:51,511 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=590683.3333333334, ans=0.0
2023-10-11 04:32:54,083 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=590683.3333333334, ans=0.09899494936611666
2023-10-11 04:33:04,214 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2023-10-11 04:33:10,602 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=590776.6666666666, ans=0.125
2023-10-11 04:33:25,221 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=590823.3333333334, ans=0.0
2023-10-11 04:33:28,408 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=590823.3333333334, ans=0.1
2023-10-11 04:33:44,792 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.372e+02 1.669e+02 1.868e+02 2.127e+02 3.030e+02, threshold=3.736e+02, percent-clipped=0.0
2023-10-11 04:33:45,541 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=590870.0, ans=0.2
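Each scaling.py:199 ScheduledFloat entry prints the current value ("ans") of a hyper-parameter that is a function of batch_count: by batch_count around 590k the various *_skip_rate knobs have decayed to 0.0, dropout_p values sit at 0.1, balancer probabilities at 0.125, and bypass scale_min at 0.2. A sketch of such a schedule, assuming piecewise-linear interpolation between (batch_count, value) breakpoints with the end values held constant (the real class lives in icefall's scaling.py; the breakpoints below are invented for illustration):

class ScheduledFloat:
    # Piecewise-linear schedule over batch_count, clamped at both ends
    # (illustrative reconstruction, not the icefall original).
    def __init__(self, *points):
        self.points = sorted(points)  # (batch_count, value) pairs

    def value(self, batch_count):
        b0, v0 = self.points[0]
        if batch_count <= b0:
            return v0
        for b1, v1 in self.points[1:]:
            if batch_count <= b1:
                frac = (batch_count - b0) / (b1 - b0)
                return v0 + frac * (v1 - v0)
            b0, v0 = b1, v1
        return v0  # past the last breakpoint: hold the final value

skip_rate = ScheduledFloat((0.0, 0.2), (4000.0, 0.05), (16000.0, 0.0))
print(skip_rate.value(590870.0))  # -> 0.0, matching the ans=0.0 skip-rate entries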
2023-10-11 04:33:49,370 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=590870.0, ans=0.0
2023-10-11 04:33:56,222 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.93 vs. limit=15.0
2023-10-11 04:34:03,359 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=590963.3333333334, ans=0.125
2023-10-11 04:34:09,161 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.28 vs. limit=22.5
2023-10-11 04:34:21,784 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.56 vs. limit=22.5
2023-10-11 04:34:27,661 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=591056.6666666666, ans=0.0
2023-10-11 04:34:29,792 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=591056.6666666666, ans=0.2
2023-10-11 04:34:39,927 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.68 vs. limit=12.0
2023-10-11 04:34:44,946 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=591103.3333333334, ans=0.125
2023-10-11 04:34:52,758 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=591150.0, ans=0.125
2023-10-11 04:34:57,297 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=591150.0, ans=0.0
2023-10-11 04:35:05,452 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=591196.6666666666, ans=0.125
2023-10-11 04:35:19,020 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=591243.3333333334, ans=0.035
2023-10-11 04:35:46,011 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.290e+02 1.618e+02 1.775e+02 1.986e+02 3.331e+02, threshold=3.551e+02, percent-clipped=0.0
2023-10-11 04:35:48,351 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=591336.6666666666, ans=0.125
2023-10-11 04:35:54,667 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=591383.3333333334, ans=0.1
2023-10-11 04:36:04,521 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=591430.0, ans=0.125
2023-10-11 04:36:33,323 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=591523.3333333334, ans=0.125
2023-10-11 04:36:39,458 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.92 vs. limit=6.0
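The Whitening lines track how far a module's activations are from having a white (identity-like) covariance: a metric of 1.0 means perfectly white, larger values mean the energy is concentrated in fewer directions, and a corrective gradient only kicks in once the metric exceeds the logged limit (3.92 vs. limit=6.0 in the entry above, so no correction there). A plausible reconstruction of such a metric, under the assumption that it is the mean squared covariance entry normalised by the squared mean diagonal (the exact formula is in icefall's scaling.py and may differ):

import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    # x: (num_frames, num_channels); channels split into num_groups groups.
    frames, channels = x.shape
    x = x.reshape(frames, num_groups, channels // num_groups).transpose(0, 1)
    cov = x.transpose(1, 2) @ x / frames                    # (groups, d, d)
    d = cov.shape[-1]
    mean_sq = (cov ** 2).sum(dim=(1, 2)) / d                # energy of the covariance
    diag_mean = torch.diagonal(cov, dim1=1, dim2=2).mean(dim=1) ** 2
    return float((mean_sq / diag_mean).mean())              # 1.0 (white) .. d (rank-1)

print(whitening_metric(torch.randn(2000, 384)))  # ~1.0 for white noise, well under limit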
2023-10-11 04:36:41,574 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=591570.0, ans=0.125
2023-10-11 04:36:47,054 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=591616.6666666666, ans=0.0
2023-10-11 04:36:48,769 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=591616.6666666666, ans=0.05
2023-10-11 04:36:53,957 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=591616.6666666666, ans=0.0
2023-10-11 04:37:14,605 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.35 vs. limit=12.0
2023-10-11 04:37:31,425 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=591803.3333333334, ans=0.2
2023-10-11 04:37:36,832 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.353e+02 1.703e+02 1.860e+02 2.243e+02 3.497e+02, threshold=3.720e+02, percent-clipped=0.0
2023-10-11 04:37:55,841 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=591896.6666666666, ans=0.1
2023-10-11 04:38:06,598 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=591943.3333333334, ans=0.0
2023-10-11 04:38:35,539 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=592036.6666666666, ans=0.125
2023-10-11 04:38:37,614 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=592083.3333333334, ans=0.0
2023-10-11 04:38:44,135 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=592083.3333333334, ans=0.0
2023-10-11 04:38:55,578 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=592130.0, ans=0.125
2023-10-11 04:39:00,198 INFO [train.py:1031] (3/4) Epoch 10, batch 4000, loss[loss=0.2646, simple_loss=0.3383, pruned_loss=0.09545, over 15637.00 frames. ], tot_loss[loss=0.207, simple_loss=0.2942, pruned_loss=0.05986, over 28396508.90 frames. ], batch size: 350, lr: 3.63e-03, grad_scale: 32.0
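In these progress lines, loss[...] is the current batch while tot_loss[...] is a running frame-weighted average; the "over N frames" counter grows by less and less (27.1M at batch 3500, 28.4M at 4000, 29.4M at 4500), which is what a slowly forgetting accumulator produces as it approaches its equilibrium. A sketch of that bookkeeping, with an illustrative decay constant (an assumption; the actual tracker lives in icefall's train.py):

class RunningLoss:
    # Frame-weighted running average with gradual forgetting; the decay
    # constant is illustrative, not taken from the project.
    def __init__(self, decay=0.999):
        self.decay = decay
        self.weighted_loss = 0.0
        self.num_frames = 0.0

    def update(self, loss, frames):
        self.weighted_loss = self.decay * self.weighted_loss + loss * frames
        self.num_frames = self.decay * self.num_frames + frames

    @property
    def value(self):
        # this ratio is the tot_loss printed every 500 batches
        return self.weighted_loss / max(self.num_frames, 1.0)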
2023-10-11 04:39:11,122 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=592176.6666666666, ans=0.2
2023-10-11 04:39:26,862 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=592270.0, ans=0.0
2023-10-11 04:39:33,984 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.376e+02 1.729e+02 1.898e+02 2.095e+02 2.866e+02, threshold=3.796e+02, percent-clipped=0.0
2023-10-11 04:39:34,290 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=592270.0, ans=0.5
2023-10-11 04:39:40,547 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=592316.6666666666, ans=0.125
2023-10-11 04:39:45,839 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.54 vs. limit=6.0
2023-10-11 04:39:51,043 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=592363.3333333334, ans=0.0
2023-10-11 04:39:51,219 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.20 vs. limit=15.0
2023-10-11 04:40:15,356 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=592456.6666666666, ans=0.0
2023-10-11 04:40:28,951 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=592503.3333333334, ans=0.1
2023-10-11 04:40:30,015 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=592503.3333333334, ans=0.125
2023-10-11 04:41:12,090 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=592690.0, ans=0.1
2023-10-11 04:41:25,038 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.475e+02 1.699e+02 1.875e+02 2.140e+02 2.983e+02, threshold=3.751e+02, percent-clipped=0.0
2023-10-11 04:41:35,054 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=592783.3333333334, ans=0.015
2023-10-11 04:41:41,400 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=592830.0, ans=0.125
2023-10-11 04:41:52,432 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.98 vs. limit=15.0
2023-10-11 04:42:02,967 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=592876.6666666666, ans=10.0
2023-10-11 04:42:03,340 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.89 vs. limit=12.0
2023-10-11 04:42:28,723 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=592970.0, ans=0.125
2023-10-11 04:42:31,445 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.85 vs. limit=15.0
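For monitoring, the per-500-batch train.py:1031 lines are the ones worth extracting. A small helper (hypothetical; it assumes only the format visible in this log) that pulls epoch, batch, smoothed loss, learning rate and grad scale out of the text:

import re

PAT = re.compile(
    r"Epoch (\d+), batch (\d+),.*?"
    r"tot_loss\[loss=([\d.]+).*?\], batch size: \d+, "
    r"lr: ([\d.e-]+), grad_scale: ([\d.]+)"
)

def parse_progress(text):
    # Yields (epoch, batch, tot_loss, lr, grad_scale) per progress record.
    for m in PAT.finditer(text):
        epoch, batch, loss, lr, scale = m.groups()
        yield int(epoch), int(batch), float(loss), float(lr), float(scale)

Feeding this section through parse_progress yields (10, 3500, 0.2072, 0.00364, 32.0), (10, 4000, 0.207, 0.00363, 32.0), and so on, which is enough to plot the loss and learning-rate trajectories.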
2023-10-11 04:42:35,777 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=593016.6666666666, ans=0.125
2023-10-11 04:42:58,910 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=593063.3333333334, ans=0.0
2023-10-11 04:43:01,669 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=593063.3333333334, ans=0.0
2023-10-11 04:43:10,109 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=593110.0, ans=0.125
2023-10-11 04:43:10,947 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=593110.0, ans=0.125
2023-10-11 04:43:17,448 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=593156.6666666666, ans=0.125
2023-10-11 04:43:32,476 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.267e+02 1.699e+02 1.905e+02 2.275e+02 2.917e+02, threshold=3.809e+02, percent-clipped=0.0
2023-10-11 04:43:35,545 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=593203.3333333334, ans=0.1
2023-10-11 04:43:55,428 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=593296.6666666666, ans=0.1
2023-10-11 04:44:01,207 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.03 vs. limit=22.5
2023-10-11 04:44:43,872 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.27 vs. limit=12.0
2023-10-11 04:44:50,406 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.04 vs. limit=15.0
2023-10-11 04:44:56,514 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=593576.6666666666, ans=10.0
2023-10-11 04:44:57,409 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=593576.6666666666, ans=0.125
2023-10-11 04:45:04,606 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.70 vs.
limit=22.5 2023-10-11 04:45:14,888 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=593623.3333333334, ans=0.125 2023-10-11 04:45:23,158 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.451e+02 1.687e+02 1.869e+02 2.093e+02 2.887e+02, threshold=3.738e+02, percent-clipped=0.0 2023-10-11 04:45:23,449 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=593670.0, ans=0.2 2023-10-11 04:45:46,851 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=593763.3333333334, ans=0.125 2023-10-11 04:45:48,014 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.73 vs. limit=15.0 2023-10-11 04:45:51,783 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=593810.0, ans=0.2 2023-10-11 04:46:03,311 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=593856.6666666666, ans=0.1 2023-10-11 04:46:12,229 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.44 vs. limit=10.0 2023-10-11 04:46:16,878 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=593903.3333333334, ans=0.1 2023-10-11 04:46:18,847 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=593903.3333333334, ans=0.125 2023-10-11 04:46:42,213 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 04:46:43,150 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=593996.6666666666, ans=0.1 2023-10-11 04:47:15,244 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.514e+02 1.756e+02 2.018e+02 2.217e+02 3.636e+02, threshold=4.035e+02, percent-clipped=0.0 2023-10-11 04:47:26,632 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=594183.3333333334, ans=0.1 2023-10-11 04:47:32,408 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.74 vs. 
limit=22.5 2023-10-11 04:47:43,771 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=594230.0, ans=0.0 2023-10-11 04:47:45,334 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=594230.0, ans=0.125 2023-10-11 04:47:50,235 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 04:48:00,157 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=594276.6666666666, ans=0.125 2023-10-11 04:48:15,089 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=594370.0, ans=0.125 2023-10-11 04:48:19,716 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.56 vs. limit=15.0 2023-10-11 04:48:37,340 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=594463.3333333334, ans=0.0 2023-10-11 04:48:49,540 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=594510.0, ans=0.1 2023-10-11 04:48:50,340 INFO [train.py:1031] (3/4) Epoch 10, batch 4500, loss[loss=0.229, simple_loss=0.3132, pruned_loss=0.07238, over 16645.00 frames. ], tot_loss[loss=0.207, simple_loss=0.2946, pruned_loss=0.05972, over 29383079.78 frames. ], batch size: 220, lr: 3.63e-03, grad_scale: 16.0 2023-10-11 04:49:01,669 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.88 vs. limit=15.0 2023-10-11 04:49:16,336 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.38 vs. limit=15.0 2023-10-11 04:49:17,716 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.394e+02 1.699e+02 1.963e+02 2.223e+02 2.946e+02, threshold=3.926e+02, percent-clipped=0.0 2023-10-11 04:49:20,233 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=594650.0, ans=0.125 2023-10-11 04:49:21,255 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=594650.0, ans=0.125 2023-10-11 04:50:18,641 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=594883.3333333334, ans=0.035 2023-10-11 04:50:28,147 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=594930.0, ans=0.1 2023-10-11 04:50:34,253 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.62 vs. 
limit=10.0 2023-10-11 04:50:40,714 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=594976.6666666666, ans=0.1 2023-10-11 04:50:41,785 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=594976.6666666666, ans=0.125 2023-10-11 04:50:49,389 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=595023.3333333334, ans=0.2 2023-10-11 04:50:54,421 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.51 vs. limit=15.0 2023-10-11 04:50:58,974 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.whiten.whitening_limit, batch_count=595070.0, ans=12.0 2023-10-11 04:51:02,434 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.419e+02 1.673e+02 1.883e+02 2.177e+02 3.004e+02, threshold=3.767e+02, percent-clipped=0.0 2023-10-11 04:51:12,094 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.47 vs. limit=12.0 2023-10-11 04:51:14,653 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=595116.6666666666, ans=0.125 2023-10-11 04:51:39,749 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=595256.6666666666, ans=0.2 2023-10-11 04:51:46,705 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=3.85 vs. limit=12.0 2023-10-11 04:51:49,942 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=595303.3333333334, ans=0.125 2023-10-11 04:51:59,728 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=595350.0, ans=0.0 2023-10-11 04:52:09,541 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.68 vs. limit=22.5 2023-10-11 04:52:33,078 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=595490.0, ans=0.125 2023-10-11 04:52:48,793 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=595536.6666666666, ans=0.125 2023-10-11 04:52:51,217 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.361e+02 1.701e+02 1.859e+02 2.118e+02 3.217e+02, threshold=3.718e+02, percent-clipped=0.0 2023-10-11 04:53:12,391 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=595630.0, ans=0.125 2023-10-11 04:53:13,440 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.19 vs. 
limit=15.0 2023-10-11 04:53:14,980 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=595676.6666666666, ans=0.0 2023-10-11 04:53:34,764 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=595723.3333333334, ans=0.2 2023-10-11 04:53:43,727 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 04:54:21,000 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=595956.6666666666, ans=0.0 2023-10-11 04:54:26,338 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=595956.6666666666, ans=0.0 2023-10-11 04:54:36,410 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.28 vs. limit=15.0 2023-10-11 04:54:36,931 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=596003.3333333334, ans=0.1 2023-10-11 04:54:41,130 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.528e+02 1.734e+02 1.903e+02 2.114e+02 2.782e+02, threshold=3.806e+02, percent-clipped=0.0 2023-10-11 04:55:01,861 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=596096.6666666666, ans=0.125 2023-10-11 04:55:31,723 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.16 vs. limit=15.0 2023-10-11 04:55:33,760 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=596236.6666666666, ans=0.1 2023-10-11 04:56:08,313 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=596376.6666666666, ans=10.0 2023-10-11 04:56:35,249 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.435e+02 1.747e+02 2.020e+02 2.351e+02 3.391e+02, threshold=4.040e+02, percent-clipped=0.0 2023-10-11 04:57:05,374 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=596610.0, ans=0.125 2023-10-11 04:57:20,302 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=596656.6666666666, ans=0.125 2023-10-11 04:57:25,713 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=596703.3333333334, ans=0.0 2023-10-11 04:57:25,741 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=596703.3333333334, ans=0.0 2023-10-11 04:57:43,232 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.05 vs. 
limit=15.0 2023-10-11 04:57:44,711 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=596750.0, ans=0.1 2023-10-11 04:57:51,883 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=596796.6666666666, ans=0.2 2023-10-11 04:57:58,133 INFO [train.py:1031] (3/4) Epoch 10, batch 5000, loss[loss=0.2025, simple_loss=0.2869, pruned_loss=0.05904, over 16917.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.294, pruned_loss=0.0597, over 30116709.28 frames. ], batch size: 138, lr: 3.62e-03, grad_scale: 32.0 2023-10-11 04:57:58,402 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=596843.3333333334, ans=0.0 2023-10-11 04:58:11,613 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.06 vs. limit=12.0 2023-10-11 04:58:25,852 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.80 vs. limit=12.0 2023-10-11 04:58:29,982 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.385e+02 1.751e+02 1.913e+02 2.182e+02 3.284e+02, threshold=3.825e+02, percent-clipped=0.0 2023-10-11 04:58:57,939 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=597076.6666666666, ans=0.125 2023-10-11 04:59:21,854 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=597170.0, ans=0.07 2023-10-11 04:59:23,256 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=597170.0, ans=0.0 2023-10-11 04:59:50,835 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=597263.3333333334, ans=0.125 2023-10-11 04:59:52,214 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 04:59:53,502 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=597310.0, ans=0.125 2023-10-11 04:59:59,316 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.85 vs. 
limit=12.0 2023-10-11 05:00:15,765 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=597356.6666666666, ans=0.125 2023-10-11 05:00:16,561 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=597356.6666666666, ans=0.1 2023-10-11 05:00:28,997 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.456e+02 1.798e+02 2.002e+02 2.254e+02 3.464e+02, threshold=4.003e+02, percent-clipped=0.0 2023-10-11 05:00:33,648 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=597450.0, ans=0.125 2023-10-11 05:00:36,194 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=597450.0, ans=0.0 2023-10-11 05:00:48,147 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=597496.6666666666, ans=0.0 2023-10-11 05:00:59,306 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=597543.3333333334, ans=0.1 2023-10-11 05:01:00,613 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 05:01:13,561 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=597590.0, ans=0.0 2023-10-11 05:01:13,610 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=597590.0, ans=0.125 2023-10-11 05:01:13,637 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=597590.0, ans=0.0 2023-10-11 05:01:19,961 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=597636.6666666666, ans=0.2 2023-10-11 05:01:27,077 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=597636.6666666666, ans=0.125 2023-10-11 05:01:37,041 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=597683.3333333334, ans=0.025 2023-10-11 05:01:43,712 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=597730.0, ans=0.0 2023-10-11 05:01:46,334 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=597730.0, ans=0.0 2023-10-11 05:01:50,077 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.92 vs. 
limit=15.0 2023-10-11 05:01:51,548 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=597776.6666666666, ans=0.0 2023-10-11 05:01:53,611 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=597776.6666666666, ans=0.2 2023-10-11 05:01:57,004 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=597776.6666666666, ans=0.1 2023-10-11 05:01:58,757 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=597776.6666666666, ans=0.125 2023-10-11 05:01:58,843 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=597776.6666666666, ans=0.1 2023-10-11 05:02:06,531 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=597823.3333333334, ans=0.2 2023-10-11 05:02:13,587 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=597870.0, ans=0.2 2023-10-11 05:02:15,532 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=597870.0, ans=0.1 2023-10-11 05:02:20,561 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.361e+02 1.700e+02 1.862e+02 2.114e+02 3.185e+02, threshold=3.724e+02, percent-clipped=0.0 2023-10-11 05:02:21,883 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=597870.0, ans=10.0 2023-10-11 05:02:31,680 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.39 vs. limit=15.0 2023-10-11 05:02:37,412 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=597963.3333333334, ans=0.125 2023-10-11 05:02:48,941 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.06 vs. 
limit=15.0 2023-10-11 05:02:56,938 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=598010.0, ans=0.125 2023-10-11 05:02:56,961 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=598010.0, ans=0.2 2023-10-11 05:03:06,732 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=598056.6666666666, ans=0.1 2023-10-11 05:03:18,837 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=598103.3333333334, ans=0.125 2023-10-11 05:03:30,427 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=598150.0, ans=0.125 2023-10-11 05:03:45,590 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=598243.3333333334, ans=0.125 2023-10-11 05:04:04,492 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=598290.0, ans=0.125 2023-10-11 05:04:04,682 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=6.31 vs. limit=15.0 2023-10-11 05:04:15,422 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.462e+02 1.668e+02 1.925e+02 2.208e+02 3.246e+02, threshold=3.850e+02, percent-clipped=0.0 2023-10-11 05:04:28,826 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=598383.3333333334, ans=0.125 2023-10-11 05:04:36,916 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=598430.0, ans=0.125 2023-10-11 05:04:46,932 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=598476.6666666666, ans=0.07 2023-10-11 05:04:48,828 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=598476.6666666666, ans=0.125 2023-10-11 05:04:52,131 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=598476.6666666666, ans=0.1 2023-10-11 05:04:52,157 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=598476.6666666666, ans=0.2 2023-10-11 05:04:56,109 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.77 vs. limit=15.0 2023-10-11 05:04:58,590 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=598523.3333333334, ans=0.125 2023-10-11 05:05:17,751 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=598616.6666666666, ans=0.0 2023-10-11 05:05:34,539 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.96 vs. 
limit=15.0
2023-10-11 05:05:59,617 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=598756.6666666666, ans=0.125
2023-10-11 05:06:02,501 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=598803.3333333334, ans=0.1
2023-10-11 05:06:09,141 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.240e+02 1.652e+02 1.774e+02 1.971e+02 2.737e+02, threshold=3.548e+02, percent-clipped=0.0
2023-10-11 05:06:16,722 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=598850.0, ans=0.125
2023-10-11 05:06:32,681 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=598943.3333333334, ans=0.125
2023-10-11 05:06:38,940 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=598943.3333333334, ans=0.0
2023-10-11 05:06:42,685 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=598943.3333333334, ans=0.2
2023-10-11 05:06:45,024 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=598990.0, ans=0.0
2023-10-11 05:06:52,734 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=598990.0, ans=0.125
2023-10-11 05:07:29,657 INFO [train.py:1031] (3/4) Epoch 10, batch 5500, loss[loss=0.2336, simple_loss=0.2845, pruned_loss=0.09137, over 12202.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.2939, pruned_loss=0.05953, over 30713879.27 frames. ], batch size: 440, lr: 3.61e-03, grad_scale: 32.0
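The grad_scale field in these progress lines is the dynamic fp16 loss scale: it was halved to 16.0 around batch 4500 after an overflow and had grown back to 32.0 by batch 5000, the classic sawtooth of dynamic loss scaling. A generic PyTorch sketch of the mechanism (not the project's actual training loop; model and features are placeholders):

import torch

scaler = torch.cuda.amp.GradScaler(init_scale=32.0, growth_interval=2000)

def train_step(model, optimizer, features):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = model(features)
    scaler.scale(loss).backward()  # gradients carry the current scale
    scaler.step(optimizer)         # unscales first; skips the step on overflow
    scaler.update()                # halves the scale on overflow, grows it later
    return loss.detach(), scaler.get_scale()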
2023-10-11 05:07:40,178 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=599223.3333333334, ans=0.125
2023-10-11 05:07:48,296 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=599223.3333333334, ans=0.125
2023-10-11 05:07:58,596 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.346e+02 1.643e+02 1.816e+02 2.015e+02 2.758e+02, threshold=3.632e+02, percent-clipped=0.0
2023-10-11 05:08:11,537 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=599363.3333333334, ans=0.0
2023-10-11 05:08:21,513 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=599363.3333333334, ans=0.2
2023-10-11 05:08:31,672 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=599410.0, ans=0.125
2023-10-11 05:08:32,793 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=599456.6666666666, ans=0.0
2023-10-11 05:08:37,615 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=599456.6666666666, ans=0.125
2023-10-11 05:08:44,552 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=599456.6666666666, ans=0.125
2023-10-11 05:09:04,781 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-11 05:09:12,271 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=599596.6666666666, ans=0.125
2023-10-11 05:09:31,803 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.24 vs. limit=15.0
2023-10-11 05:09:44,334 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=599736.6666666666, ans=0.125
2023-10-11 05:09:49,807 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.489e+02 1.718e+02 1.858e+02 2.090e+02 2.948e+02, threshold=3.716e+02, percent-clipped=0.0
2023-10-11 05:09:50,257 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=599736.6666666666, ans=0.125
2023-10-11 05:09:54,848 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.88 vs. limit=15.0
2023-10-11 05:09:55,576 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=599783.3333333334, ans=0.125
2023-10-11 05:10:03,634 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=599830.0, ans=0.125
2023-10-11 05:10:05,236 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.35 vs. limit=15.0
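Many of the ScheduledFloat names above end in balancer knobs: min_positive/max_positive bound the per-channel fraction of positive activations, min_abs/max_abs bound the mean magnitude, and prob (scheduled down to 0.125 here) is the probability that the correcting gradient is applied on a given batch. An illustrative measurement of what those knobs constrain (an assumption about the mechanism: icefall's Balancer turns such violations into small gradient penalties, which this sketch does not implement):

import torch

def balancer_violations(x, min_positive=0.05, max_positive=0.95,
                        min_abs=0.2, max_abs=10.0):
    # x: (num_frames, num_channels); thresholds mirror values seen in the log.
    frac_pos = (x > 0).float().mean(dim=0)   # fraction of positive activations
    mean_abs = x.abs().mean(dim=0)           # mean magnitude per channel
    return {
        "too_negative": int((frac_pos < min_positive).sum()),
        "too_positive": int((frac_pos > max_positive).sum()),
        "too_small":    int((mean_abs < min_abs).sum()),
        "too_large":    int((mean_abs > max_abs).sum()),
    }

print(balancer_violations(torch.randn(1000, 256)))  # all zeros for healthy activations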
2023-10-11 05:10:08,810 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=599830.0, ans=0.2
2023-10-11 05:10:15,691 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=599876.6666666666, ans=0.125
2023-10-11 05:10:23,644 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=599876.6666666666, ans=0.0
2023-10-11 05:10:28,867 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=599923.3333333334, ans=0.2
2023-10-11 05:10:45,396 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=599970.0, ans=0.025
2023-10-11 05:10:49,577 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=600016.6666666666, ans=0.0
2023-10-11 05:10:58,527 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=600016.6666666666, ans=0.1
2023-10-11 05:11:05,594 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=600063.3333333334, ans=0.0
2023-10-11 05:11:09,970 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.65 vs. limit=15.0
2023-10-11 05:11:16,238 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.32 vs. limit=15.0
2023-10-11 05:11:43,639 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=600203.3333333334, ans=0.125
2023-10-11 05:11:47,909 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.433e+02 1.784e+02 1.975e+02 2.201e+02 3.508e+02, threshold=3.950e+02, percent-clipped=0.0
2023-10-11 05:12:06,324 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=600296.6666666666, ans=0.0
2023-10-11 05:12:13,468 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=600343.3333333334, ans=0.05
2023-10-11 05:12:30,816 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=600390.0, ans=0.0
2023-10-11 05:12:39,209 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=600436.6666666666, ans=0.125
2023-10-11 05:13:01,816 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=600530.0, ans=0.125
2023-10-11 05:13:01,961 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=600530.0, ans=0.125
2023-10-11 05:13:12,084 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=600576.6666666666, ans=0.2
2023-10-11 05:13:32,150 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=600623.3333333334, ans=0.125
2023-10-11 05:13:42,935 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles
1.307e+02 1.623e+02 1.798e+02 2.109e+02 3.115e+02, threshold=3.597e+02, percent-clipped=0.0 2023-10-11 05:13:43,415 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=600670.0, ans=0.0 2023-10-11 05:13:52,540 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=600716.6666666666, ans=0.2 2023-10-11 05:13:54,132 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=600716.6666666666, ans=0.125 2023-10-11 05:13:59,833 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=600763.3333333334, ans=0.125 2023-10-11 05:14:09,841 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=600810.0, ans=0.125 2023-10-11 05:14:24,478 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=600856.6666666666, ans=0.1 2023-10-11 05:14:26,505 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=600856.6666666666, ans=0.04949747468305833 2023-10-11 05:14:30,230 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.37 vs. limit=12.0 2023-10-11 05:14:54,746 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=600996.6666666666, ans=0.2 2023-10-11 05:14:56,206 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.33 vs. limit=15.0 2023-10-11 05:15:17,184 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.00 vs. 
limit=15.0 2023-10-11 05:15:20,950 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=601090.0, ans=0.125 2023-10-11 05:15:27,403 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=601136.6666666666, ans=0.125 2023-10-11 05:15:27,416 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=601136.6666666666, ans=0.1 2023-10-11 05:15:35,238 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.403e+02 1.674e+02 1.862e+02 2.039e+02 2.917e+02, threshold=3.723e+02, percent-clipped=0.0 2023-10-11 05:15:48,839 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=601230.0, ans=0.125 2023-10-11 05:16:07,856 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=601276.6666666666, ans=0.125 2023-10-11 05:16:09,866 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=601323.3333333334, ans=0.125 2023-10-11 05:16:11,317 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=601323.3333333334, ans=0.2 2023-10-11 05:16:22,593 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.66 vs. limit=5.0 2023-10-11 05:16:29,076 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.04 vs. limit=15.0 2023-10-11 05:16:53,330 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=601463.3333333334, ans=0.0 2023-10-11 05:16:55,369 INFO [train.py:1031] (3/4) Epoch 10, batch 6000, loss[loss=0.2065, simple_loss=0.2945, pruned_loss=0.05922, over 16868.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.2942, pruned_loss=0.05975, over 31158069.88 frames. ], batch size: 72, lr: 3.60e-03, grad_scale: 32.0 2023-10-11 05:16:59,567 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=601510.0, ans=0.125 2023-10-11 05:17:11,680 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=601556.6666666666, ans=0.2 2023-10-11 05:17:16,377 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.77 vs. limit=15.0 2023-10-11 05:17:24,104 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=601603.3333333334, ans=0.0 2023-10-11 05:17:28,713 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.372e+02 1.689e+02 1.881e+02 2.114e+02 2.953e+02, threshold=3.762e+02, percent-clipped=0.0 2023-10-11 05:17:31,754 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.29 vs. 
limit=15.0 2023-10-11 05:17:45,543 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=601696.6666666666, ans=0.125 2023-10-11 05:18:02,353 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=3.98 vs. limit=10.0 2023-10-11 05:18:07,813 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=601790.0, ans=0.125 2023-10-11 05:18:14,633 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=601790.0, ans=0.0 2023-10-11 05:18:16,929 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=601836.6666666666, ans=0.2 2023-10-11 05:18:34,020 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.09 vs. limit=12.0 2023-10-11 05:18:43,891 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.09 vs. limit=15.0 2023-10-11 05:18:50,393 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=601976.6666666666, ans=0.125 2023-10-11 05:18:55,532 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=601976.6666666666, ans=0.0 2023-10-11 05:19:05,262 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=602023.3333333334, ans=0.05 2023-10-11 05:19:18,336 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.445e+02 1.680e+02 1.916e+02 2.120e+02 3.341e+02, threshold=3.832e+02, percent-clipped=0.0 2023-10-11 05:19:23,977 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=602116.6666666666, ans=0.125 2023-10-11 05:19:44,181 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=602210.0, ans=0.1 2023-10-11 05:20:11,109 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=602303.3333333334, ans=0.0 2023-10-11 05:20:30,001 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=602396.6666666666, ans=0.2 2023-10-11 05:20:30,913 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=602396.6666666666, ans=0.125 2023-10-11 05:20:46,993 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=602443.3333333334, ans=0.125 2023-10-11 05:20:59,846 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.00 vs. limit=22.5 2023-10-11 05:21:10,519 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.364e+02 1.725e+02 1.903e+02 2.051e+02 3.480e+02, threshold=3.805e+02, percent-clipped=0.0 2023-10-11 05:21:16,873 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.63 vs. 
limit=10.0 2023-10-11 05:21:20,207 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=602583.3333333334, ans=0.0 2023-10-11 05:21:38,688 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=602676.6666666666, ans=0.0 2023-10-11 05:21:49,894 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=602723.3333333334, ans=0.125 2023-10-11 05:21:55,467 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 05:22:11,791 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=6.39 vs. limit=15.0 2023-10-11 05:22:15,220 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=602816.6666666666, ans=0.125 2023-10-11 05:22:19,992 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=602816.6666666666, ans=0.0 2023-10-11 05:22:22,427 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.34 vs. limit=15.0 2023-10-11 05:22:34,571 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=602910.0, ans=0.0 2023-10-11 05:22:46,835 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.77 vs. limit=15.0 2023-10-11 05:22:51,802 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=602956.6666666666, ans=0.125 2023-10-11 05:22:59,105 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=603003.3333333334, ans=0.0 2023-10-11 05:22:59,356 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.82 vs. limit=10.0 2023-10-11 05:23:05,985 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.455e+02 1.781e+02 1.993e+02 2.269e+02 3.668e+02, threshold=3.985e+02, percent-clipped=0.0 2023-10-11 05:23:18,872 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=603050.0, ans=0.125 2023-10-11 05:23:22,051 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.46 vs. limit=6.0 2023-10-11 05:23:34,630 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=603143.3333333334, ans=0.125 2023-10-11 05:23:39,316 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.37 vs. 
limit=22.5 2023-10-11 05:23:50,711 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=603190.0, ans=0.0 2023-10-11 05:23:52,294 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=603190.0, ans=0.125 2023-10-11 05:24:14,165 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=603236.6666666666, ans=0.125 2023-10-11 05:24:23,551 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=603283.3333333334, ans=0.0 2023-10-11 05:24:35,398 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=603330.0, ans=0.0 2023-10-11 05:24:49,810 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.27 vs. limit=22.5 2023-10-11 05:24:51,883 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.77 vs. limit=15.0 2023-10-11 05:24:59,262 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.89 vs. limit=22.5 2023-10-11 05:25:06,078 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.01 vs. limit=6.0 2023-10-11 05:25:10,827 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.426e+02 1.667e+02 1.902e+02 2.282e+02 4.065e+02, threshold=3.804e+02, percent-clipped=1.0 2023-10-11 05:25:37,003 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.50 vs. limit=15.0 2023-10-11 05:25:45,569 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.96 vs. limit=15.0 2023-10-11 05:26:19,035 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=603750.0, ans=0.0 2023-10-11 05:26:20,746 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=603796.6666666666, ans=0.125 2023-10-11 05:26:23,739 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=603796.6666666666, ans=0.09899494936611666 2023-10-11 05:26:28,196 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=603796.6666666666, ans=0.0 2023-10-11 05:26:30,551 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-11 05:26:31,254 INFO [train.py:1031] (3/4) Epoch 10, batch 6500, loss[loss=0.1897, simple_loss=0.2817, pruned_loss=0.04884, over 16880.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2947, pruned_loss=0.05982, over 31542819.41 frames. 
], batch size: 87, lr: 3.60e-03, grad_scale: 32.0 2023-10-11 05:27:10,747 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.402e+02 1.719e+02 1.863e+02 2.138e+02 2.711e+02, threshold=3.726e+02, percent-clipped=0.0 2023-10-11 05:27:26,140 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=604030.0, ans=0.1 2023-10-11 05:27:27,189 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=604030.0, ans=0.1 2023-10-11 05:27:43,139 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.01 vs. limit=15.0 2023-10-11 05:27:45,988 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=604076.6666666666, ans=0.125 2023-10-11 05:27:46,962 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=604076.6666666666, ans=0.125 2023-10-11 05:27:47,017 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=604076.6666666666, ans=0.125 2023-10-11 05:27:51,849 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.03 vs. limit=22.5 2023-10-11 05:27:57,081 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=604123.3333333334, ans=0.125 2023-10-11 05:27:57,433 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.17 vs. limit=10.0 2023-10-11 05:28:12,299 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=604170.0, ans=0.125 2023-10-11 05:28:37,735 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=604310.0, ans=0.0 2023-10-11 05:28:55,463 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=604356.6666666666, ans=0.2 2023-10-11 05:28:56,674 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=604356.6666666666, ans=15.0 2023-10-11 05:28:57,416 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=604403.3333333334, ans=0.05 2023-10-11 05:29:06,956 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.322e+02 1.788e+02 1.944e+02 2.280e+02 3.318e+02, threshold=3.889e+02, percent-clipped=0.0 2023-10-11 05:29:17,213 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=604450.0, ans=0.0 2023-10-11 05:30:15,941 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.12 vs. 
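
The per-batch loss triples are internally consistent with a pruned-transducer objective in which the simple (linear) loss enters at half weight: for the Epoch 10, batch 6500 summary above, 0.5 x 0.2817 + 0.04884 = 0.1897, and the running tot_loss obeys the same identity. A two-line check, with values copied from the log:

    # loss = 0.5 * simple_loss + pruned_loss, verified against batch 6500
    for loss, simple, pruned in [(0.1897, 0.2817, 0.04884),   # this batch
                                 (0.2072, 0.2947, 0.05982)]:  # running tot_loss
        assert abs(0.5 * simple + pruned - loss) < 5e-4
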
limit=15.0 2023-10-11 05:30:38,132 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=604823.3333333334, ans=0.0 2023-10-11 05:30:40,542 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=604823.3333333334, ans=0.09899494936611666 2023-10-11 05:30:43,451 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.57 vs. limit=15.0 2023-10-11 05:30:49,537 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=604870.0, ans=0.0 2023-10-11 05:30:58,588 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.330e+02 1.688e+02 1.822e+02 2.015e+02 2.853e+02, threshold=3.644e+02, percent-clipped=0.0 2023-10-11 05:31:04,514 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=604916.6666666666, ans=0.2 2023-10-11 05:31:10,020 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=604963.3333333334, ans=0.125 2023-10-11 05:31:27,121 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=605010.0, ans=0.2 2023-10-11 05:31:36,617 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=605056.6666666666, ans=0.125 2023-10-11 05:31:46,394 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=605103.3333333334, ans=0.125 2023-10-11 05:31:52,872 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=605103.3333333334, ans=0.0 2023-10-11 05:32:04,872 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer_na.min_abs, batch_count=605150.0, ans=0.02 2023-10-11 05:32:23,008 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=605196.6666666666, ans=0.0 2023-10-11 05:32:27,036 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=605243.3333333334, ans=0.1 2023-10-11 05:32:44,380 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.51 vs. 
limit=15.0 2023-10-11 05:33:09,436 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.351e+02 1.603e+02 1.742e+02 1.942e+02 3.490e+02, threshold=3.485e+02, percent-clipped=0.0 2023-10-11 05:33:10,646 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=605383.3333333334, ans=0.1 2023-10-11 05:33:14,260 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=605383.3333333334, ans=0.04949747468305833 2023-10-11 05:33:23,721 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=605430.0, ans=0.0 2023-10-11 05:33:31,398 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=605430.0, ans=0.125 2023-10-11 05:33:47,089 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=605523.3333333334, ans=0.125 2023-10-11 05:33:56,965 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=605570.0, ans=0.2 2023-10-11 05:33:56,973 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=605570.0, ans=0.125 2023-10-11 05:34:04,415 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=605570.0, ans=0.07 2023-10-11 05:34:12,719 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=605616.6666666666, ans=0.125 2023-10-11 05:34:17,448 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=605616.6666666666, ans=0.0 2023-10-11 05:34:18,350 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=605616.6666666666, ans=0.0 2023-10-11 05:34:47,623 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=605756.6666666666, ans=0.125 2023-10-11 05:34:50,280 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=605756.6666666666, ans=0.125 2023-10-11 05:35:01,802 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.272e+02 1.657e+02 1.848e+02 2.233e+02 3.231e+02, threshold=3.697e+02, percent-clipped=0.0 2023-10-11 05:35:21,771 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=605896.6666666666, ans=0.2 2023-10-11 05:35:23,354 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=605943.3333333334, ans=0.125 2023-10-11 05:36:05,823 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=606130.0, ans=0.125 2023-10-11 05:36:13,601 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=606130.0, ans=0.0 2023-10-11 05:36:17,216 INFO [train.py:1031] (3/4) Epoch 10, batch 7000, loss[loss=0.1784, simple_loss=0.2703, pruned_loss=0.04326, over 15390.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.295, pruned_loss=0.05973, over 31817532.50 frames. 
], batch size: 35, lr: 3.59e-03, grad_scale: 32.0 2023-10-11 05:36:23,345 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=606176.6666666666, ans=0.125 2023-10-11 05:36:23,405 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=606176.6666666666, ans=0.125 2023-10-11 05:36:35,880 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=606223.3333333334, ans=0.0 2023-10-11 05:36:50,451 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.93 vs. limit=6.0 2023-10-11 05:36:54,192 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.377e+02 1.722e+02 1.898e+02 2.060e+02 2.695e+02, threshold=3.796e+02, percent-clipped=0.0 2023-10-11 05:37:01,314 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.46 vs. limit=22.5 2023-10-11 05:37:01,872 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=606316.6666666666, ans=0.1 2023-10-11 05:37:10,678 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=606363.3333333334, ans=0.1 2023-10-11 05:37:37,155 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.27 vs. limit=15.0 2023-10-11 05:37:54,005 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.03 vs. limit=15.0 2023-10-11 05:37:54,649 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=606550.0, ans=0.0 2023-10-11 05:38:03,029 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=606596.6666666666, ans=0.125 2023-10-11 05:38:11,434 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=606643.3333333334, ans=0.025 2023-10-11 05:38:16,894 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=606643.3333333334, ans=0.125 2023-10-11 05:38:22,953 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=606690.0, ans=0.125 2023-10-11 05:38:43,569 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.395e+02 1.749e+02 2.024e+02 2.406e+02 3.274e+02, threshold=4.047e+02, percent-clipped=0.0 2023-10-11 05:38:44,268 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=606783.3333333334, ans=0.125 2023-10-11 05:38:52,093 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.52 vs. 
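
The "Whitening: name=..., metric=M vs. limit=L" entries are covariance diagnostics on module outputs: the metric is about 1.0 when the (grouped) channel covariance is proportional to the identity, and grows toward the group's channel count as the covariance collapses onto a few directions. The "vs. limit" phrasing suggests a corrective penalty applies only when the metric exceeds its scheduled limit, and most entries here sit below theirs (e.g. metric=3.93 vs. limit=6.0 for the whiten_keys entry above). A sketch of a metric with these properties, capturing the idea rather than the exact implementation:

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int) -> float:
        """~1.0 for 'white' features; larger as the covariance grows ill-conditioned."""
        n, c = x.shape                                    # (frames, channels)
        d = c // num_groups
        xg = x.reshape(n, num_groups, d).transpose(0, 1)  # (groups, frames, d)
        cov = torch.matmul(xg.transpose(1, 2), xg) / n    # uncentered, for brevity
        trace = cov.diagonal(dim1=1, dim2=2).sum(-1)      # (groups,)
        frob_sq = (cov ** 2).sum(dim=(1, 2))              # squared Frobenius norm
        # frob_sq * d / trace**2 equals 1 iff cov is a multiple of the identity
        # (Cauchy-Schwarz) and approaches d as the spectrum concentrates.
        return (frob_sq * d / trace ** 2).mean().item()

    x = torch.randn(2000, 192)       # roughly white input
    print(whitening_metric(x, 1))    # slightly above 1.0 from sampling noise
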
limit=15.0 2023-10-11 05:39:04,008 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=606830.0, ans=0.125 2023-10-11 05:39:15,766 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=606876.6666666666, ans=0.2 2023-10-11 05:39:38,980 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=606970.0, ans=0.0 2023-10-11 05:39:41,107 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.24 vs. limit=15.0 2023-10-11 05:39:42,731 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=607016.6666666666, ans=0.0 2023-10-11 05:39:49,113 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.89 vs. limit=10.0 2023-10-11 05:39:51,737 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=607063.3333333334, ans=0.125 2023-10-11 05:40:04,529 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=607110.0, ans=0.0 2023-10-11 05:40:48,776 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.449e+02 1.763e+02 1.918e+02 2.092e+02 3.391e+02, threshold=3.836e+02, percent-clipped=0.0 2023-10-11 05:40:56,766 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=607250.0, ans=0.025 2023-10-11 05:41:02,964 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 05:41:13,082 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.52 vs. limit=22.5 2023-10-11 05:41:19,040 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=607343.3333333334, ans=0.125 2023-10-11 05:41:28,264 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.38 vs. limit=6.0 2023-10-11 05:41:29,268 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=607390.0, ans=0.125 2023-10-11 05:41:37,124 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.47 vs. limit=15.0 2023-10-11 05:42:09,780 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=607530.0, ans=0.2 2023-10-11 05:42:47,294 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.417e+02 1.668e+02 1.796e+02 2.002e+02 2.838e+02, threshold=3.592e+02, percent-clipped=0.0 2023-10-11 05:42:55,940 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=4.98 vs. 
limit=15.0 2023-10-11 05:43:36,930 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=607903.3333333334, ans=0.125 2023-10-11 05:44:08,684 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.20 vs. limit=6.0 2023-10-11 05:44:14,019 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=608043.3333333334, ans=0.125 2023-10-11 05:44:19,851 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=608090.0, ans=0.0 2023-10-11 05:44:19,963 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=608090.0, ans=0.125 2023-10-11 05:44:34,706 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=608136.6666666666, ans=0.125 2023-10-11 05:44:38,878 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.316e+02 1.745e+02 2.046e+02 2.441e+02 3.789e+02, threshold=4.092e+02, percent-clipped=3.0 2023-10-11 05:44:48,474 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=608183.3333333334, ans=0.0 2023-10-11 05:44:55,448 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=608230.0, ans=0.025 2023-10-11 05:44:55,492 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=608230.0, ans=0.1 2023-10-11 05:45:17,525 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=608323.3333333334, ans=0.2 2023-10-11 05:45:23,138 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=608370.0, ans=0.0 2023-10-11 05:45:23,143 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=608370.0, ans=0.125 2023-10-11 05:45:33,130 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=608416.6666666666, ans=0.0 2023-10-11 05:45:40,277 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=608416.6666666666, ans=0.0 2023-10-11 05:45:44,998 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=608463.3333333334, ans=0.0 2023-10-11 05:45:45,955 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=608463.3333333334, ans=0.125 2023-10-11 05:45:55,086 INFO [train.py:1031] (3/4) Epoch 10, batch 7500, loss[loss=0.2307, simple_loss=0.3172, pruned_loss=0.07209, over 16831.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.295, pruned_loss=0.05975, over 32010934.75 frames. 
], batch size: 188, lr: 3.58e-03, grad_scale: 32.0 2023-10-11 05:45:56,088 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=608510.0, ans=0.125 2023-10-11 05:46:01,058 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=608510.0, ans=0.09899494936611666 2023-10-11 05:46:23,662 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=608603.3333333334, ans=0.035 2023-10-11 05:46:28,281 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.424e+02 1.745e+02 1.943e+02 2.282e+02 3.984e+02, threshold=3.887e+02, percent-clipped=0.0 2023-10-11 05:46:31,431 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=608650.0, ans=0.125 2023-10-11 05:47:11,577 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=608790.0, ans=0.0 2023-10-11 05:47:24,467 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=608836.6666666666, ans=0.1 2023-10-11 05:47:52,322 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.57 vs. limit=10.0 2023-10-11 05:47:54,627 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=608976.6666666666, ans=0.125 2023-10-11 05:48:19,976 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=609070.0, ans=0.125 2023-10-11 05:48:21,396 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=609070.0, ans=0.95 2023-10-11 05:48:23,968 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.402e+02 1.676e+02 1.857e+02 2.112e+02 2.881e+02, threshold=3.714e+02, percent-clipped=0.0 2023-10-11 05:49:02,601 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=609210.0, ans=0.125 2023-10-11 05:49:25,495 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=609303.3333333334, ans=0.1 2023-10-11 05:49:45,748 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.54 vs. 
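
Across these batch summaries the learning rate decays gently, from 3.60e-03 at batch 6500 to 3.58e-03 by batch 7500, consistent with the power-law "Eden" schedule the Zipformer recipes use, which decays in both the batch index and the (fractional) epoch. A sketch of that general shape; the constants are recipe-specific, and the real scheduler adds warm-up and other factors:

    def eden_lr(base_lr: float, batch: float, epoch: float,
                lr_batches: float, lr_epochs: float) -> float:
        """Power-law decay in both batch index and epoch."""
        return (base_lr
                * ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
                * ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25)

Once batch >> lr_batches, the batch factor behaves like batch ** -0.5, so the change per few hundred batches is tiny this deep into training.
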
limit=15.0 2023-10-11 05:49:48,194 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 05:49:50,308 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_positive, batch_count=609396.6666666666, ans=0.05 2023-10-11 05:50:27,381 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.307e+02 1.636e+02 1.793e+02 2.043e+02 3.357e+02, threshold=3.586e+02, percent-clipped=0.0 2023-10-11 05:51:07,626 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=609723.3333333334, ans=0.0 2023-10-11 05:51:08,532 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=609723.3333333334, ans=0.1 2023-10-11 05:51:08,723 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=609723.3333333334, ans=0.1 2023-10-11 05:51:09,616 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=609723.3333333334, ans=0.0 2023-10-11 05:51:23,097 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=609816.6666666666, ans=0.125 2023-10-11 05:51:32,914 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.48 vs. limit=12.0 2023-10-11 05:52:00,180 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=609956.6666666666, ans=0.95 2023-10-11 05:52:20,217 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.343e+02 1.692e+02 1.850e+02 2.078e+02 3.142e+02, threshold=3.700e+02, percent-clipped=0.0 2023-10-11 05:52:44,061 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.78 vs. limit=15.0 2023-10-11 05:52:58,524 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=610190.0, ans=0.0 2023-10-11 05:53:03,459 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=610190.0, ans=0.125 2023-10-11 05:53:06,845 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.63 vs. limit=15.0 2023-10-11 05:53:23,963 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=610283.3333333334, ans=0.125 2023-10-11 05:53:46,728 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=14.20 vs. limit=15.0 2023-10-11 05:53:49,847 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.32 vs. 
limit=10.0 2023-10-11 05:54:09,858 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=610470.0, ans=0.125 2023-10-11 05:54:15,605 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.419e+02 1.740e+02 1.948e+02 2.213e+02 3.755e+02, threshold=3.896e+02, percent-clipped=1.0 2023-10-11 05:54:20,118 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.06 vs. limit=10.0 2023-10-11 05:54:24,443 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=610516.6666666666, ans=0.0 2023-10-11 05:54:32,220 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.14 vs. limit=15.0 2023-10-11 05:54:42,587 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=610610.0, ans=0.2 2023-10-11 05:54:55,934 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=610656.6666666666, ans=0.0 2023-10-11 05:54:57,405 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.25 vs. limit=15.0 2023-10-11 05:55:08,331 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=610703.3333333334, ans=0.1 2023-10-11 05:55:10,983 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=610703.3333333334, ans=0.0 2023-10-11 05:55:15,578 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=610750.0, ans=0.1 2023-10-11 05:55:17,674 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=610750.0, ans=0.0 2023-10-11 05:55:27,942 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=610796.6666666666, ans=0.125 2023-10-11 05:55:28,084 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=610796.6666666666, ans=0.125 2023-10-11 05:55:37,049 INFO [train.py:1031] (3/4) Epoch 10, batch 8000, loss[loss=0.2028, simple_loss=0.2929, pruned_loss=0.05637, over 16329.00 frames. ], tot_loss[loss=0.2064, simple_loss=0.2943, pruned_loss=0.05921, over 32196306.04 frames. ], batch size: 50, lr: 3.58e-03, grad_scale: 32.0 2023-10-11 05:55:58,688 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.81 vs. limit=15.0 2023-10-11 05:56:04,031 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.17 vs. 
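
The balancer.* schedules (min_positive, max_positive, min_abs, prob) parameterize activation balancers: per-channel constraints on how often a module's output should be positive (e.g. between 0.05 and 0.95 of the time) and how large its mean absolute value should be, enforced stochastically with probability prob (0.125 in most entries here). A simplified sketch of the positivity constraint, written as an explicit penalty rather than the in-backward gradient correction the actual module uses; all names here are illustrative:

    import torch

    def balance_penalty(x: torch.Tensor,
                        min_positive: float = 0.05,
                        max_positive: float = 0.95,
                        prob: float = 0.125) -> torch.Tensor:
        """Penalize channels whose positive fraction leaves [min_positive, max_positive]."""
        if torch.rand(()) > prob:                # apply only on a fraction of passes
            return x.new_zeros(())
        # differentiable proxy for the per-channel positive fraction
        frac_pos = torch.sigmoid(20.0 * x).mean(dim=0)
        too_low = (min_positive - frac_pos).clamp(min=0.0)
        too_high = (frac_pos - max_positive).clamp(min=0.0)
        return (too_low + too_high).sum()
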
limit=22.5 2023-10-11 05:56:05,461 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=610936.6666666666, ans=0.125 2023-10-11 05:56:06,397 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=610936.6666666666, ans=0.125 2023-10-11 05:56:11,112 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.354e+02 1.630e+02 1.875e+02 2.146e+02 3.389e+02, threshold=3.750e+02, percent-clipped=0.0 2023-10-11 05:56:12,296 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 05:56:33,670 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=611076.6666666666, ans=0.1 2023-10-11 05:56:36,762 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=611076.6666666666, ans=0.0 2023-10-11 05:56:50,945 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=611123.3333333334, ans=0.125 2023-10-11 05:57:00,010 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.55 vs. limit=15.0 2023-10-11 05:57:05,242 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=611216.6666666666, ans=0.125 2023-10-11 05:57:17,855 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=611263.3333333334, ans=0.0 2023-10-11 05:57:28,686 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=611310.0, ans=0.125 2023-10-11 05:57:52,198 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.14 vs. limit=15.0 2023-10-11 05:57:54,609 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=611403.3333333334, ans=0.0 2023-10-11 05:57:56,308 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.380e+02 1.719e+02 1.872e+02 2.136e+02 2.984e+02, threshold=3.744e+02, percent-clipped=0.0 2023-10-11 05:58:00,221 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.80 vs. limit=12.0 2023-10-11 05:58:07,436 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=611496.6666666666, ans=0.0 2023-10-11 05:58:08,406 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=611496.6666666666, ans=0.125 2023-10-11 05:58:18,772 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=611496.6666666666, ans=0.125 2023-10-11 05:58:42,630 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.04 vs. 
limit=6.0 2023-10-11 05:59:02,562 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=611636.6666666666, ans=0.125 2023-10-11 05:59:03,511 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=611636.6666666666, ans=0.2 2023-10-11 05:59:12,396 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=611683.3333333334, ans=0.125 2023-10-11 05:59:35,032 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=611776.6666666666, ans=0.1 2023-10-11 05:59:43,400 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=611823.3333333334, ans=0.1 2023-10-11 05:59:50,732 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=611823.3333333334, ans=0.125 2023-10-11 06:00:01,314 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=611870.0, ans=0.05 2023-10-11 06:00:04,454 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.336e+02 1.712e+02 1.885e+02 2.119e+02 3.410e+02, threshold=3.770e+02, percent-clipped=0.0 2023-10-11 06:00:06,252 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=611916.6666666666, ans=0.0 2023-10-11 06:00:12,223 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=611916.6666666666, ans=0.0 2023-10-11 06:00:16,874 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=611963.3333333334, ans=0.125 2023-10-11 06:00:39,969 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=612056.6666666666, ans=0.04949747468305833 2023-10-11 06:01:00,557 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=612150.0, ans=0.125 2023-10-11 06:01:06,568 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.57 vs. limit=10.0 2023-10-11 06:01:09,596 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.64 vs. limit=6.0 2023-10-11 06:01:22,133 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=612243.3333333334, ans=0.035 2023-10-11 06:01:33,488 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.33 vs. limit=15.0 2023-10-11 06:01:59,335 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.394e+02 1.618e+02 1.812e+02 2.014e+02 2.615e+02, threshold=3.624e+02, percent-clipped=0.0 2023-10-11 06:02:25,610 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=612476.6666666666, ans=0.07 2023-10-11 06:02:32,066 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.34 vs. 
limit=15.0 2023-10-11 06:02:32,925 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=612523.3333333334, ans=0.125 2023-10-11 06:02:37,513 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.83 vs. limit=10.0 2023-10-11 06:02:41,453 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=612523.3333333334, ans=0.0 2023-10-11 06:02:58,069 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=612616.6666666666, ans=0.1 2023-10-11 06:03:23,422 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=612710.0, ans=0.0 2023-10-11 06:03:26,186 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.21 vs. limit=12.0 2023-10-11 06:03:28,867 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.25 vs. limit=6.0 2023-10-11 06:03:38,432 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=612756.6666666666, ans=0.125 2023-10-11 06:03:57,200 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.382e+02 1.654e+02 1.751e+02 1.907e+02 2.922e+02, threshold=3.502e+02, percent-clipped=0.0 2023-10-11 06:04:26,296 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.36 vs. limit=15.0 2023-10-11 06:05:17,898 INFO [train.py:1031] (3/4) Epoch 10, batch 8500, loss[loss=0.2035, simple_loss=0.2919, pruned_loss=0.05754, over 16817.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.2947, pruned_loss=0.05918, over 32350217.40 frames. 
], batch size: 175, lr: 3.57e-03, grad_scale: 32.0 2023-10-11 06:05:49,391 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=613270.0, ans=0.0 2023-10-11 06:05:51,505 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=613316.6666666666, ans=0.1 2023-10-11 06:05:54,631 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.462e+02 1.810e+02 2.011e+02 2.393e+02 3.434e+02, threshold=4.022e+02, percent-clipped=0.0 2023-10-11 06:06:03,386 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=613363.3333333334, ans=0.0 2023-10-11 06:06:10,956 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=613363.3333333334, ans=0.125 2023-10-11 06:06:11,140 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=613363.3333333334, ans=0.0 2023-10-11 06:06:20,245 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=613410.0, ans=0.125 2023-10-11 06:06:24,874 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=613410.0, ans=0.125 2023-10-11 06:06:50,660 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.35 vs. limit=22.5 2023-10-11 06:07:09,095 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=613596.6666666666, ans=0.125 2023-10-11 06:07:16,290 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.72 vs. limit=15.0 2023-10-11 06:07:39,529 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=613690.0, ans=0.125 2023-10-11 06:07:46,067 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=613736.6666666666, ans=0.0 2023-10-11 06:08:00,052 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.374e+02 1.767e+02 1.981e+02 2.298e+02 3.299e+02, threshold=3.963e+02, percent-clipped=0.0 2023-10-11 06:08:00,313 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=613783.3333333334, ans=10.0 2023-10-11 06:08:03,616 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.49 vs. 
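
The steady "grad_scale: 32.0" in the batch summaries is the mixed-precision loss scale: with fp16 training, the loss is multiplied by this factor before backward so small gradients survive in half precision, and a value that stays flat for thousands of batches indicates no overflows are forcing it down. A minimal sketch using PyTorch's standard AMP API; the recipe's own training loop differs in detail:

    import torch

    model = torch.nn.Linear(80, 500).cuda()
    opt = torch.optim.AdamW(model.parameters(), lr=3.57e-3)
    scaler = torch.cuda.amp.GradScaler(init_scale=32.0)

    x = torch.randn(16, 80, device="cuda")
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = model(x).logsumexp(dim=-1).mean()
    opt.zero_grad()
    scaler.scale(loss).backward()   # backward on the scaled loss
    scaler.step(opt)                # unscales grads; skips the step on inf/nan
    scaler.update()                 # shrinks the scale after overflow, else grows it
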
limit=15.0 2023-10-11 06:08:11,307 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=613830.0, ans=0.0 2023-10-11 06:08:14,041 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=613830.0, ans=0.0 2023-10-11 06:08:17,169 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=613830.0, ans=0.125 2023-10-11 06:08:35,455 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=613923.3333333334, ans=0.125 2023-10-11 06:08:39,286 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=613923.3333333334, ans=0.2 2023-10-11 06:08:39,390 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=613923.3333333334, ans=0.125 2023-10-11 06:09:04,895 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=614016.6666666666, ans=0.125 2023-10-11 06:09:12,483 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=614063.3333333334, ans=0.95 2023-10-11 06:09:18,821 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=614063.3333333334, ans=0.125 2023-10-11 06:09:35,798 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=614156.6666666666, ans=0.125 2023-10-11 06:09:38,749 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=614156.6666666666, ans=0.125 2023-10-11 06:09:46,536 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.92 vs. limit=22.5 2023-10-11 06:09:52,351 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=614203.3333333334, ans=0.1 2023-10-11 06:09:54,921 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=614203.3333333334, ans=0.125 2023-10-11 06:10:01,900 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.291e+02 1.620e+02 1.804e+02 2.266e+02 2.962e+02, threshold=3.607e+02, percent-clipped=0.0 2023-10-11 06:10:02,678 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.81 vs. limit=12.0 2023-10-11 06:10:34,250 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=614343.3333333334, ans=0.2 2023-10-11 06:10:51,007 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=614436.6666666666, ans=0.125 2023-10-11 06:11:14,235 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.24 vs. 
limit=22.5 2023-10-11 06:11:28,978 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=614576.6666666666, ans=0.125 2023-10-11 06:11:51,267 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.05 vs. limit=15.0 2023-10-11 06:11:51,348 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.25 vs. limit=15.0 2023-10-11 06:11:57,915 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=614716.6666666666, ans=0.0 2023-10-11 06:11:58,026 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=614716.6666666666, ans=0.125 2023-10-11 06:11:58,311 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.65 vs. limit=15.0 2023-10-11 06:11:59,560 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.331e+02 1.593e+02 1.785e+02 2.016e+02 2.752e+02, threshold=3.570e+02, percent-clipped=0.0 2023-10-11 06:12:15,490 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=614763.3333333334, ans=0.0 2023-10-11 06:12:16,574 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=7.70 vs. limit=12.0 2023-10-11 06:12:20,988 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.67 vs. limit=15.0 2023-10-11 06:12:21,663 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=614810.0, ans=0.125 2023-10-11 06:13:01,060 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=614996.6666666666, ans=0.0 2023-10-11 06:13:06,024 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.91 vs. limit=12.0 2023-10-11 06:13:22,605 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=615043.3333333334, ans=0.125 2023-10-11 06:13:32,352 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=615090.0, ans=0.2 2023-10-11 06:13:38,147 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.07 vs. 
limit=15.0 2023-10-11 06:13:49,428 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.342e+02 1.677e+02 1.899e+02 2.179e+02 3.053e+02, threshold=3.799e+02, percent-clipped=0.0 2023-10-11 06:14:13,585 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=615276.6666666666, ans=0.0 2023-10-11 06:14:15,469 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=615276.6666666666, ans=0.125 2023-10-11 06:14:30,160 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 06:14:34,857 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=615370.0, ans=0.125 2023-10-11 06:14:40,735 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=615370.0, ans=0.0 2023-10-11 06:15:05,548 INFO [train.py:1031] (3/4) Epoch 10, batch 9000, loss[loss=0.2774, simple_loss=0.3456, pruned_loss=0.1046, over 15652.00 frames. ], tot_loss[loss=0.2058, simple_loss=0.294, pruned_loss=0.05885, over 32456028.65 frames. ], batch size: 350, lr: 3.56e-03, grad_scale: 32.0 2023-10-11 06:15:06,992 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.60 vs. limit=15.0 2023-10-11 06:15:15,704 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=615510.0, ans=0.0 2023-10-11 06:15:20,492 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.44 vs. limit=6.0 2023-10-11 06:15:38,305 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=615650.0, ans=0.125 2023-10-11 06:15:40,778 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.469e+02 1.700e+02 1.912e+02 2.198e+02 4.599e+02, threshold=3.824e+02, percent-clipped=1.0 2023-10-11 06:15:50,248 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.04 vs. limit=12.0 2023-10-11 06:16:11,191 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=615790.0, ans=0.125 2023-10-11 06:16:18,347 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=615790.0, ans=0.0 2023-10-11 06:16:18,431 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=615790.0, ans=0.1 2023-10-11 06:16:56,294 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=615976.6666666666, ans=0.0 2023-10-11 06:16:58,201 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=615976.6666666666, ans=0.125 2023-10-11 06:17:02,042 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.47 vs. 
limit=8.0 2023-10-11 06:17:29,223 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.443e+02 1.663e+02 1.884e+02 2.085e+02 2.906e+02, threshold=3.768e+02, percent-clipped=0.0 2023-10-11 06:17:29,858 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.45 vs. limit=15.0 2023-10-11 06:17:39,148 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=616163.3333333334, ans=0.125 2023-10-11 06:17:47,327 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=616210.0, ans=0.2 2023-10-11 06:17:48,255 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=616210.0, ans=0.5 2023-10-11 06:17:59,286 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=616256.6666666666, ans=0.125 2023-10-11 06:18:05,977 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=616256.6666666666, ans=0.125 2023-10-11 06:18:16,372 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=616303.3333333334, ans=0.0 2023-10-11 06:18:37,855 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=616396.6666666666, ans=0.0 2023-10-11 06:19:02,152 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.53 vs. limit=10.0 2023-10-11 06:19:15,867 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=616583.3333333334, ans=0.125 2023-10-11 06:19:16,569 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.453e+02 1.705e+02 1.934e+02 2.142e+02 3.613e+02, threshold=3.869e+02, percent-clipped=0.0 2023-10-11 06:19:19,914 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.69 vs. limit=15.0 2023-10-11 06:19:22,454 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=616583.3333333334, ans=0.1 2023-10-11 06:19:22,461 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=616583.3333333334, ans=0.125 2023-10-11 06:20:19,869 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.85 vs. 
limit=15.0 2023-10-11 06:21:00,544 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=617003.3333333334, ans=0.125 2023-10-11 06:21:06,593 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.374e+02 1.712e+02 1.920e+02 2.200e+02 3.268e+02, threshold=3.841e+02, percent-clipped=0.0 2023-10-11 06:22:00,545 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=617236.6666666666, ans=0.1 2023-10-11 06:22:04,508 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=617236.6666666666, ans=0.125 2023-10-11 06:22:15,239 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=617283.3333333334, ans=10.0 2023-10-11 06:22:29,900 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.71 vs. limit=22.5 2023-10-11 06:23:01,099 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.20 vs. limit=12.0 2023-10-11 06:23:10,563 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.392e+02 1.684e+02 1.913e+02 2.210e+02 3.109e+02, threshold=3.826e+02, percent-clipped=0.0 2023-10-11 06:23:13,379 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=617516.6666666666, ans=0.04949747468305833 2023-10-11 06:23:17,928 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=617516.6666666666, ans=0.125 2023-10-11 06:23:20,235 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.64 vs. limit=6.0 2023-10-11 06:23:37,360 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=617610.0, ans=0.2 2023-10-11 06:24:07,934 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=617750.0, ans=0.0 2023-10-11 06:24:32,637 INFO [train.py:1031] (3/4) Epoch 10, batch 9500, loss[loss=0.1995, simple_loss=0.2961, pruned_loss=0.05143, over 16907.00 frames. ], tot_loss[loss=0.2064, simple_loss=0.2947, pruned_loss=0.05906, over 32538327.30 frames. ], batch size: 104, lr: 3.56e-03, grad_scale: 32.0 2023-10-11 06:25:08,703 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.413e+02 1.716e+02 1.931e+02 2.238e+02 3.502e+02, threshold=3.862e+02, percent-clipped=0.0 2023-10-11 06:25:21,111 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=618030.0, ans=0.0 2023-10-11 06:25:28,792 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=618076.6666666666, ans=0.125 2023-10-11 06:25:45,989 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=618123.3333333334, ans=0.125 2023-10-11 06:25:57,878 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.72 vs. 
limit=15.0 2023-10-11 06:26:01,992 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=618170.0, ans=0.125 2023-10-11 06:26:14,525 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=618263.3333333334, ans=0.2 2023-10-11 06:26:37,941 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=618310.0, ans=0.07 2023-10-11 06:27:01,048 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.82 vs. limit=12.0 2023-10-11 06:27:03,405 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.416e+02 1.745e+02 2.009e+02 2.244e+02 2.944e+02, threshold=4.018e+02, percent-clipped=0.0 2023-10-11 06:27:04,723 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=618450.0, ans=0.07 2023-10-11 06:27:06,599 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=618450.0, ans=0.125 2023-10-11 06:27:12,328 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=13.37 vs. limit=15.0 2023-10-11 06:27:20,418 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=618496.6666666666, ans=0.125 2023-10-11 06:27:26,814 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.22 vs. limit=15.0 2023-10-11 06:27:29,391 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=618543.3333333334, ans=0.0 2023-10-11 06:28:19,828 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=618776.6666666666, ans=0.0 2023-10-11 06:28:32,368 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=618823.3333333334, ans=0.125 2023-10-11 06:28:40,726 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=618823.3333333334, ans=0.125 2023-10-11 06:28:53,140 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=618916.6666666666, ans=0.125 2023-10-11 06:28:56,265 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.314e+02 1.705e+02 1.908e+02 2.181e+02 3.387e+02, threshold=3.817e+02, percent-clipped=0.0 2023-10-11 06:29:20,015 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=619010.0, ans=0.0 2023-10-11 06:29:28,409 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=619056.6666666666, ans=0.125 2023-10-11 06:29:30,415 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=619056.6666666666, ans=0.2 2023-10-11 06:29:36,542 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=619056.6666666666, ans=0.1 2023-10-11 06:29:40,924 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=619103.3333333334, ans=0.125 2023-10-11 06:29:42,162 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=619103.3333333334, ans=0.125 2023-10-11 06:30:13,258 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=619196.6666666666, ans=0.1 2023-10-11 06:30:26,340 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=619290.0, ans=0.125 2023-10-11 06:30:29,497 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.57 vs. limit=15.0 2023-10-11 06:30:48,007 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=619383.3333333334, ans=15.0 2023-10-11 06:30:50,884 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.270e+02 1.622e+02 1.773e+02 1.984e+02 2.645e+02, threshold=3.547e+02, percent-clipped=0.0 2023-10-11 06:30:55,204 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.03 vs. limit=22.5 2023-10-11 06:31:21,640 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=619523.3333333334, ans=0.1 2023-10-11 06:32:16,258 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=619710.0, ans=0.1 2023-10-11 06:32:23,495 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=619756.6666666666, ans=0.125 2023-10-11 06:32:38,084 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=619803.3333333334, ans=0.0 2023-10-11 06:32:44,355 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.379e+02 1.704e+02 1.851e+02 2.093e+02 2.869e+02, threshold=3.702e+02, percent-clipped=0.0 2023-10-11 06:32:47,518 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=619850.0, ans=0.125 2023-10-11 06:32:51,635 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.38 vs. limit=15.0 2023-10-11 06:32:53,203 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=619896.6666666666, ans=0.05 2023-10-11 06:32:58,534 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.12 vs. 
limit=15.0 2023-10-11 06:33:32,100 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=620036.6666666666, ans=0.0 2023-10-11 06:33:40,623 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=620083.3333333334, ans=0.125 2023-10-11 06:33:43,526 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=620083.3333333334, ans=0.125 2023-10-11 06:33:52,452 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=620130.0, ans=0.2 2023-10-11 06:33:55,175 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=12.14 vs. limit=22.5 2023-10-11 06:33:58,145 INFO [train.py:1031] (3/4) Epoch 10, batch 10000, loss[loss=0.1942, simple_loss=0.2826, pruned_loss=0.05294, over 16555.00 frames. ], tot_loss[loss=0.2058, simple_loss=0.2938, pruned_loss=0.05885, over 32578417.75 frames. ], batch size: 50, lr: 3.55e-03, grad_scale: 32.0 2023-10-11 06:33:59,361 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=620176.6666666666, ans=0.1 2023-10-11 06:34:01,989 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=620176.6666666666, ans=0.125 2023-10-11 06:34:03,539 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=620176.6666666666, ans=0.125 2023-10-11 06:34:09,441 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten.whitening_limit, batch_count=620223.3333333334, ans=15.0 2023-10-11 06:34:21,245 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=620270.0, ans=0.125 2023-10-11 06:34:22,972 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=620270.0, ans=0.125 2023-10-11 06:34:26,378 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=620270.0, ans=0.0 2023-10-11 06:34:32,886 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.380e+02 1.637e+02 1.878e+02 2.151e+02 3.015e+02, threshold=3.756e+02, percent-clipped=0.0 2023-10-11 06:34:40,005 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=620316.6666666666, ans=0.1 2023-10-11 06:34:43,759 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=620363.3333333334, ans=0.2 2023-10-11 06:34:54,776 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.25 vs. limit=15.0 2023-10-11 06:35:06,825 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=11.19 vs. limit=22.5 2023-10-11 06:35:09,414 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=4.74 vs. 
limit=15.0 2023-10-11 06:35:11,843 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=620456.6666666666, ans=0.125 2023-10-11 06:35:14,918 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=620503.3333333334, ans=0.125 2023-10-11 06:36:08,424 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=620690.0, ans=0.0 2023-10-11 06:36:10,616 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=620690.0, ans=0.125 2023-10-11 06:36:31,820 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.514e+02 1.739e+02 1.927e+02 2.255e+02 3.167e+02, threshold=3.854e+02, percent-clipped=0.0 2023-10-11 06:36:34,481 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 06:36:35,361 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=620783.3333333334, ans=0.5 2023-10-11 06:37:00,220 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=620876.6666666666, ans=0.125 2023-10-11 06:37:12,527 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.06 vs. limit=15.0 2023-10-11 06:37:26,786 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=621016.6666666666, ans=0.125 2023-10-11 06:37:59,388 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.77 vs. limit=15.0 2023-10-11 06:38:00,295 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=621156.6666666666, ans=0.125 2023-10-11 06:38:00,517 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.93 vs. limit=12.0 2023-10-11 06:38:17,873 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=621203.3333333334, ans=0.2 2023-10-11 06:38:27,213 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.336e+02 1.640e+02 1.811e+02 2.026e+02 2.639e+02, threshold=3.623e+02, percent-clipped=0.0 2023-10-11 06:38:38,948 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=621296.6666666666, ans=0.125 2023-10-11 06:38:51,281 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=621343.3333333334, ans=0.125 2023-10-11 06:38:56,119 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=621343.3333333334, ans=0.0 2023-10-11 06:38:58,618 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.99 vs. limit=12.0 2023-10-11 06:39:02,654 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.65 vs. 
limit=22.5 2023-10-11 06:39:37,978 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=621530.0, ans=0.1 2023-10-11 06:39:43,286 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=621530.0, ans=0.125 2023-10-11 06:39:51,429 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=621576.6666666666, ans=0.0 2023-10-11 06:39:57,699 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=621576.6666666666, ans=0.2 2023-10-11 06:40:00,689 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=621623.3333333334, ans=0.0 2023-10-11 06:40:00,979 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.69 vs. limit=15.0 2023-10-11 06:40:24,024 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.310e+02 1.676e+02 1.865e+02 2.145e+02 3.086e+02, threshold=3.731e+02, percent-clipped=0.0 2023-10-11 06:40:24,428 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=621716.6666666666, ans=0.125 2023-10-11 06:40:32,957 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=621763.3333333334, ans=0.0 2023-10-11 06:40:35,116 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=621763.3333333334, ans=0.0 2023-10-11 06:40:37,387 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=621763.3333333334, ans=0.125 2023-10-11 06:40:38,898 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.80 vs. limit=22.5 2023-10-11 06:40:52,409 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.88 vs. limit=15.0 2023-10-11 06:40:57,459 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=621856.6666666666, ans=0.2 2023-10-11 06:41:04,449 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=621856.6666666666, ans=0.0 2023-10-11 06:41:55,327 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=622043.3333333334, ans=0.0 2023-10-11 06:42:16,075 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=622136.6666666666, ans=0.2 2023-10-11 06:42:16,210 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=622136.6666666666, ans=0.1 2023-10-11 06:42:22,120 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.376e+02 1.744e+02 1.959e+02 2.147e+02 3.117e+02, threshold=3.918e+02, percent-clipped=0.0 2023-10-11 06:42:51,817 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.49 vs. 
limit=15.0 2023-10-11 06:43:04,204 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.47 vs. limit=15.0 2023-10-11 06:43:21,874 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=622416.6666666666, ans=0.125 2023-10-11 06:43:40,684 INFO [train.py:1031] (3/4) Epoch 10, batch 10500, loss[loss=0.2429, simple_loss=0.3169, pruned_loss=0.0845, over 16069.00 frames. ], tot_loss[loss=0.2062, simple_loss=0.2943, pruned_loss=0.05901, over 32626293.72 frames. ], batch size: 296, lr: 3.54e-03, grad_scale: 32.0 2023-10-11 06:43:41,804 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=622510.0, ans=0.0 2023-10-11 06:43:42,214 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.65 vs. limit=15.0 2023-10-11 06:43:47,331 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.69 vs. limit=15.0 2023-10-11 06:44:13,907 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.408e+02 1.658e+02 1.819e+02 2.084e+02 2.855e+02, threshold=3.638e+02, percent-clipped=0.0 2023-10-11 06:44:19,353 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=622650.0, ans=0.125 2023-10-11 06:44:34,473 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=622743.3333333334, ans=0.0 2023-10-11 06:44:53,004 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=622790.0, ans=0.125 2023-10-11 06:45:23,715 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=622883.3333333334, ans=0.125 2023-10-11 06:45:38,596 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=622930.0, ans=0.1 2023-10-11 06:46:08,300 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=623070.0, ans=0.0 2023-10-11 06:46:16,973 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.409e+02 1.645e+02 1.826e+02 2.026e+02 2.781e+02, threshold=3.652e+02, percent-clipped=0.0 2023-10-11 06:46:29,718 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=623163.3333333334, ans=0.125 2023-10-11 06:46:31,868 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=623163.3333333334, ans=0.0 2023-10-11 06:46:45,250 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=623210.0, ans=0.125 2023-10-11 06:46:47,884 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=623256.6666666666, ans=0.0 2023-10-11 06:47:12,172 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=623350.0, ans=0.125 2023-10-11 06:47:18,273 INFO [scaling.py:979] (3/4) Whitening: 
name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.25 vs. limit=12.0 2023-10-11 06:47:22,806 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.74 vs. limit=22.5 2023-10-11 06:47:26,882 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=623396.6666666666, ans=0.125 2023-10-11 06:47:37,712 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=623443.3333333334, ans=0.09899494936611666 2023-10-11 06:47:39,261 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=623443.3333333334, ans=0.0 2023-10-11 06:47:49,233 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=623490.0, ans=0.125 2023-10-11 06:47:52,625 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=623490.0, ans=0.2 2023-10-11 06:47:58,875 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=623536.6666666666, ans=0.125 2023-10-11 06:48:13,172 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.423e+02 1.685e+02 1.855e+02 2.215e+02 3.002e+02, threshold=3.709e+02, percent-clipped=0.0 2023-10-11 06:48:16,913 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=623583.3333333334, ans=0.125 2023-10-11 06:48:16,989 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-11 06:48:24,071 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=623630.0, ans=0.0 2023-10-11 06:48:30,624 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=623630.0, ans=0.0 2023-10-11 06:48:38,826 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=623676.6666666666, ans=0.0 2023-10-11 06:48:43,967 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=623723.3333333334, ans=0.0 2023-10-11 06:48:55,971 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=623770.0, ans=0.2 2023-10-11 06:48:56,083 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=623770.0, ans=0.125 2023-10-11 06:48:58,987 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=623770.0, ans=0.1 2023-10-11 06:49:17,937 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=623863.3333333334, ans=0.2 2023-10-11 06:49:33,877 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=623910.0, ans=0.0 2023-10-11 06:49:40,965 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 06:49:46,005 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=623956.6666666666, ans=0.125 2023-10-11 06:49:46,013 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=623956.6666666666, ans=0.125 2023-10-11 06:49:57,741 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 06:50:05,621 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=624050.0, ans=0.125 2023-10-11 06:50:06,585 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=624050.0, ans=0.2 2023-10-11 06:50:07,144 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.537e+02 1.826e+02 2.002e+02 2.400e+02 3.195e+02, threshold=4.005e+02, percent-clipped=0.0 2023-10-11 06:50:09,326 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=624050.0, ans=0.125 2023-10-11 06:50:22,651 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=624096.6666666666, ans=0.2 2023-10-11 06:50:46,994 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=624190.0, ans=0.125 2023-10-11 06:50:59,178 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=624236.6666666666, ans=0.125 2023-10-11 06:51:15,655 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.12 vs. limit=15.0 2023-10-11 06:51:17,335 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=624330.0, ans=0.0 2023-10-11 06:51:23,486 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=624376.6666666666, ans=0.1 2023-10-11 06:51:33,873 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=624376.6666666666, ans=0.0 2023-10-11 06:51:52,973 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.85 vs. limit=22.5 2023-10-11 06:52:01,243 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.320e+02 1.574e+02 1.743e+02 1.911e+02 2.805e+02, threshold=3.487e+02, percent-clipped=0.0 2023-10-11 06:52:07,090 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=624516.6666666666, ans=0.2 2023-10-11 06:52:19,004 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.30 vs. limit=15.0 2023-10-11 06:52:25,269 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=624610.0, ans=0.125 2023-10-11 06:52:25,504 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.62 vs. 
limit=12.0 2023-10-11 06:52:28,533 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 06:52:45,615 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=624703.3333333334, ans=0.0 2023-10-11 06:52:47,457 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=624703.3333333334, ans=0.125 2023-10-11 06:52:48,577 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=624703.3333333334, ans=0.125 2023-10-11 06:52:49,660 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.72 vs. limit=15.0 2023-10-11 06:53:04,244 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.86 vs. limit=10.0 2023-10-11 06:53:17,748 INFO [train.py:1031] (3/4) Epoch 10, batch 11000, loss[loss=0.2702, simple_loss=0.3344, pruned_loss=0.1031, over 15579.00 frames. ], tot_loss[loss=0.2062, simple_loss=0.2944, pruned_loss=0.05896, over 32671088.97 frames. ], batch size: 350, lr: 3.54e-03, grad_scale: 32.0 2023-10-11 06:53:27,456 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=624890.0, ans=0.0 2023-10-11 06:53:39,788 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=624936.6666666666, ans=0.1 2023-10-11 06:53:52,232 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.357e+02 1.789e+02 2.013e+02 2.272e+02 3.229e+02, threshold=4.026e+02, percent-clipped=0.0 2023-10-11 06:53:52,498 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 06:53:57,237 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.36 vs. limit=15.0 2023-10-11 06:54:31,221 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=625123.3333333334, ans=0.025 2023-10-11 06:54:34,686 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.71 vs. limit=15.0 2023-10-11 06:54:35,333 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=625123.3333333334, ans=0.0 2023-10-11 06:54:45,253 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=625170.0, ans=0.0 2023-10-11 06:54:50,913 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=625216.6666666666, ans=0.05 2023-10-11 06:54:52,313 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.99 vs. 
limit=8.0 2023-10-11 06:54:59,917 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=625263.3333333334, ans=0.0 2023-10-11 06:55:18,181 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=625310.0, ans=0.0 2023-10-11 06:55:34,279 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=625356.6666666666, ans=0.2 2023-10-11 06:55:34,453 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.60 vs. limit=12.0 2023-10-11 06:55:39,096 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=625403.3333333334, ans=0.0 2023-10-11 06:55:45,516 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=625403.3333333334, ans=0.1 2023-10-11 06:55:52,941 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.331e+02 1.596e+02 1.788e+02 1.923e+02 2.481e+02, threshold=3.575e+02, percent-clipped=0.0 2023-10-11 06:55:55,858 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.03 vs. limit=8.0 2023-10-11 06:55:57,950 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=625450.0, ans=0.2 2023-10-11 06:56:04,859 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=625496.6666666666, ans=0.125 2023-10-11 06:56:04,888 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=625496.6666666666, ans=0.0 2023-10-11 06:56:11,768 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=625496.6666666666, ans=0.09899494936611666 2023-10-11 06:56:15,522 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=625543.3333333334, ans=0.0 2023-10-11 06:56:23,774 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=625543.3333333334, ans=0.125 2023-10-11 06:56:30,740 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=625590.0, ans=0.125 2023-10-11 06:56:49,014 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=625683.3333333334, ans=0.2 2023-10-11 06:56:51,760 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=625683.3333333334, ans=0.125 2023-10-11 06:57:18,680 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=625776.6666666666, ans=0.0 2023-10-11 06:57:18,752 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=625776.6666666666, ans=0.0 2023-10-11 06:57:19,658 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=625823.3333333334, ans=0.2 2023-10-11 06:57:30,996 INFO [scaling.py:979] (3/4) Whitening: 
name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.44 vs. limit=15.0 2023-10-11 06:57:32,467 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.22 vs. limit=15.0 2023-10-11 06:57:34,713 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=625870.0, ans=0.125 2023-10-11 06:57:41,207 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=625916.6666666666, ans=0.125 2023-10-11 06:57:45,575 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.411e+02 1.656e+02 1.855e+02 2.127e+02 2.897e+02, threshold=3.710e+02, percent-clipped=0.0 2023-10-11 06:58:33,449 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.98 vs. limit=15.0 2023-10-11 06:58:34,808 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.77 vs. limit=22.5 2023-10-11 06:58:39,910 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.whiten.whitening_limit, batch_count=626103.3333333334, ans=12.0 2023-10-11 06:59:08,357 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.36 vs. limit=10.0 2023-10-11 06:59:12,252 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.95 vs. limit=15.0 2023-10-11 06:59:17,943 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 06:59:27,400 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=626290.0, ans=0.2 2023-10-11 06:59:34,629 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=626336.6666666666, ans=0.0 2023-10-11 06:59:40,613 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.24 vs. limit=12.0 2023-10-11 06:59:43,128 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.350e+02 1.635e+02 1.829e+02 2.119e+02 3.020e+02, threshold=3.659e+02, percent-clipped=0.0 2023-10-11 06:59:47,891 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=626383.3333333334, ans=0.125 2023-10-11 06:59:47,958 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=626383.3333333334, ans=0.125 2023-10-11 06:59:50,969 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=626430.0, ans=0.125 2023-10-11 06:59:52,579 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.60 vs. 
limit=10.0 2023-10-11 07:00:17,592 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=626523.3333333334, ans=0.2 2023-10-11 07:00:47,716 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.17 vs. limit=15.0 2023-10-11 07:00:48,464 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=626663.3333333334, ans=0.0 2023-10-11 07:01:01,960 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=626710.0, ans=0.125 2023-10-11 07:01:01,970 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=626710.0, ans=0.125 2023-10-11 07:01:05,173 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.79 vs. limit=15.0 2023-10-11 07:01:17,764 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=626756.6666666666, ans=0.2 2023-10-11 07:01:18,573 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 07:01:22,192 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=626803.3333333334, ans=0.125 2023-10-11 07:01:27,594 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=626803.3333333334, ans=0.025 2023-10-11 07:01:36,963 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.407e+02 1.727e+02 1.987e+02 2.208e+02 2.705e+02, threshold=3.974e+02, percent-clipped=0.0 2023-10-11 07:01:40,673 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=626850.0, ans=0.125 2023-10-11 07:01:42,687 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.03 vs. limit=15.0 2023-10-11 07:01:43,468 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=626850.0, ans=0.0 2023-10-11 07:02:11,636 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.68 vs. limit=15.0 2023-10-11 07:02:53,932 INFO [train.py:1031] (3/4) Epoch 10, batch 11500, loss[loss=0.1948, simple_loss=0.2938, pruned_loss=0.04795, over 16876.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.294, pruned_loss=0.05893, over 32677553.73 frames. ], batch size: 146, lr: 3.53e-03, grad_scale: 32.0 2023-10-11 07:02:56,195 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.58 vs. 
limit=15.0 2023-10-11 07:03:01,009 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=627176.6666666666, ans=0.125 2023-10-11 07:03:13,846 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=627223.3333333334, ans=0.0 2023-10-11 07:03:28,648 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=627316.6666666666, ans=0.125 2023-10-11 07:03:30,352 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.441e+02 1.740e+02 1.876e+02 2.059e+02 2.578e+02, threshold=3.752e+02, percent-clipped=0.0 2023-10-11 07:03:47,167 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=627363.3333333334, ans=0.05 2023-10-11 07:04:11,717 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.76 vs. limit=15.0 2023-10-11 07:04:37,285 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.98 vs. limit=22.5 2023-10-11 07:05:01,218 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=627643.3333333334, ans=0.0 2023-10-11 07:05:01,321 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=627643.3333333334, ans=0.1 2023-10-11 07:05:01,327 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=627643.3333333334, ans=0.0 2023-10-11 07:05:06,071 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=627690.0, ans=0.125 2023-10-11 07:05:18,205 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=627736.6666666666, ans=0.125 2023-10-11 07:05:25,253 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=627736.6666666666, ans=0.0 2023-10-11 07:05:26,063 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=627783.3333333334, ans=0.125 2023-10-11 07:05:31,175 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.314e+02 1.655e+02 1.789e+02 1.958e+02 3.253e+02, threshold=3.578e+02, percent-clipped=0.0 2023-10-11 07:05:43,002 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=627830.0, ans=0.1 2023-10-11 07:05:47,267 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=627876.6666666666, ans=0.125 2023-10-11 07:06:10,326 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=627970.0, ans=0.2 2023-10-11 07:06:12,253 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=627970.0, ans=0.0 2023-10-11 07:06:41,189 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=628110.0, ans=0.2 2023-10-11 07:06:47,088 INFO [scaling.py:979] (3/4) Whitening: 
name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.60 vs. limit=10.0 2023-10-11 07:06:56,916 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=628156.6666666666, ans=0.125 2023-10-11 07:07:16,399 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.364e+02 1.721e+02 1.918e+02 2.241e+02 3.517e+02, threshold=3.835e+02, percent-clipped=0.0 2023-10-11 07:07:51,457 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=628390.0, ans=0.1 2023-10-11 07:08:09,783 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=628436.6666666666, ans=0.125 2023-10-11 07:08:14,076 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.43 vs. limit=15.0 2023-10-11 07:08:14,883 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=628436.6666666666, ans=0.1 2023-10-11 07:08:32,436 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=628530.0, ans=0.125 2023-10-11 07:09:23,603 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.365e+02 1.641e+02 1.826e+02 2.042e+02 3.089e+02, threshold=3.652e+02, percent-clipped=0.0 2023-10-11 07:09:31,745 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=628763.3333333334, ans=0.1 2023-10-11 07:09:35,308 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=628763.3333333334, ans=0.1 2023-10-11 07:10:12,734 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=628903.3333333334, ans=0.0 2023-10-11 07:10:24,601 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=628950.0, ans=0.125 2023-10-11 07:10:59,325 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=629090.0, ans=0.0 2023-10-11 07:11:22,361 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.422e+02 1.835e+02 2.048e+02 2.340e+02 3.571e+02, threshold=4.096e+02, percent-clipped=0.0 2023-10-11 07:11:39,415 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=629230.0, ans=0.125 2023-10-11 07:11:41,315 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.55 vs. 
limit=15.0 2023-10-11 07:11:41,936 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=629276.6666666666, ans=0.0 2023-10-11 07:11:42,691 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=629276.6666666666, ans=0.125 2023-10-11 07:11:48,832 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=629276.6666666666, ans=0.0 2023-10-11 07:11:48,916 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=629276.6666666666, ans=0.1 2023-10-11 07:12:12,085 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=629370.0, ans=0.5 2023-10-11 07:12:38,943 INFO [train.py:1031] (3/4) Epoch 10, batch 12000, loss[loss=0.2442, simple_loss=0.3215, pruned_loss=0.08345, over 16039.00 frames. ], tot_loss[loss=0.2058, simple_loss=0.2941, pruned_loss=0.05876, over 32697058.40 frames. ], batch size: 297, lr: 3.52e-03, grad_scale: 32.0 2023-10-11 07:12:57,671 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=629556.6666666666, ans=0.2 2023-10-11 07:13:15,282 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.55 vs. limit=15.0 2023-10-11 07:13:17,492 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.383e+02 1.727e+02 1.897e+02 2.112e+02 3.010e+02, threshold=3.795e+02, percent-clipped=0.0 2023-10-11 07:13:20,915 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=629650.0, ans=0.125 2023-10-11 07:13:29,037 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=629696.6666666666, ans=0.0 2023-10-11 07:13:29,115 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=629696.6666666666, ans=0.0 2023-10-11 07:13:35,691 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=629743.3333333334, ans=0.0 2023-10-11 07:14:00,162 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=629836.6666666666, ans=0.07 2023-10-11 07:14:22,516 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=629883.3333333334, ans=0.2 2023-10-11 07:14:23,502 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=629930.0, ans=0.1 2023-10-11 07:14:34,774 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=629976.6666666666, ans=0.1 2023-10-11 07:14:46,088 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=630023.3333333334, ans=0.0 2023-10-11 07:14:51,353 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=630023.3333333334, ans=0.125 2023-10-11 07:14:54,669 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=630070.0, ans=0.2 2023-10-11 07:15:02,871 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.49 vs. limit=15.0 2023-10-11 07:15:03,337 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 07:15:07,261 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=630116.6666666666, ans=0.0 2023-10-11 07:15:09,007 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.326e+02 1.632e+02 1.789e+02 2.011e+02 3.021e+02, threshold=3.578e+02, percent-clipped=0.0 2023-10-11 07:15:14,517 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=630163.3333333334, ans=0.125 2023-10-11 07:15:14,661 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.83 vs. limit=15.0 2023-10-11 07:15:23,843 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=630163.3333333334, ans=0.1 2023-10-11 07:15:30,437 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=630210.0, ans=0.1 2023-10-11 07:15:30,633 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.78 vs. limit=15.0 2023-10-11 07:15:53,821 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=630303.3333333334, ans=0.125 2023-10-11 07:16:00,001 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=630350.0, ans=0.2 2023-10-11 07:16:03,928 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=630350.0, ans=0.125 2023-10-11 07:16:16,535 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.17 vs. limit=22.5 2023-10-11 07:16:25,133 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.53 vs. limit=15.0 2023-10-11 07:16:44,855 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=630536.6666666666, ans=0.09899494936611666 2023-10-11 07:16:56,671 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=630583.3333333334, ans=0.0 2023-10-11 07:16:57,988 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.64 vs. 
limit=10.0 2023-10-11 07:16:59,498 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.411e+02 1.705e+02 1.896e+02 2.246e+02 4.002e+02, threshold=3.792e+02, percent-clipped=2.0 2023-10-11 07:17:22,422 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=630676.6666666666, ans=0.1 2023-10-11 07:17:22,425 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=630676.6666666666, ans=0.0 2023-10-11 07:17:43,421 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=630770.0, ans=0.125 2023-10-11 07:18:26,571 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.77 vs. limit=15.0 2023-10-11 07:18:47,539 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=631050.0, ans=0.125 2023-10-11 07:18:52,371 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.501e+02 1.716e+02 1.875e+02 2.192e+02 2.863e+02, threshold=3.749e+02, percent-clipped=0.0 2023-10-11 07:18:54,347 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.51 vs. limit=22.5 2023-10-11 07:19:06,820 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=631096.6666666666, ans=0.125 2023-10-11 07:19:20,971 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.79 vs. limit=22.5 2023-10-11 07:19:26,464 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=631190.0, ans=0.2 2023-10-11 07:19:42,428 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=631236.6666666666, ans=0.0 2023-10-11 07:19:51,492 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=631283.3333333334, ans=0.125 2023-10-11 07:19:53,516 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=631283.3333333334, ans=0.125 2023-10-11 07:19:59,651 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=631330.0, ans=0.125 2023-10-11 07:19:59,654 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=631330.0, ans=10.0 2023-10-11 07:20:00,008 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.00 vs. limit=15.0 2023-10-11 07:20:03,765 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=631330.0, ans=0.125 2023-10-11 07:20:20,542 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.72 vs. limit=22.5 2023-10-11 07:20:29,899 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.30 vs. 
limit=6.0 2023-10-11 07:20:46,982 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.371e+02 1.681e+02 1.875e+02 2.101e+02 3.163e+02, threshold=3.750e+02, percent-clipped=0.0 2023-10-11 07:20:47,675 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=631516.6666666666, ans=0.0 2023-10-11 07:21:07,529 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=631610.0, ans=0.0 2023-10-11 07:21:14,741 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=631610.0, ans=0.125 2023-10-11 07:21:16,493 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.68 vs. limit=22.5 2023-10-11 07:21:17,059 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=631656.6666666666, ans=0.125 2023-10-11 07:21:27,080 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=631656.6666666666, ans=0.1 2023-10-11 07:21:33,149 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=631703.3333333334, ans=0.0 2023-10-11 07:21:40,051 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=631750.0, ans=0.0 2023-10-11 07:21:52,020 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.80 vs. limit=15.0 2023-10-11 07:22:04,516 INFO [train.py:1031] (3/4) Epoch 10, batch 12500, loss[loss=0.2093, simple_loss=0.3015, pruned_loss=0.05857, over 16837.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.2937, pruned_loss=0.05866, over 32715865.70 frames. ], batch size: 155, lr: 3.52e-03, grad_scale: 32.0 2023-10-11 07:22:08,546 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=631843.3333333334, ans=0.0 2023-10-11 07:22:42,029 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=631983.3333333334, ans=0.1 2023-10-11 07:22:42,783 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.457e+02 1.669e+02 1.891e+02 2.113e+02 2.917e+02, threshold=3.782e+02, percent-clipped=0.0 2023-10-11 07:22:43,137 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=631983.3333333334, ans=0.125 2023-10-11 07:22:48,338 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.68 vs. 
limit=15.0 2023-10-11 07:22:48,767 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=632030.0, ans=0.125 2023-10-11 07:23:00,357 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=632076.6666666666, ans=0.125 2023-10-11 07:23:02,325 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 07:23:15,306 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.83 vs. limit=12.0 2023-10-11 07:23:22,674 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.82 vs. limit=22.5 2023-10-11 07:23:29,279 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.41 vs. limit=15.0 2023-10-11 07:23:36,980 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.31 vs. limit=15.0 2023-10-11 07:23:39,665 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=632216.6666666666, ans=0.0 2023-10-11 07:23:52,016 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=3.99 vs. limit=12.0 2023-10-11 07:23:52,783 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=632263.3333333334, ans=0.0 2023-10-11 07:23:57,684 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=632310.0, ans=0.04949747468305833 2023-10-11 07:24:08,494 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=632356.6666666666, ans=0.0 2023-10-11 07:24:15,840 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=632356.6666666666, ans=0.0 2023-10-11 07:24:34,179 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys.whitening_limit, batch_count=632450.0, ans=6.0 2023-10-11 07:24:35,679 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.302e+02 1.751e+02 1.971e+02 2.298e+02 3.391e+02, threshold=3.943e+02, percent-clipped=0.0 2023-10-11 07:24:52,520 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=632496.6666666666, ans=0.2 2023-10-11 07:25:11,530 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=632590.0, ans=0.125 2023-10-11 07:25:19,190 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=632636.6666666666, ans=0.125 2023-10-11 07:25:27,501 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=632636.6666666666, ans=0.125 2023-10-11 07:25:29,435 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=632683.3333333334, ans=0.125 2023-10-11 07:25:43,312 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=632730.0, 
ans=0.0 2023-10-11 07:25:51,278 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.57 vs. limit=6.0 2023-10-11 07:25:56,547 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=632776.6666666666, ans=0.1 2023-10-11 07:26:04,927 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=632823.3333333334, ans=0.1 2023-10-11 07:26:04,955 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=632823.3333333334, ans=0.1 2023-10-11 07:26:14,653 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=632870.0, ans=0.2 2023-10-11 07:26:18,698 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=632870.0, ans=0.1 2023-10-11 07:26:25,241 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=632916.6666666666, ans=0.0 2023-10-11 07:26:26,831 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.388e+02 1.622e+02 1.794e+02 2.072e+02 3.230e+02, threshold=3.587e+02, percent-clipped=0.0 2023-10-11 07:26:31,633 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=632916.6666666666, ans=0.0 2023-10-11 07:26:36,186 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.34 vs. limit=15.0 2023-10-11 07:26:41,984 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=632963.3333333334, ans=0.125 2023-10-11 07:26:44,625 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.75 vs. limit=22.5 2023-10-11 07:26:54,210 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=633010.0, ans=0.0 2023-10-11 07:27:02,442 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=633056.6666666666, ans=0.125 2023-10-11 07:27:03,393 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=633056.6666666666, ans=0.0 2023-10-11 07:27:15,770 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=633103.3333333334, ans=0.125 2023-10-11 07:27:23,510 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=633150.0, ans=0.125 2023-10-11 07:27:35,378 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=633196.6666666666, ans=0.0 2023-10-11 07:27:36,309 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=633196.6666666666, ans=0.125 2023-10-11 07:27:54,958 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.32 vs. 
limit=15.0 2023-10-11 07:28:07,533 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=633336.6666666666, ans=0.0 2023-10-11 07:28:15,641 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.357e+02 1.731e+02 1.950e+02 2.312e+02 3.360e+02, threshold=3.900e+02, percent-clipped=0.0 2023-10-11 07:28:29,886 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=633430.0, ans=0.125 2023-10-11 07:28:53,905 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=633523.3333333334, ans=0.2 2023-10-11 07:29:00,298 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.88 vs. limit=6.0 2023-10-11 07:29:15,743 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=633616.6666666666, ans=0.125 2023-10-11 07:29:22,526 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=633663.3333333334, ans=0.1 2023-10-11 07:29:26,109 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=633663.3333333334, ans=0.125 2023-10-11 07:29:42,180 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.78 vs. limit=15.0 2023-10-11 07:29:44,438 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=633756.6666666666, ans=0.125 2023-10-11 07:29:48,259 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=633756.6666666666, ans=0.1 2023-10-11 07:30:07,334 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=633850.0, ans=0.1 2023-10-11 07:30:07,936 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.430e+02 1.640e+02 1.798e+02 1.977e+02 2.764e+02, threshold=3.597e+02, percent-clipped=0.0 2023-10-11 07:30:19,731 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=633896.6666666666, ans=0.125 2023-10-11 07:30:24,464 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=633896.6666666666, ans=0.0 2023-10-11 07:30:27,864 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=633943.3333333334, ans=0.125 2023-10-11 07:30:36,353 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=633990.0, ans=0.125 2023-10-11 07:30:37,430 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 07:30:40,183 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=633990.0, ans=0.125 2023-10-11 07:30:49,385 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=634036.6666666666, ans=0.2 2023-10-11 07:30:52,556 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=634036.6666666666, ans=0.125 2023-10-11 07:31:15,530 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.95 vs. limit=15.0 2023-10-11 07:31:17,287 INFO [train.py:1031] (3/4) Epoch 10, batch 13000, loss[loss=0.2072, simple_loss=0.2929, pruned_loss=0.06068, over 16586.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.2942, pruned_loss=0.05876, over 32711825.25 frames. ], batch size: 61, lr: 3.51e-03, grad_scale: 32.0 2023-10-11 07:31:52,576 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.85 vs. limit=22.5 2023-10-11 07:32:01,128 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.375e+02 1.644e+02 1.813e+02 1.965e+02 2.537e+02, threshold=3.627e+02, percent-clipped=0.0 2023-10-11 07:32:01,581 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=634316.6666666666, ans=0.07 2023-10-11 07:32:01,690 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.84 vs. limit=10.0 2023-10-11 07:32:10,408 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.25 vs. limit=15.0 2023-10-11 07:32:21,009 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=634410.0, ans=0.125 2023-10-11 07:32:48,664 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=634503.3333333334, ans=0.2 2023-10-11 07:32:54,623 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.95 vs. limit=15.0 2023-10-11 07:33:09,643 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.81 vs. 
limit=10.0 2023-10-11 07:33:21,858 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=634643.3333333334, ans=0.125 2023-10-11 07:33:40,354 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=634690.0, ans=0.0 2023-10-11 07:33:42,248 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=634736.6666666666, ans=0.1 2023-10-11 07:33:53,816 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=634736.6666666666, ans=0.05 2023-10-11 07:33:56,333 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=634783.3333333334, ans=0.125 2023-10-11 07:33:58,700 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=634783.3333333334, ans=0.125 2023-10-11 07:34:00,134 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.427e+02 1.712e+02 1.881e+02 2.185e+02 3.275e+02, threshold=3.761e+02, percent-clipped=0.0 2023-10-11 07:34:03,829 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=634783.3333333334, ans=0.0 2023-10-11 07:34:04,310 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.54 vs. limit=5.0 2023-10-11 07:34:13,053 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=634830.0, ans=0.05 2023-10-11 07:34:40,712 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=634970.0, ans=0.125 2023-10-11 07:34:59,509 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=635016.6666666666, ans=0.2 2023-10-11 07:35:17,139 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=635110.0, ans=10.0 2023-10-11 07:35:22,407 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=635110.0, ans=0.125 2023-10-11 07:35:25,469 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.41 vs. 
limit=15.0 2023-10-11 07:35:26,232 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=635156.6666666666, ans=0.2 2023-10-11 07:35:27,210 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=635156.6666666666, ans=0.025 2023-10-11 07:35:28,986 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=635156.6666666666, ans=0.2 2023-10-11 07:35:29,111 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=635156.6666666666, ans=0.0 2023-10-11 07:35:41,163 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=635203.3333333334, ans=0.125 2023-10-11 07:35:42,562 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.04 vs. limit=15.0 2023-10-11 07:35:53,682 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.380e+02 1.718e+02 1.978e+02 2.196e+02 2.909e+02, threshold=3.955e+02, percent-clipped=0.0 2023-10-11 07:35:53,932 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 07:35:54,069 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=635250.0, ans=0.0 2023-10-11 07:35:58,609 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=635250.0, ans=0.125 2023-10-11 07:36:12,462 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=635343.3333333334, ans=0.125 2023-10-11 07:36:16,976 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=635343.3333333334, ans=0.0 2023-10-11 07:36:29,755 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=635390.0, ans=0.125 2023-10-11 07:36:29,881 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=635390.0, ans=0.0 2023-10-11 07:36:59,278 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=635530.0, ans=0.125 2023-10-11 07:37:00,542 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten.whitening_limit, batch_count=635530.0, ans=15.0 2023-10-11 07:37:01,437 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=635530.0, ans=0.125 2023-10-11 07:37:09,666 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.85 vs. 
limit=15.0 2023-10-11 07:37:13,044 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=635576.6666666666, ans=0.1 2023-10-11 07:37:26,933 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=635623.3333333334, ans=0.0 2023-10-11 07:37:31,713 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=635670.0, ans=0.125 2023-10-11 07:37:39,430 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.min_positive, batch_count=635670.0, ans=0.025 2023-10-11 07:37:42,622 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.38 vs. limit=6.0 2023-10-11 07:37:46,176 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.364e+02 1.802e+02 2.080e+02 2.362e+02 3.200e+02, threshold=4.160e+02, percent-clipped=0.0 2023-10-11 07:37:54,022 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=635763.3333333334, ans=0.125 2023-10-11 07:37:59,342 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=635763.3333333334, ans=0.125 2023-10-11 07:38:09,957 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=635810.0, ans=0.2 2023-10-11 07:38:15,632 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=635856.6666666666, ans=0.125 2023-10-11 07:38:27,076 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=635903.3333333334, ans=0.125 2023-10-11 07:38:31,036 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=635903.3333333334, ans=0.125 2023-10-11 07:38:39,244 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=635950.0, ans=0.1 2023-10-11 07:38:54,249 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=635996.6666666666, ans=0.0 2023-10-11 07:39:03,640 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.03 vs. limit=22.5 2023-10-11 07:39:05,764 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=636043.3333333334, ans=0.125 2023-10-11 07:39:07,128 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=636043.3333333334, ans=0.0 2023-10-11 07:39:16,252 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.19 vs. limit=15.0 2023-10-11 07:39:24,640 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.42 vs. 
limit=10.0 2023-10-11 07:39:30,718 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=636136.6666666666, ans=0.125 2023-10-11 07:39:36,884 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.400e+02 1.656e+02 1.798e+02 1.912e+02 2.514e+02, threshold=3.596e+02, percent-clipped=0.0 2023-10-11 07:39:43,185 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=636230.0, ans=0.125 2023-10-11 07:40:15,558 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=636370.0, ans=0.0 2023-10-11 07:40:16,898 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.67 vs. limit=15.0 2023-10-11 07:40:48,763 INFO [train.py:1031] (3/4) Epoch 10, batch 13500, loss[loss=0.2121, simple_loss=0.3008, pruned_loss=0.06169, over 16836.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.2936, pruned_loss=0.05865, over 32730203.77 frames. ], batch size: 146, lr: 3.50e-03, grad_scale: 32.0 2023-10-11 07:41:06,851 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=636556.6666666666, ans=0.125 2023-10-11 07:41:08,475 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=636556.6666666666, ans=0.05 2023-10-11 07:41:08,857 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.20 vs. limit=15.0 2023-10-11 07:41:17,379 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=636603.3333333334, ans=0.125 2023-10-11 07:41:21,664 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.10 vs. limit=15.0 2023-10-11 07:41:27,352 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.393e+02 1.650e+02 1.883e+02 2.109e+02 3.516e+02, threshold=3.767e+02, percent-clipped=0.0 2023-10-11 07:41:37,065 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 07:41:48,680 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=636743.3333333334, ans=0.125 2023-10-11 07:41:52,054 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=636743.3333333334, ans=0.0 2023-10-11 07:41:59,247 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=636790.0, ans=0.1 2023-10-11 07:42:01,285 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=636790.0, ans=0.0 2023-10-11 07:42:15,024 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.91 vs. 
limit=22.5 2023-10-11 07:42:18,072 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=636883.3333333334, ans=0.2 2023-10-11 07:42:24,709 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.24 vs. limit=22.5 2023-10-11 07:42:39,899 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.45 vs. limit=6.0 2023-10-11 07:42:43,047 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=636976.6666666666, ans=0.125 2023-10-11 07:42:45,111 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.30 vs. limit=10.0 2023-10-11 07:42:55,179 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=637023.3333333334, ans=0.125 2023-10-11 07:43:07,573 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=637070.0, ans=0.0 2023-10-11 07:43:08,230 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=637070.0, ans=0.125 2023-10-11 07:43:14,095 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.457e+02 1.726e+02 1.894e+02 2.195e+02 3.015e+02, threshold=3.788e+02, percent-clipped=0.0 2023-10-11 07:43:18,693 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=637163.3333333334, ans=0.125 2023-10-11 07:43:30,802 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.03 vs. limit=15.0 2023-10-11 07:44:05,758 INFO [train.py:1031] (3/4) Epoch 11, batch 0, loss[loss=0.1758, simple_loss=0.2669, pruned_loss=0.04242, over 16948.00 frames. ], tot_loss[loss=0.1758, simple_loss=0.2669, pruned_loss=0.04242, over 16948.00 frames. ], batch size: 117, lr: 3.32e-03, grad_scale: 64.0 2023-10-11 07:44:05,759 INFO [train.py:1054] (3/4) Computing validation loss 2023-10-11 07:44:14,259 INFO [train.py:1063] (3/4) Epoch 11, validation: loss=0.22, simple_loss=0.3069, pruned_loss=0.06655, over 1020973.00 frames. 2023-10-11 07:44:14,260 INFO [train.py:1064] (3/4) Maximum memory allocated so far is 16891MB 2023-10-11 07:44:19,449 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=637233.3333333334, ans=0.0 2023-10-11 07:44:27,170 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.33 vs. 
limit=15.0 2023-10-11 07:44:39,940 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=637326.6666666666, ans=0.1 2023-10-11 07:44:45,809 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=637326.6666666666, ans=0.125 2023-10-11 07:44:46,718 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=637326.6666666666, ans=0.0 2023-10-11 07:44:52,976 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=637373.3333333334, ans=0.0 2023-10-11 07:45:00,550 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=637420.0, ans=0.125 2023-10-11 07:45:16,468 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=637466.6666666666, ans=0.1 2023-10-11 07:45:17,425 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=637466.6666666666, ans=0.0 2023-10-11 07:45:28,812 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.17 vs. limit=6.0 2023-10-11 07:45:43,105 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.436e+02 1.754e+02 2.011e+02 2.352e+02 3.000e+02, threshold=4.023e+02, percent-clipped=0.0 2023-10-11 07:45:56,076 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=637653.3333333334, ans=0.025 2023-10-11 07:45:56,083 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=637653.3333333334, ans=0.2 2023-10-11 07:46:20,940 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.14 vs. limit=15.0 2023-10-11 07:47:34,102 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=638026.6666666666, ans=0.125 2023-10-11 07:47:35,283 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.321e+02 1.655e+02 1.855e+02 2.101e+02 3.603e+02, threshold=3.710e+02, percent-clipped=0.0 2023-10-11 07:47:37,809 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.42 vs. limit=6.0 2023-10-11 07:47:53,142 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.16 vs. 
limit=22.5 2023-10-11 07:47:54,541 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=638120.0, ans=0.125 2023-10-11 07:47:56,456 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=638166.6666666666, ans=0.0 2023-10-11 07:47:57,413 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=638166.6666666666, ans=0.125 2023-10-11 07:48:18,195 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.87 vs. limit=15.0 2023-10-11 07:48:22,604 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.04 vs. limit=15.0 2023-10-11 07:48:23,385 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=638260.0, ans=0.125 2023-10-11 07:49:18,476 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=638493.3333333334, ans=0.1 2023-10-11 07:49:20,409 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=638493.3333333334, ans=0.05 2023-10-11 07:49:27,427 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.454e+02 1.748e+02 1.867e+02 2.110e+02 3.204e+02, threshold=3.735e+02, percent-clipped=0.0 2023-10-11 07:49:46,682 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=638586.6666666666, ans=0.125 2023-10-11 07:49:53,523 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=14.50 vs. limit=22.5 2023-10-11 07:49:56,269 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=638633.3333333334, ans=0.0 2023-10-11 07:50:12,510 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=638726.6666666666, ans=0.0 2023-10-11 07:50:17,110 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=638726.6666666666, ans=0.1 2023-10-11 07:51:03,015 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=638913.3333333334, ans=0.125 2023-10-11 07:51:11,558 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=638960.0, ans=0.125 2023-10-11 07:51:13,839 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.409e+02 1.712e+02 1.931e+02 2.200e+02 2.947e+02, threshold=3.862e+02, percent-clipped=0.0 2023-10-11 07:51:25,127 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=639053.3333333334, ans=0.125 2023-10-11 07:51:40,776 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=639100.0, ans=0.125 2023-10-11 07:51:45,905 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.16 vs. 
limit=12.0 2023-10-11 07:51:50,427 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=639146.6666666666, ans=0.025 2023-10-11 07:51:55,120 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=639146.6666666666, ans=0.0 2023-10-11 07:51:56,456 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.57 vs. limit=12.0 2023-10-11 07:51:56,941 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=639146.6666666666, ans=0.125 2023-10-11 07:52:24,625 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=639286.6666666666, ans=0.125 2023-10-11 07:52:25,759 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=639286.6666666666, ans=0.125 2023-10-11 07:52:31,906 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=639333.3333333334, ans=0.125 2023-10-11 07:52:49,421 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.74 vs. limit=10.0 2023-10-11 07:52:52,000 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.26 vs. limit=15.0 2023-10-11 07:52:55,668 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=639426.6666666666, ans=0.09899494936611666 2023-10-11 07:53:06,676 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=639426.6666666666, ans=0.2 2023-10-11 07:53:08,338 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.439e+02 1.744e+02 1.959e+02 2.289e+02 3.637e+02, threshold=3.919e+02, percent-clipped=0.0 2023-10-11 07:53:20,570 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=17.47 vs. limit=22.5 2023-10-11 07:53:21,039 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=639520.0, ans=0.125 2023-10-11 07:53:31,703 INFO [train.py:1031] (3/4) Epoch 11, batch 500, loss[loss=0.1956, simple_loss=0.2935, pruned_loss=0.04882, over 16884.00 frames. ], tot_loss[loss=0.2051, simple_loss=0.2934, pruned_loss=0.05845, over 7291272.43 frames. ], batch size: 104, lr: 3.32e-03, grad_scale: 16.0 2023-10-11 07:53:35,954 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.35 vs. limit=15.0 2023-10-11 07:53:40,573 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=639613.3333333334, ans=0.0 2023-10-11 07:54:03,414 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=639706.6666666666, ans=0.2 2023-10-11 07:54:15,208 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.14 vs. 
limit=22.5 2023-10-11 07:54:28,929 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=639800.0, ans=0.0 2023-10-11 07:54:36,449 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=639846.6666666666, ans=0.125 2023-10-11 07:54:39,489 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=639846.6666666666, ans=0.05 2023-10-11 07:54:50,180 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=639893.3333333334, ans=0.125 2023-10-11 07:54:51,063 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=639893.3333333334, ans=0.125 2023-10-11 07:54:55,429 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=639893.3333333334, ans=0.125 2023-10-11 07:54:58,417 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=639940.0, ans=0.125 2023-10-11 07:54:58,937 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.408e+02 1.761e+02 1.926e+02 2.143e+02 2.774e+02, threshold=3.852e+02, percent-clipped=0.0 2023-10-11 07:54:59,260 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=639940.0, ans=0.125 2023-10-11 07:55:08,760 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=639986.6666666666, ans=0.0 2023-10-11 07:55:17,969 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=639986.6666666666, ans=0.1 2023-10-11 07:55:30,575 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.86 vs. limit=6.0 2023-10-11 07:55:54,272 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.74 vs. 
limit=10.0 2023-10-11 07:56:04,072 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=640220.0, ans=0.125 2023-10-11 07:56:24,792 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=640313.3333333334, ans=0.0 2023-10-11 07:56:45,199 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=640406.6666666666, ans=10.0 2023-10-11 07:56:45,216 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=640406.6666666666, ans=0.1 2023-10-11 07:56:47,482 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.387e+02 1.708e+02 1.873e+02 2.125e+02 3.173e+02, threshold=3.746e+02, percent-clipped=0.0 2023-10-11 07:57:09,934 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=640500.0, ans=0.125 2023-10-11 07:57:18,166 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=640546.6666666666, ans=0.125 2023-10-11 07:57:20,440 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.74 vs. limit=10.0 2023-10-11 07:57:51,482 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=640686.6666666666, ans=0.0 2023-10-11 07:57:55,606 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=640686.6666666666, ans=0.125 2023-10-11 07:57:58,143 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=640686.6666666666, ans=0.125 2023-10-11 07:58:01,352 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.60 vs. limit=12.0 2023-10-11 07:58:18,051 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=640780.0, ans=0.0 2023-10-11 07:58:39,326 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.403e+02 1.748e+02 1.934e+02 2.275e+02 4.012e+02, threshold=3.867e+02, percent-clipped=1.0 2023-10-11 07:58:52,862 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.64 vs. limit=15.0 2023-10-11 07:58:55,514 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=640920.0, ans=0.2 2023-10-11 07:59:04,246 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=640966.6666666666, ans=0.125 2023-10-11 07:59:06,664 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=3.114e-02 2023-10-11 07:59:20,049 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=641013.3333333334, ans=0.125 2023-10-11 07:59:27,454 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.47 vs. 
limit=15.0 2023-10-11 07:59:37,125 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=641106.6666666666, ans=0.0 2023-10-11 07:59:50,609 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=641153.3333333334, ans=0.125 2023-10-11 08:00:30,479 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=641293.3333333334, ans=0.125 2023-10-11 08:00:37,146 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.309e+02 1.673e+02 1.803e+02 2.035e+02 2.555e+02, threshold=3.607e+02, percent-clipped=0.0 2023-10-11 08:00:43,735 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=641340.0, ans=0.1 2023-10-11 08:00:50,319 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=641386.6666666666, ans=0.125 2023-10-11 08:00:53,076 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=641386.6666666666, ans=0.125 2023-10-11 08:00:54,528 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=641386.6666666666, ans=0.0 2023-10-11 08:00:58,698 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.84 vs. limit=15.0 2023-10-11 08:01:09,661 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.61 vs. limit=15.0 2023-10-11 08:01:26,185 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=641526.6666666666, ans=0.125 2023-10-11 08:01:26,284 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.69 vs. limit=15.0 2023-10-11 08:01:30,455 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=641573.3333333334, ans=0.125 2023-10-11 08:01:30,547 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=641573.3333333334, ans=0.125 2023-10-11 08:01:56,900 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=641666.6666666666, ans=0.2 2023-10-11 08:02:11,338 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=641713.3333333334, ans=0.5 2023-10-11 08:02:11,455 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=641713.3333333334, ans=0.2 2023-10-11 08:02:14,936 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=641713.3333333334, ans=0.0 2023-10-11 08:02:17,628 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=641760.0, ans=0.125 2023-10-11 08:02:27,715 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.59 vs. 
limit=15.0 2023-10-11 08:02:28,936 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.309e+02 1.545e+02 1.759e+02 1.878e+02 3.047e+02, threshold=3.518e+02, percent-clipped=0.0 2023-10-11 08:02:29,552 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=10.68 vs. limit=22.5 2023-10-11 08:02:41,869 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=641853.3333333334, ans=0.125 2023-10-11 08:02:48,695 INFO [train.py:1031] (3/4) Epoch 11, batch 1000, loss[loss=0.1937, simple_loss=0.2931, pruned_loss=0.0472, over 16847.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.2945, pruned_loss=0.05867, over 12955372.57 frames. ], batch size: 87, lr: 3.31e-03, grad_scale: 32.0 2023-10-11 08:03:03,873 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=641946.6666666666, ans=0.125 2023-10-11 08:03:18,276 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=641993.3333333334, ans=0.125 2023-10-11 08:03:33,905 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=642086.6666666666, ans=0.0 2023-10-11 08:04:02,504 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=642226.6666666666, ans=0.125 2023-10-11 08:04:13,918 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.410e+02 1.693e+02 1.909e+02 2.260e+02 3.030e+02, threshold=3.819e+02, percent-clipped=0.0 2023-10-11 08:04:19,730 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=642273.3333333334, ans=0.2 2023-10-11 08:04:20,677 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=642273.3333333334, ans=0.0 2023-10-11 08:04:30,699 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=642320.0, ans=0.125 2023-10-11 08:04:32,604 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=642320.0, ans=0.1 2023-10-11 08:04:35,307 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=642366.6666666666, ans=0.125 2023-10-11 08:04:36,336 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=642366.6666666666, ans=0.125 2023-10-11 08:04:40,857 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.82 vs. 
limit=15.0 2023-10-11 08:05:13,664 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=642506.6666666666, ans=0.0 2023-10-11 08:05:24,476 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=642553.3333333334, ans=0.125 2023-10-11 08:05:37,739 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=642600.0, ans=0.125 2023-10-11 08:05:38,617 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=642600.0, ans=0.125 2023-10-11 08:05:38,754 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=642600.0, ans=0.125 2023-10-11 08:05:50,282 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.65 vs. limit=10.0 2023-10-11 08:06:02,835 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=642693.3333333334, ans=0.125 2023-10-11 08:06:04,669 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=642693.3333333334, ans=0.125 2023-10-11 08:06:12,575 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=642693.3333333334, ans=0.1 2023-10-11 08:06:12,883 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=11.33 vs. limit=15.0 2023-10-11 08:06:17,515 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.359e+02 1.690e+02 1.863e+02 2.133e+02 2.919e+02, threshold=3.727e+02, percent-clipped=0.0 2023-10-11 08:06:20,703 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.08 vs. limit=6.0 2023-10-11 08:06:28,845 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=642786.6666666666, ans=0.125 2023-10-11 08:06:44,222 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=642833.3333333334, ans=0.0 2023-10-11 08:06:47,546 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=642833.3333333334, ans=0.125 2023-10-11 08:07:08,420 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=9.71 vs. 
2023-10-11 08:07:10,271 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=642973.3333333334, ans=0.0 2023-10-11 08:07:14,855 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=642973.3333333334, ans=0.125 2023-10-11 08:07:16,821 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=642973.3333333334, ans=0.125 2023-10-11 08:07:31,422 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=643020.0, ans=15.0 2023-10-11 08:07:35,146 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=643066.6666666666, ans=0.2 2023-10-11 08:07:43,148 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=643113.3333333334, ans=0.125 2023-10-11 08:07:50,283 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=643113.3333333334, ans=0.125 2023-10-11 08:07:53,295 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=643160.0, ans=0.125 2023-10-11 08:08:03,394 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=643160.0, ans=0.0 2023-10-11 08:08:06,518 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.289e+02 1.633e+02 1.809e+02 2.052e+02 3.009e+02, threshold=3.618e+02, percent-clipped=0.0 2023-10-11 08:08:18,914 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.69 vs. limit=10.0 2023-10-11 08:08:32,042 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=643300.0, ans=0.0 2023-10-11 08:08:58,917 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=643393.3333333334, ans=0.95 2023-10-11 08:09:05,757 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.52 vs.
limit=10.0 2023-10-11 08:09:12,529 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=643486.6666666666, ans=0.09899494936611666 2023-10-11 08:09:38,870 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=643580.0, ans=0.125 2023-10-11 08:09:40,744 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=643626.6666666666, ans=0.125 2023-10-11 08:09:43,378 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=643626.6666666666, ans=0.0 2023-10-11 08:09:44,404 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=643626.6666666666, ans=0.2 2023-10-11 08:09:44,443 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=643626.6666666666, ans=0.125 2023-10-11 08:09:54,792 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.365e+02 1.635e+02 1.801e+02 2.048e+02 2.753e+02, threshold=3.602e+02, percent-clipped=0.0 2023-10-11 08:10:01,966 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.86 vs. limit=22.5 2023-10-11 08:10:13,725 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.01 vs. limit=15.0 2023-10-11 08:10:14,422 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=643766.6666666666, ans=0.95 2023-10-11 08:10:26,827 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=16.14 vs. 
limit=22.5 2023-10-11 08:10:32,419 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=643813.3333333334, ans=0.125 2023-10-11 08:10:40,309 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=643860.0, ans=0.125 2023-10-11 08:10:41,330 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=643860.0, ans=0.125 2023-10-11 08:10:58,451 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=643906.6666666666, ans=0.09899494936611666 2023-10-11 08:10:59,341 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=643906.6666666666, ans=0.95 2023-10-11 08:10:59,354 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=643906.6666666666, ans=0.0 2023-10-11 08:11:01,375 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=643953.3333333334, ans=0.2 2023-10-11 08:11:03,546 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=643953.3333333334, ans=0.125 2023-10-11 08:11:06,364 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=643953.3333333334, ans=0.0 2023-10-11 08:11:13,556 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=644000.0, ans=0.1 2023-10-11 08:11:32,734 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.54 vs. limit=15.0 2023-10-11 08:11:33,425 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=644046.6666666666, ans=0.0 2023-10-11 08:11:36,679 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.81 vs. limit=15.0 2023-10-11 08:11:46,631 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.322e+02 1.695e+02 1.901e+02 2.189e+02 3.215e+02, threshold=3.801e+02, percent-clipped=0.0 2023-10-11 08:12:11,759 INFO [train.py:1031] (3/4) Epoch 11, batch 1500, loss[loss=0.1985, simple_loss=0.2829, pruned_loss=0.05707, over 16978.00 frames. ], tot_loss[loss=0.2038, simple_loss=0.2924, pruned_loss=0.05758, over 17376002.82 frames. ], batch size: 123, lr: 3.31e-03, grad_scale: 32.0 2023-10-11 08:12:24,820 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=644280.0, ans=0.1 2023-10-11 08:12:25,725 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=644280.0, ans=0.0 2023-10-11 08:12:50,812 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=644373.3333333334, ans=0.0 2023-10-11 08:13:39,272 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.80 vs. limit=12.0 2023-10-11 08:13:41,928 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.47 vs. 
limit=15.0 2023-10-11 08:13:42,512 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.302e+02 1.743e+02 1.906e+02 2.177e+02 3.390e+02, threshold=3.812e+02, percent-clipped=0.0 2023-10-11 08:13:44,426 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.74 vs. limit=6.0 2023-10-11 08:13:53,837 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=644653.3333333334, ans=0.1 2023-10-11 08:13:58,453 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=644653.3333333334, ans=0.125 2023-10-11 08:14:04,037 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=644700.0, ans=0.125 2023-10-11 08:14:25,389 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=644793.3333333334, ans=0.0 2023-10-11 08:14:32,161 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.82 vs. limit=15.0 2023-10-11 08:15:10,997 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=644933.3333333334, ans=0.035 2023-10-11 08:15:13,700 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.58 vs. limit=12.0 2023-10-11 08:15:19,641 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=6.52 vs. limit=15.0 2023-10-11 08:15:26,130 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=644980.0, ans=0.1 2023-10-11 08:15:31,844 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=645026.6666666666, ans=0.1 2023-10-11 08:15:41,302 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.370e+02 1.672e+02 1.834e+02 1.966e+02 3.140e+02, threshold=3.669e+02, percent-clipped=0.0 2023-10-11 08:15:45,499 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=645073.3333333334, ans=0.1 2023-10-11 08:15:57,043 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=645120.0, ans=0.0 2023-10-11 08:15:59,222 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.89 vs. limit=15.0 2023-10-11 08:16:13,423 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.09 vs. limit=15.0 2023-10-11 08:16:21,798 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=645260.0, ans=0.125 2023-10-11 08:16:22,100 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.33 vs. 
limit=15.0 2023-10-11 08:16:26,425 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=645260.0, ans=0.09899494936611666 2023-10-11 08:16:33,704 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.35 vs. limit=15.0 2023-10-11 08:16:35,226 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.53 vs. limit=15.0 2023-10-11 08:16:54,122 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=7.42 vs. limit=22.5 2023-10-11 08:17:00,543 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=645400.0, ans=0.125 2023-10-11 08:17:31,318 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.315e+02 1.697e+02 1.914e+02 2.087e+02 2.965e+02, threshold=3.827e+02, percent-clipped=0.0 2023-10-11 08:17:40,700 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=13.61 vs. limit=15.0 2023-10-11 08:17:44,864 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.66 vs. limit=15.0 2023-10-11 08:18:08,263 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=645633.3333333334, ans=0.1 2023-10-11 08:18:09,155 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=645680.0, ans=0.0 2023-10-11 08:18:25,179 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=645726.6666666666, ans=0.125 2023-10-11 08:18:39,500 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=4.79 vs. limit=15.0 2023-10-11 08:18:46,737 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=645820.0, ans=0.125 2023-10-11 08:19:03,732 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=645866.6666666666, ans=0.125 2023-10-11 08:19:07,697 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=645913.3333333334, ans=0.1 2023-10-11 08:19:08,574 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=645913.3333333334, ans=0.2 2023-10-11 08:19:22,392 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.48 vs. 
limit=15.0 2023-10-11 08:19:25,139 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=645960.0, ans=0.125 2023-10-11 08:19:27,647 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.443e+02 1.719e+02 1.956e+02 2.144e+02 3.870e+02, threshold=3.911e+02, percent-clipped=1.0 2023-10-11 08:19:35,504 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=646006.6666666666, ans=0.125 2023-10-11 08:19:45,448 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=646053.3333333334, ans=0.0 2023-10-11 08:19:54,342 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 08:20:20,771 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=646240.0, ans=0.0 2023-10-11 08:20:36,765 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=646286.6666666666, ans=0.125 2023-10-11 08:20:58,124 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.87 vs. limit=15.0 2023-10-11 08:20:59,686 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.55 vs. limit=15.0 2023-10-11 08:21:17,875 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=646380.0, ans=0.2 2023-10-11 08:21:34,232 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=15.01 vs. limit=22.5 2023-10-11 08:21:35,527 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.431e+02 1.660e+02 1.812e+02 2.162e+02 3.366e+02, threshold=3.625e+02, percent-clipped=0.0 2023-10-11 08:21:44,315 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=646520.0, ans=0.125 2023-10-11 08:21:57,679 INFO [train.py:1031] (3/4) Epoch 11, batch 2000, loss[loss=0.2058, simple_loss=0.3018, pruned_loss=0.05488, over 16926.00 frames. ], tot_loss[loss=0.2043, simple_loss=0.293, pruned_loss=0.05782, over 20780091.62 frames. 
], batch size: 110, lr: 3.30e-03, grad_scale: 32.0 2023-10-11 08:22:02,364 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=646566.6666666666, ans=0.0 2023-10-11 08:22:53,998 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=646753.3333333334, ans=0.2 2023-10-11 08:23:13,099 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=646800.0, ans=0.1 2023-10-11 08:23:43,341 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.394e+02 1.659e+02 1.847e+02 2.128e+02 3.565e+02, threshold=3.694e+02, percent-clipped=0.0 2023-10-11 08:23:45,656 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=646940.0, ans=0.09899494936611666 2023-10-11 08:23:56,201 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=646986.6666666666, ans=0.1 2023-10-11 08:24:06,930 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=647033.3333333334, ans=0.025 2023-10-11 08:24:17,677 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=647033.3333333334, ans=0.0 2023-10-11 08:24:24,109 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=647033.3333333334, ans=0.2 2023-10-11 08:24:28,545 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=647080.0, ans=0.125 2023-10-11 08:24:33,973 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=647080.0, ans=0.1 2023-10-11 08:25:01,402 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=647173.3333333334, ans=0.0 2023-10-11 08:25:01,436 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=647173.3333333334, ans=0.0 2023-10-11 08:25:20,153 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=647220.0, ans=0.0 2023-10-11 08:25:36,484 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=647313.3333333334, ans=0.125 2023-10-11 08:25:37,284 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=647313.3333333334, ans=0.0 2023-10-11 08:25:37,382 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=647313.3333333334, ans=0.125 2023-10-11 08:25:38,225 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=647313.3333333334, ans=0.0 2023-10-11 08:25:54,538 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=647360.0, ans=0.1 2023-10-11 08:25:54,937 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.59 vs. limit=5.0
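
In the train.py:1031 entries, loss[...] is the current batch and tot_loss[...] is an average over an ever-growing, fractional frame count (12955372.57 frames at batch 1000, 20780091.62 by batch 2000 just above). A fractional count is what a frame-weighted running sum with an exponential forget factor would produce; a sketch under that assumption (the decay constant is assumed, not taken from the code):

```python
class RunningLossTracker:
    """Frame-weighted running averages like tot_loss[...], with an
    exponential forget factor applied once per batch."""

    def __init__(self, decay: float = 0.9995):  # assumed constant
        self.decay = decay
        self.frames = 0.0
        self.sums = {}  # loss name -> decayed sum of value * frames

    def update(self, losses, num_frames):
        self.frames = self.frames * self.decay + num_frames
        for name, value in losses.items():
            self.sums[name] = self.sums.get(name, 0.0) * self.decay + value * num_frames

    def averages(self):
        return {name: s / self.frames for name, s in self.sums.items()}

stats = RunningLossTracker()
stats.update({"loss": 0.2058, "simple_loss": 0.3018}, 16926.0)
print(stats.averages())  # equals the per-batch values after one update
```
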
2023-10-11 08:26:00,844 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.452e+02 1.769e+02 1.954e+02 2.323e+02 3.601e+02, threshold=3.907e+02, percent-clipped=0.0 2023-10-11 08:26:05,155 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=647406.6666666666, ans=0.125 2023-10-11 08:26:09,526 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.91 vs. limit=22.5 2023-10-11 08:26:10,145 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=647453.3333333334, ans=0.125 2023-10-11 08:26:10,173 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=647453.3333333334, ans=0.2 2023-10-11 08:26:11,187 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=647453.3333333334, ans=0.0 2023-10-11 08:26:11,230 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=647453.3333333334, ans=0.125 2023-10-11 08:26:23,226 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.whiten.whitening_limit, batch_count=647500.0, ans=12.0 2023-10-11 08:26:27,850 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=647500.0, ans=0.0 2023-10-11 08:26:31,564 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=647546.6666666666, ans=0.0 2023-10-11 08:26:35,462 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=647546.6666666666, ans=0.125 2023-10-11 08:26:38,888 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.93 vs. limit=12.0
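
Each scaling.py:979 Whitening entry compares a per-module metric against a limit (the entry just above: metric=3.93 vs. limit=12.0 for a 256-channel activation with num_groups=1; the whitening_limit values themselves appear as ScheduledFloats). A plausible reconstruction of the metric is a whiteness measure of the grouped feature covariance, mean(eig^2) / mean(eig)^2, which equals 1.0 when the covariance is a multiple of the identity and grows with the eigenvalue spread; exceeding the limit is what would trigger a corrective penalty. A sketch of that measure, not the verbatim scaling.py code:

```python
import torch

def whitening_metric(x: torch.Tensor, num_groups: int) -> torch.Tensor:
    """Whiteness of the per-group covariance: mean(eig^2) / mean(eig)^2 >= 1.

    x: (num_frames, num_channels); channels split evenly into num_groups.
    """
    num_frames, num_channels = x.shape
    c = num_channels // num_groups  # channels per group; assumes divisibility
    x = x.reshape(num_frames, num_groups, c).transpose(0, 1)  # (g, f, c)
    x = x - x.mean(dim=1, keepdim=True)  # center each group over frames
    cov = torch.matmul(x.transpose(1, 2), x) / num_frames  # (g, c, c)
    mean_eig = cov.diagonal(dim1=1, dim2=2).mean()  # trace/c, group-averaged
    # for symmetric cov, trace(cov @ cov) == (cov ** 2).sum()
    mean_eig_sq = (cov ** 2).sum() / (num_groups * c)
    return mean_eig_sq / (mean_eig ** 2 + 1e-20)

# i.i.d. noise scores close to 1; strongly correlated features score far higher
print(whitening_metric(torch.randn(1000, 256), num_groups=1))
```
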
2023-10-11 08:26:42,670 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=647593.3333333334, ans=15.0 2023-10-11 08:26:43,077 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=647593.3333333334, ans=0.025 2023-10-11 08:27:47,209 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=647873.3333333334, ans=0.0 2023-10-11 08:27:49,284 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=647873.3333333334, ans=0.125 2023-10-11 08:27:49,847 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.434e+02 1.699e+02 1.899e+02 2.245e+02 2.904e+02, threshold=3.798e+02, percent-clipped=0.0 2023-10-11 08:27:53,527 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=647873.3333333334, ans=0.0 2023-10-11 08:28:13,130 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=647966.6666666666, ans=0.0 2023-10-11 08:28:14,192 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=647966.6666666666, ans=0.07 2023-10-11 08:28:31,664 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.10 vs. limit=15.0 2023-10-11 08:28:57,353 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.45 vs. limit=15.0 2023-10-11 08:29:24,680 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=648246.6666666666, ans=0.125 2023-10-11 08:29:41,672 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.431e+02 1.698e+02 1.826e+02 1.983e+02 2.729e+02, threshold=3.651e+02, percent-clipped=0.0 2023-10-11 08:29:43,388 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=648340.0, ans=0.04949747468305833 2023-10-11 08:29:49,115 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=648386.6666666666, ans=0.125 2023-10-11 08:29:58,095 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.21 vs.
limit=15.0 2023-10-11 08:30:11,543 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=648480.0, ans=0.125 2023-10-11 08:30:26,368 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=648526.6666666666, ans=0.0 2023-10-11 08:30:42,340 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=648620.0, ans=15.0 2023-10-11 08:30:45,942 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=648620.0, ans=0.05 2023-10-11 08:30:49,416 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=648620.0, ans=0.125 2023-10-11 08:30:50,618 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=648620.0, ans=0.125 2023-10-11 08:31:02,614 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.94 vs. limit=15.0 2023-10-11 08:31:17,600 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=648760.0, ans=0.035 2023-10-11 08:31:27,242 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.414e+02 1.685e+02 1.871e+02 2.097e+02 2.995e+02, threshold=3.742e+02, percent-clipped=0.0 2023-10-11 08:31:35,310 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.01 vs. limit=15.0 2023-10-11 08:31:37,163 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=648853.3333333334, ans=0.125 2023-10-11 08:31:43,926 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=648853.3333333334, ans=0.2 2023-10-11 08:31:45,766 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=648853.3333333334, ans=0.125 2023-10-11 08:31:47,621 INFO [train.py:1031] (3/4) Epoch 11, batch 2500, loss[loss=0.2065, simple_loss=0.3022, pruned_loss=0.05543, over 16834.00 frames. ], tot_loss[loss=0.2044, simple_loss=0.2931, pruned_loss=0.05784, over 23461327.03 frames. ], batch size: 87, lr: 3.29e-03, grad_scale: 32.0 2023-10-11 08:31:48,280 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.40 vs. 
limit=6.0 2023-10-11 08:31:51,723 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 08:31:56,815 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=648900.0, ans=0.0 2023-10-11 08:32:02,988 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1.whitening_limit, batch_count=648946.6666666666, ans=10.0 2023-10-11 08:32:08,460 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=648993.3333333334, ans=0.0 2023-10-11 08:32:10,396 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=648993.3333333334, ans=0.0 2023-10-11 08:32:15,457 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.19 vs. limit=22.5 2023-10-11 08:32:53,513 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.03 vs. limit=22.5 2023-10-11 08:32:56,834 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=649180.0, ans=0.125 2023-10-11 08:33:10,081 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.41 vs. limit=6.0 2023-10-11 08:33:17,028 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.508e+02 1.733e+02 1.893e+02 2.172e+02 3.079e+02, threshold=3.787e+02, percent-clipped=0.0 2023-10-11 08:33:28,901 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.90 vs. limit=12.0 2023-10-11 08:33:32,951 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.68 vs. limit=15.0 2023-10-11 08:34:37,153 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=649600.0, ans=0.125 2023-10-11 08:34:47,994 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.37 vs. 
limit=12.0 2023-10-11 08:35:01,805 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=649693.3333333334, ans=0.07 2023-10-11 08:35:06,486 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=649740.0, ans=0.1 2023-10-11 08:35:08,027 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.331e+02 1.715e+02 1.917e+02 2.322e+02 3.352e+02, threshold=3.835e+02, percent-clipped=0.0 2023-10-11 08:35:23,162 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=649786.6666666666, ans=0.125 2023-10-11 08:35:47,118 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=649880.0, ans=0.0 2023-10-11 08:35:48,668 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=649880.0, ans=0.125 2023-10-11 08:35:53,997 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=649926.6666666666, ans=0.0 2023-10-11 08:35:55,279 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.97 vs. limit=15.0 2023-10-11 08:36:03,249 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=649973.3333333334, ans=0.07 2023-10-11 08:36:03,991 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=649973.3333333334, ans=0.125 2023-10-11 08:36:04,047 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 08:36:10,945 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=649973.3333333334, ans=0.1 2023-10-11 08:37:03,294 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=650160.0, ans=0.125 2023-10-11 08:37:08,292 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.391e+02 1.687e+02 1.856e+02 2.162e+02 3.629e+02, threshold=3.712e+02, percent-clipped=0.0 2023-10-11 08:37:10,838 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=650206.6666666666, ans=0.125 2023-10-11 08:37:50,902 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=650346.6666666666, ans=0.09899494936611666 2023-10-11 08:37:52,788 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=650393.3333333334, ans=0.0 2023-10-11 08:38:18,854 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=650486.6666666666, ans=0.125 2023-10-11 08:38:38,643 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.96 vs. 
limit=12.0 2023-10-11 08:38:42,827 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=650580.0, ans=0.1 2023-10-11 08:38:53,831 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.26 vs. limit=15.0 2023-10-11 08:39:06,863 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.79 vs. limit=15.0 2023-10-11 08:39:11,046 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.283e+02 1.735e+02 1.922e+02 2.122e+02 3.124e+02, threshold=3.844e+02, percent-clipped=0.0 2023-10-11 08:39:15,597 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.27 vs. limit=15.0 2023-10-11 08:39:18,236 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=650673.3333333334, ans=0.125 2023-10-11 08:39:18,271 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=650673.3333333334, ans=0.2 2023-10-11 08:39:18,374 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=650673.3333333334, ans=10.0 2023-10-11 08:39:18,613 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.15 vs. limit=15.0 2023-10-11 08:39:20,756 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=650720.0, ans=0.125 2023-10-11 08:39:28,088 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.84 vs. limit=12.0 2023-10-11 08:39:46,645 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.33 vs. limit=10.0 2023-10-11 08:39:51,611 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=650813.3333333334, ans=0.125 2023-10-11 08:39:55,488 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=650813.3333333334, ans=0.125 2023-10-11 08:39:58,423 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=650860.0, ans=0.125 2023-10-11 08:40:11,731 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=650906.6666666666, ans=0.125 2023-10-11 08:40:17,799 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=650953.3333333334, ans=0.125 2023-10-11 08:40:42,182 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.08 vs. limit=15.0 2023-10-11 08:40:48,833 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.28 vs. 
limit=15.0 2023-10-11 08:40:50,920 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.43 vs. limit=22.5 2023-10-11 08:41:03,450 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=651140.0, ans=0.125 2023-10-11 08:41:04,171 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.377e+02 1.658e+02 1.870e+02 2.054e+02 2.597e+02, threshold=3.740e+02, percent-clipped=0.0 2023-10-11 08:41:11,418 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=651186.6666666666, ans=0.125 2023-10-11 08:41:23,450 INFO [train.py:1031] (3/4) Epoch 11, batch 3000, loss[loss=0.1957, simple_loss=0.2863, pruned_loss=0.05257, over 16560.00 frames. ], tot_loss[loss=0.2045, simple_loss=0.2927, pruned_loss=0.05814, over 25541668.66 frames. ], batch size: 56, lr: 3.29e-03, grad_scale: 32.0 2023-10-11 08:41:24,763 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=651233.3333333334, ans=0.2 2023-10-11 08:41:26,457 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=651233.3333333334, ans=10.0 2023-10-11 08:42:04,637 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=651373.3333333334, ans=0.07 2023-10-11 08:42:10,722 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=651420.0, ans=0.0 2023-10-11 08:42:10,945 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=12.63 vs. 
limit=22.5 2023-10-11 08:42:13,991 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=651420.0, ans=0.0 2023-10-11 08:42:21,001 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=651466.6666666666, ans=0.0 2023-10-11 08:42:25,509 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=651466.6666666666, ans=0.1 2023-10-11 08:42:31,457 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=651513.3333333334, ans=0.2 2023-10-11 08:42:31,503 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=651513.3333333334, ans=0.025 2023-10-11 08:42:42,324 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=651560.0, ans=0.2 2023-10-11 08:42:45,175 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=651560.0, ans=0.125 2023-10-11 08:42:48,818 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=651560.0, ans=0.1 2023-10-11 08:42:49,817 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=651606.6666666666, ans=0.2 2023-10-11 08:42:52,791 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.347e+02 1.771e+02 1.903e+02 2.092e+02 3.111e+02, threshold=3.806e+02, percent-clipped=0.0 2023-10-11 08:43:00,718 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.17 vs. limit=15.0 2023-10-11 08:43:10,337 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=651653.3333333334, ans=0.1 2023-10-11 08:43:16,355 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.49 vs. limit=10.0 2023-10-11 08:43:17,821 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=651700.0, ans=0.125 2023-10-11 08:43:30,712 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=651746.6666666666, ans=0.1 2023-10-11 08:43:32,274 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.30 vs. limit=6.0 2023-10-11 08:43:55,125 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=651840.0, ans=0.0 2023-10-11 08:44:34,195 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.48 vs. limit=6.0 2023-10-11 08:44:35,925 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=652026.6666666666, ans=0.2 2023-10-11 08:44:39,547 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=12.04 vs. 
limit=15.0 2023-10-11 08:44:45,450 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.483e+02 1.678e+02 1.871e+02 2.178e+02 3.093e+02, threshold=3.741e+02, percent-clipped=0.0 2023-10-11 08:44:48,648 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=652073.3333333334, ans=0.0 2023-10-11 08:44:52,150 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=652073.3333333334, ans=0.1 2023-10-11 08:44:56,693 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.97 vs. limit=12.0 2023-10-11 08:45:30,370 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=652260.0, ans=0.1 2023-10-11 08:45:47,775 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=652306.6666666666, ans=0.1 2023-10-11 08:45:49,132 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=652306.6666666666, ans=0.125 2023-10-11 08:45:50,937 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=652306.6666666666, ans=0.2 2023-10-11 08:45:50,972 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=652306.6666666666, ans=0.125 2023-10-11 08:45:58,106 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=652353.3333333334, ans=0.0 2023-10-11 08:46:19,366 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.54 vs. limit=12.0 2023-10-11 08:46:24,122 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.39 vs. limit=15.0 2023-10-11 08:46:38,938 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=652493.3333333334, ans=0.0 2023-10-11 08:46:47,521 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.411e+02 1.661e+02 1.872e+02 2.041e+02 3.337e+02, threshold=3.745e+02, percent-clipped=0.0 2023-10-11 08:46:48,923 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=652540.0, ans=0.125 2023-10-11 08:46:55,892 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=652586.6666666666, ans=0.125 2023-10-11 08:46:57,013 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.05 vs. limit=15.0 2023-10-11 08:47:15,361 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.53 vs. 
limit=12.0 2023-10-11 08:47:20,606 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=652680.0, ans=0.1 2023-10-11 08:47:21,935 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=652680.0, ans=0.125 2023-10-11 08:47:22,229 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=652680.0, ans=15.0 2023-10-11 08:47:22,229 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.26 vs. limit=15.0 2023-10-11 08:47:32,973 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=652726.6666666666, ans=0.2 2023-10-11 08:47:33,033 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=652726.6666666666, ans=0.2 2023-10-11 08:47:49,644 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=652773.3333333334, ans=0.2 2023-10-11 08:48:07,780 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=652866.6666666666, ans=0.0 2023-10-11 08:48:12,370 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 08:48:16,499 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.53 vs. limit=6.0 2023-10-11 08:48:20,326 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=652913.3333333334, ans=0.125 2023-10-11 08:48:20,435 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=652913.3333333334, ans=0.0 2023-10-11 08:48:25,466 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=652960.0, ans=0.0 2023-10-11 08:48:40,148 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.390e+02 1.672e+02 1.857e+02 2.103e+02 3.532e+02, threshold=3.715e+02, percent-clipped=0.0 2023-10-11 08:48:48,321 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 08:48:52,316 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=653053.3333333334, ans=0.09899494936611666 2023-10-11 08:48:58,351 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=653053.3333333334, ans=0.125 2023-10-11 08:48:58,483 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.68 vs. 
limit=15.0 2023-10-11 08:49:02,197 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=653100.0, ans=0.0 2023-10-11 08:49:06,821 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=653100.0, ans=0.125 2023-10-11 08:49:16,346 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=653146.6666666666, ans=0.125 2023-10-11 08:49:28,643 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=653193.3333333334, ans=0.125 2023-10-11 08:49:42,198 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=653240.0, ans=0.0 2023-10-11 08:49:42,598 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.30 vs. limit=15.0 2023-10-11 08:49:44,927 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=653286.6666666666, ans=0.125 2023-10-11 08:49:48,558 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=11.89 vs. limit=22.5 2023-10-11 08:49:48,684 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.70 vs. limit=22.5 2023-10-11 08:49:58,328 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=653333.3333333334, ans=0.0 2023-10-11 08:50:00,979 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=653333.3333333334, ans=0.1 2023-10-11 08:50:18,912 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=653426.6666666666, ans=0.2 2023-10-11 08:50:29,362 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.373e+02 1.732e+02 1.893e+02 2.224e+02 3.725e+02, threshold=3.786e+02, percent-clipped=1.0 2023-10-11 08:50:50,821 INFO [train.py:1031] (3/4) Epoch 11, batch 3500, loss[loss=0.1994, simple_loss=0.2655, pruned_loss=0.0666, over 12383.00 frames. ], tot_loss[loss=0.2046, simple_loss=0.2928, pruned_loss=0.05823, over 27168897.65 frames. ], batch size: 440, lr: 3.28e-03, grad_scale: 16.0 2023-10-11 08:50:51,373 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.30 vs. 
limit=15.0 2023-10-11 08:51:16,368 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=653660.0, ans=0.0 2023-10-11 08:51:32,409 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=653706.6666666666, ans=0.125 2023-10-11 08:51:32,442 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=653706.6666666666, ans=0.2 2023-10-11 08:51:35,047 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=653753.3333333334, ans=0.125 2023-10-11 08:51:44,671 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.12 vs. limit=12.0 2023-10-11 08:51:51,382 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=653800.0, ans=0.125 2023-10-11 08:51:51,742 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.23 vs. limit=22.5 2023-10-11 08:52:00,759 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=653846.6666666666, ans=0.0 2023-10-11 08:52:24,410 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=653940.0, ans=0.125 2023-10-11 08:52:28,521 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.470e+02 1.821e+02 1.990e+02 2.253e+02 3.874e+02, threshold=3.980e+02, percent-clipped=1.0 2023-10-11 08:52:30,614 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=653940.0, ans=0.125 2023-10-11 08:52:30,621 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=653940.0, ans=0.125 2023-10-11 08:52:36,676 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=653986.6666666666, ans=0.0 2023-10-11 08:52:49,398 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.47 vs. 
limit=15.0 2023-10-11 08:52:51,783 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=654033.3333333334, ans=0.2 2023-10-11 08:53:13,191 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=654126.6666666666, ans=0.07 2023-10-11 08:53:14,092 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=654126.6666666666, ans=0.1 2023-10-11 08:53:23,032 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=654173.3333333334, ans=0.125 2023-10-11 08:53:24,251 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=654173.3333333334, ans=0.0 2023-10-11 08:53:33,866 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=654173.3333333334, ans=0.125 2023-10-11 08:53:36,815 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=654220.0, ans=0.04949747468305833 2023-10-11 08:53:46,447 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=654220.0, ans=0.2 2023-10-11 08:53:47,942 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=654266.6666666666, ans=10.0 2023-10-11 08:54:03,861 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=654313.3333333334, ans=0.125 2023-10-11 08:54:08,533 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=654313.3333333334, ans=0.0 2023-10-11 08:54:15,911 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.75 vs. 
limit=15.0 2023-10-11 08:54:25,188 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.361e+02 1.659e+02 1.908e+02 2.167e+02 3.702e+02, threshold=3.816e+02, percent-clipped=0.0 2023-10-11 08:54:31,395 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=654406.6666666666, ans=0.1 2023-10-11 08:54:31,429 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=654406.6666666666, ans=0.0 2023-10-11 08:54:36,062 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=654453.3333333334, ans=0.2 2023-10-11 08:54:38,874 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=654453.3333333334, ans=0.0 2023-10-11 08:54:56,235 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=654500.0, ans=0.0 2023-10-11 08:54:57,130 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=654546.6666666666, ans=0.125 2023-10-11 08:54:58,714 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=654546.6666666666, ans=0.2 2023-10-11 08:55:12,653 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=654593.3333333334, ans=0.125 2023-10-11 08:55:26,044 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 08:55:36,455 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=654640.0, ans=0.125 2023-10-11 08:56:12,233 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.43 vs. limit=15.0 2023-10-11 08:56:28,575 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.359e+02 1.676e+02 1.839e+02 2.080e+02 2.698e+02, threshold=3.678e+02, percent-clipped=0.0 2023-10-11 08:56:28,933 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=654873.3333333334, ans=0.09899494936611666 2023-10-11 08:56:31,862 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=654873.3333333334, ans=0.2 2023-10-11 08:56:33,662 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=654873.3333333334, ans=0.125 2023-10-11 08:57:07,015 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.54 vs. 
limit=10.0 2023-10-11 08:57:21,987 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=655106.6666666666, ans=0.0 2023-10-11 08:57:31,912 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=655153.3333333334, ans=0.0 2023-10-11 08:57:34,369 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=655153.3333333334, ans=0.125 2023-10-11 08:57:43,783 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.70 vs. limit=22.5 2023-10-11 08:57:47,678 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=655200.0, ans=0.2 2023-10-11 08:57:52,416 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=655246.6666666666, ans=0.0 2023-10-11 08:57:52,591 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.45 vs. limit=15.0 2023-10-11 08:57:54,333 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=655246.6666666666, ans=0.0 2023-10-11 08:58:00,277 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=655246.6666666666, ans=0.125 2023-10-11 08:58:04,342 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.28 vs. limit=15.0 2023-10-11 08:58:15,973 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 08:58:17,488 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.392e+02 1.602e+02 1.791e+02 1.912e+02 2.975e+02, threshold=3.581e+02, percent-clipped=0.0 2023-10-11 08:58:46,743 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.83 vs. limit=15.0 2023-10-11 08:59:17,984 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.whiten.whitening_limit, batch_count=655573.3333333334, ans=12.0 2023-10-11 08:59:36,464 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=655666.6666666666, ans=0.2 2023-10-11 08:59:53,634 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=655760.0, ans=0.125 2023-10-11 09:00:03,921 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=655806.6666666666, ans=0.125 2023-10-11 09:00:08,444 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.393e+02 1.651e+02 1.830e+02 2.035e+02 2.608e+02, threshold=3.659e+02, percent-clipped=0.0 2023-10-11 09:00:17,446 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=655853.3333333334, ans=0.125 2023-10-11 09:00:26,001 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.35 vs. 
limit=15.0 2023-10-11 09:00:26,815 INFO [train.py:1031] (3/4) Epoch 11, batch 4000, loss[loss=0.1991, simple_loss=0.2843, pruned_loss=0.05702, over 16427.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.2925, pruned_loss=0.05842, over 28393794.26 frames. ], batch size: 50, lr: 3.28e-03, grad_scale: 32.0 2023-10-11 09:00:35,707 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=655900.0, ans=0.125 2023-10-11 09:00:37,108 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.68 vs. limit=22.5 2023-10-11 09:00:49,201 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=655946.6666666666, ans=0.1 2023-10-11 09:01:01,174 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=655993.3333333334, ans=0.125 2023-10-11 09:01:10,824 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=656040.0, ans=0.1 2023-10-11 09:01:30,363 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2.whitening_limit, batch_count=656133.3333333334, ans=15.0 2023-10-11 09:01:39,780 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=656180.0, ans=0.125 2023-10-11 09:01:44,921 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.81 vs. limit=22.5 2023-10-11 09:01:51,922 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=656226.6666666666, ans=0.0 2023-10-11 09:01:58,397 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=656273.3333333334, ans=0.0 2023-10-11 09:02:00,522 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.497e+02 1.747e+02 1.882e+02 2.164e+02 3.100e+02, threshold=3.764e+02, percent-clipped=0.0 2023-10-11 09:02:07,596 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=656320.0, ans=0.0 2023-10-11 09:02:09,785 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=656320.0, ans=0.125 2023-10-11 09:02:13,513 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=656320.0, ans=0.125 2023-10-11 09:02:17,794 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=656366.6666666666, ans=0.125 2023-10-11 09:02:39,806 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=656460.0, ans=0.125 2023-10-11 09:02:39,818 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=656460.0, ans=0.1 2023-10-11 09:03:01,368 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=656506.6666666666, ans=0.1 2023-10-11 09:03:08,538 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=656553.3333333334, ans=0.1 2023-10-11 09:03:14,700 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=2.408e-02 2023-10-11 09:03:20,534 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=656600.0, ans=0.0 2023-10-11 09:03:54,510 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=656693.3333333334, ans=0.125 2023-10-11 09:03:58,283 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.57 vs. limit=15.0 2023-10-11 09:04:05,014 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.95 vs. limit=15.0 2023-10-11 09:04:06,663 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.421e+02 1.736e+02 1.997e+02 2.260e+02 4.004e+02, threshold=3.993e+02, percent-clipped=2.0 2023-10-11 09:04:21,890 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.36 vs. limit=15.0 2023-10-11 09:04:27,897 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=656833.3333333334, ans=0.125 2023-10-11 09:04:46,187 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=656880.0, ans=0.0 2023-10-11 09:04:53,887 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.73 vs. limit=10.0 2023-10-11 09:05:09,840 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.16 vs. limit=15.0 2023-10-11 09:05:13,428 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=657020.0, ans=0.1 2023-10-11 09:05:15,245 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=657020.0, ans=0.125 2023-10-11 09:05:26,911 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=657066.6666666666, ans=0.125 2023-10-11 09:05:51,033 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=657160.0, ans=0.1 2023-10-11 09:05:56,755 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.90 vs. 
limit=22.5 2023-10-11 09:05:58,979 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.395e+02 1.645e+02 1.848e+02 2.109e+02 3.413e+02, threshold=3.696e+02, percent-clipped=0.0 2023-10-11 09:06:16,483 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=657300.0, ans=0.125 2023-10-11 09:06:27,214 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=657300.0, ans=0.2 2023-10-11 09:06:40,339 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=657393.3333333334, ans=0.025 2023-10-11 09:06:48,611 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 09:06:54,816 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=657440.0, ans=0.0 2023-10-11 09:06:56,782 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=657440.0, ans=0.0 2023-10-11 09:07:09,857 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.71 vs. limit=22.5 2023-10-11 09:07:11,763 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=657533.3333333334, ans=0.1 2023-10-11 09:07:11,794 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=657533.3333333334, ans=0.0 2023-10-11 09:07:15,668 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=657533.3333333334, ans=10.0 2023-10-11 09:07:17,797 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=657533.3333333334, ans=0.1 2023-10-11 09:07:32,672 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=657580.0, ans=0.1 2023-10-11 09:07:52,070 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.484e+02 1.793e+02 2.052e+02 2.305e+02 3.332e+02, threshold=4.104e+02, percent-clipped=0.0 2023-10-11 09:08:11,921 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.48 vs. 
limit=10.0 2023-10-11 09:08:52,321 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=657906.6666666666, ans=0.0 2023-10-11 09:09:07,030 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=657953.3333333334, ans=0.1 2023-10-11 09:09:08,456 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=657953.3333333334, ans=0.125 2023-10-11 09:09:23,252 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=658000.0, ans=0.2 2023-10-11 09:09:40,608 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=658093.3333333334, ans=0.125 2023-10-11 09:09:50,710 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=658140.0, ans=0.2 2023-10-11 09:09:51,407 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.322e+02 1.680e+02 1.845e+02 2.079e+02 3.675e+02, threshold=3.690e+02, percent-clipped=0.0 2023-10-11 09:09:54,704 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.75 vs. limit=15.0 2023-10-11 09:10:10,566 INFO [train.py:1031] (3/4) Epoch 11, batch 4500, loss[loss=0.2121, simple_loss=0.2745, pruned_loss=0.07489, over 12812.00 frames. ], tot_loss[loss=0.2045, simple_loss=0.2927, pruned_loss=0.05811, over 29383573.63 frames. ], batch size: 440, lr: 3.27e-03, grad_scale: 32.0 2023-10-11 09:10:16,178 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.23 vs. limit=15.0 2023-10-11 09:10:31,750 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=658326.6666666666, ans=0.0 2023-10-11 09:10:32,933 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.18 vs. limit=12.0 2023-10-11 09:10:36,390 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=658326.6666666666, ans=0.125 2023-10-11 09:10:54,027 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=658420.0, ans=0.0 2023-10-11 09:11:11,392 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.30 vs. limit=10.0 2023-10-11 09:11:40,995 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.400e+02 1.685e+02 1.846e+02 2.006e+02 3.057e+02, threshold=3.693e+02, percent-clipped=0.0 2023-10-11 09:11:43,203 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=658606.6666666666, ans=0.2 2023-10-11 09:12:12,552 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.04 vs. 
limit=15.0 2023-10-11 09:12:32,983 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=658840.0, ans=0.125 2023-10-11 09:13:32,969 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.37 vs. limit=22.5 2023-10-11 09:13:34,327 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.380e+02 1.738e+02 1.927e+02 2.249e+02 3.392e+02, threshold=3.854e+02, percent-clipped=0.0 2023-10-11 09:13:40,949 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=659120.0, ans=0.125 2023-10-11 09:13:46,540 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=659120.0, ans=0.2 2023-10-11 09:13:54,061 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=659166.6666666666, ans=0.0 2023-10-11 09:14:02,311 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=659213.3333333334, ans=0.125 2023-10-11 09:14:10,549 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=659213.3333333334, ans=0.05 2023-10-11 09:14:25,060 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=659260.0, ans=0.125 2023-10-11 09:14:25,086 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=659260.0, ans=0.2 2023-10-11 09:14:58,979 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=659400.0, ans=0.125 2023-10-11 09:15:18,672 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.30 vs. limit=15.0 2023-10-11 09:15:22,352 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.82 vs. limit=6.0 2023-10-11 09:15:24,907 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=659540.0, ans=0.125 2023-10-11 09:15:25,725 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.401e+02 1.692e+02 1.899e+02 2.139e+02 3.137e+02, threshold=3.798e+02, percent-clipped=0.0 2023-10-11 09:15:30,683 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=659540.0, ans=0.0 2023-10-11 09:15:32,679 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 09:15:38,897 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.52 vs. limit=6.0 2023-10-11 09:15:40,590 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.04 vs. 
limit=15.0 2023-10-11 09:15:58,315 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=659680.0, ans=0.1 2023-10-11 09:16:21,524 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=659726.6666666666, ans=0.0 2023-10-11 09:16:26,857 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=659773.3333333334, ans=0.0 2023-10-11 09:16:34,135 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=659773.3333333334, ans=0.125 2023-10-11 09:16:54,784 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.81 vs. limit=15.0 2023-10-11 09:17:02,601 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=659913.3333333334, ans=0.1 2023-10-11 09:17:17,574 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=660006.6666666666, ans=0.1 2023-10-11 09:17:19,994 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=660006.6666666666, ans=0.125 2023-10-11 09:17:21,558 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.347e+02 1.737e+02 1.877e+02 2.122e+02 2.625e+02, threshold=3.753e+02, percent-clipped=0.0 2023-10-11 09:17:43,871 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=660100.0, ans=0.125 2023-10-11 09:17:55,028 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=660146.6666666666, ans=0.125 2023-10-11 09:17:57,931 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=660146.6666666666, ans=0.2 2023-10-11 09:18:05,918 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=660193.3333333334, ans=0.0 2023-10-11 09:18:47,739 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=660333.3333333334, ans=0.0 2023-10-11 09:19:21,199 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.384e+02 1.740e+02 1.897e+02 2.093e+02 2.860e+02, threshold=3.794e+02, percent-clipped=0.0 2023-10-11 09:19:35,679 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=660520.0, ans=0.0 2023-10-11 09:19:39,527 INFO [train.py:1031] (3/4) Epoch 11, batch 5000, loss[loss=0.1849, simple_loss=0.2749, pruned_loss=0.04746, over 16925.00 frames. ], tot_loss[loss=0.2046, simple_loss=0.2924, pruned_loss=0.05837, over 30122479.27 frames. 
], batch size: 77, lr: 3.27e-03, grad_scale: 32.0 2023-10-11 09:19:42,456 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=660566.6666666666, ans=0.125 2023-10-11 09:20:05,850 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=660660.0, ans=0.125 2023-10-11 09:20:35,533 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.86 vs. limit=15.0 2023-10-11 09:20:45,871 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.77 vs. limit=15.0 2023-10-11 09:20:53,794 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=660846.6666666666, ans=0.035 2023-10-11 09:20:54,836 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=660893.3333333334, ans=0.125 2023-10-11 09:21:11,345 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.395e+02 1.703e+02 1.873e+02 2.084e+02 3.090e+02, threshold=3.745e+02, percent-clipped=0.0 2023-10-11 09:21:37,132 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=661033.3333333334, ans=0.125 2023-10-11 09:21:45,796 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=661080.0, ans=0.125 2023-10-11 09:21:51,756 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=661080.0, ans=0.0 2023-10-11 09:21:56,344 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=661126.6666666666, ans=0.125 2023-10-11 09:22:02,245 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.97 vs. limit=15.0 2023-10-11 09:22:25,649 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=661220.0, ans=15.0 2023-10-11 09:22:32,169 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=661266.6666666666, ans=0.0 2023-10-11 09:22:33,205 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=661266.6666666666, ans=0.125 2023-10-11 09:22:35,020 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.28 vs. 
limit=15.0 2023-10-11 09:22:39,573 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=661313.3333333334, ans=0.125 2023-10-11 09:22:59,504 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=661406.6666666666, ans=0.125 2023-10-11 09:23:00,495 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=661406.6666666666, ans=0.2 2023-10-11 09:23:03,815 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.357e+02 1.747e+02 1.923e+02 2.253e+02 3.251e+02, threshold=3.846e+02, percent-clipped=0.0 2023-10-11 09:23:12,633 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.95 vs. limit=15.0 2023-10-11 09:23:15,370 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.85 vs. limit=15.0 2023-10-11 09:23:20,664 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.90 vs. limit=15.0 2023-10-11 09:23:26,911 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=661500.0, ans=0.125 2023-10-11 09:23:32,283 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.28 vs. limit=15.0 2023-10-11 09:23:34,946 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.58 vs. limit=15.0 2023-10-11 09:23:38,007 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=661546.6666666666, ans=0.125 2023-10-11 09:23:45,141 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=661593.3333333334, ans=0.04949747468305833 2023-10-11 09:23:55,301 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.94 vs. limit=15.0 2023-10-11 09:24:10,197 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=661686.6666666666, ans=0.0 2023-10-11 09:24:12,705 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=661686.6666666666, ans=0.125 2023-10-11 09:24:17,380 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=661733.3333333334, ans=0.2 2023-10-11 09:24:21,297 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=661733.3333333334, ans=0.125 2023-10-11 09:24:22,838 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.36 vs. 
limit=15.0 2023-10-11 09:24:45,714 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=661826.6666666666, ans=0.125 2023-10-11 09:24:53,346 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=661873.3333333334, ans=0.125 2023-10-11 09:24:53,347 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=661873.3333333334, ans=0.1 2023-10-11 09:24:54,817 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.378e+02 1.670e+02 1.796e+02 1.975e+02 2.752e+02, threshold=3.592e+02, percent-clipped=0.0 2023-10-11 09:24:57,775 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=661873.3333333334, ans=0.2 2023-10-11 09:26:08,355 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=662153.3333333334, ans=0.125 2023-10-11 09:26:31,092 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=662246.6666666666, ans=0.125 2023-10-11 09:26:36,441 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 09:26:42,027 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=662293.3333333334, ans=0.1 2023-10-11 09:26:42,030 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=662293.3333333334, ans=0.0 2023-10-11 09:26:47,888 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.402e+02 1.694e+02 1.904e+02 2.332e+02 3.854e+02, threshold=3.808e+02, percent-clipped=1.0 2023-10-11 09:26:59,641 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=662386.6666666666, ans=0.5 2023-10-11 09:27:00,547 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=662386.6666666666, ans=0.0 2023-10-11 09:27:07,169 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.30 vs. limit=6.0 2023-10-11 09:27:11,274 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.04 vs. limit=10.0 2023-10-11 09:27:12,903 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=662433.3333333334, ans=0.0 2023-10-11 09:27:14,490 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.19 vs. limit=10.0 2023-10-11 09:27:17,138 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=662480.0, ans=0.0 2023-10-11 09:27:32,400 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=662526.6666666666, ans=0.0 2023-10-11 09:28:24,670 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.73 vs. 
limit=22.5 2023-10-11 09:28:32,519 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=662806.6666666666, ans=0.0 2023-10-11 09:28:34,031 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 09:28:35,915 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=662806.6666666666, ans=0.0 2023-10-11 09:28:37,474 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.386e+02 1.718e+02 1.902e+02 2.149e+02 3.083e+02, threshold=3.805e+02, percent-clipped=0.0 2023-10-11 09:28:39,834 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.53 vs. limit=15.0 2023-10-11 09:28:54,010 INFO [train.py:1031] (3/4) Epoch 11, batch 5500, loss[loss=0.1884, simple_loss=0.2861, pruned_loss=0.04533, over 16907.00 frames. ], tot_loss[loss=0.2041, simple_loss=0.2921, pruned_loss=0.05808, over 30726627.51 frames. ], batch size: 165, lr: 3.26e-03, grad_scale: 32.0 2023-10-11 09:29:07,888 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.16 vs. limit=15.0 2023-10-11 09:29:26,363 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.71 vs. limit=10.0 2023-10-11 09:29:37,891 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=663086.6666666666, ans=0.0 2023-10-11 09:29:37,900 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=663086.6666666666, ans=0.1 2023-10-11 09:29:42,529 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=663086.6666666666, ans=0.1 2023-10-11 09:29:46,453 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=663133.3333333334, ans=0.0 2023-10-11 09:30:04,054 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=663180.0, ans=0.125 2023-10-11 09:30:07,221 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.94 vs. limit=15.0 2023-10-11 09:30:11,999 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 09:30:20,385 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=663273.3333333334, ans=0.0 2023-10-11 09:30:20,545 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=663273.3333333334, ans=0.125 2023-10-11 09:30:22,136 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.385e+02 1.708e+02 1.946e+02 2.384e+02 4.404e+02, threshold=3.893e+02, percent-clipped=1.0 2023-10-11 09:30:25,653 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.76 vs. 
limit=22.5 2023-10-11 09:30:28,662 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=663320.0, ans=0.125 2023-10-11 09:30:28,785 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=663320.0, ans=0.0 2023-10-11 09:30:41,873 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=663366.6666666666, ans=0.2 2023-10-11 09:30:46,374 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=663366.6666666666, ans=0.0 2023-10-11 09:30:49,337 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=663366.6666666666, ans=0.125 2023-10-11 09:30:53,407 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=663413.3333333334, ans=0.5 2023-10-11 09:31:03,115 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=663460.0, ans=0.0 2023-10-11 09:31:07,750 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=663460.0, ans=0.125 2023-10-11 09:31:22,801 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=663553.3333333334, ans=0.07 2023-10-11 09:31:25,153 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=663553.3333333334, ans=0.0 2023-10-11 09:31:38,967 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=663600.0, ans=0.125 2023-10-11 09:31:45,179 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=663646.6666666666, ans=0.0 2023-10-11 09:31:49,504 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=663646.6666666666, ans=0.0 2023-10-11 09:31:52,325 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=663646.6666666666, ans=0.125 2023-10-11 09:31:53,671 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=663646.6666666666, ans=0.125 2023-10-11 09:32:12,835 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.408e+02 1.744e+02 1.949e+02 2.169e+02 3.030e+02, threshold=3.897e+02, percent-clipped=0.0 2023-10-11 09:32:19,453 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=663786.6666666666, ans=0.125 2023-10-11 09:32:22,055 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=663786.6666666666, ans=0.0 2023-10-11 09:32:22,104 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=663786.6666666666, ans=0.2 2023-10-11 09:32:38,387 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=663833.3333333334, ans=0.125 2023-10-11 09:33:08,457 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=663973.3333333334, ans=0.125 
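
How to read the recurring scaling.py entries: each "ScheduledFloat: name=..., batch_count=..., ans=..." line reports the value (ans) that a schedule assigns to one named model hyperparameter (a skip rate, balancer probability, dropout probability, or bypass scale) at the current training position (batch_count). The sketch below shows the general idea of such a batch-count-driven schedule as piecewise-linear interpolation between breakpoints; it is a minimal illustrative stand-in written for this note, not the ScheduledFloat class from icefall's scaling.py, and its breakpoints are hypothetical.

import bisect

class ScheduledFloatSketch:
    """Minimal sketch of a batch-count-driven schedule (hypothetical;
    not icefall's ScheduledFloat). Holds (batch_count, value) breakpoints
    and linearly interpolates between them, clamping at both ends."""

    def __init__(self, *points):
        # points must be (batch_count, value) pairs sorted by batch_count
        self.xs = [float(x) for x, _ in points]
        self.ys = [float(y) for _, y in points]

    def value(self, batch_count):
        i = bisect.bisect_right(self.xs, batch_count)
        if i == 0:
            return self.ys[0]        # before the first breakpoint
        if i == len(self.xs):
            return self.ys[-1]       # after the last breakpoint
        x0, x1 = self.xs[i - 1], self.xs[i]
        y0, y1 = self.ys[i - 1], self.ys[i]
        # linear interpolation between the two surrounding breakpoints
        return y0 + (batch_count - x0) * (y1 - y0) / (x1 - x0)

# Hypothetical example: a dropout probability that decays from 0.3 to 0.1
# over the first 20000 batches, then stays at 0.1 (as in the dropout_p
# entries above, which have settled at ans=0.1 by batch_count ~6.6e5).
sched = ScheduledFloatSketch((0, 0.3), (20000, 0.1))
for bc in (0, 10000, 663973.3333333334):
    print(f"batch_count={bc}, ans={sched.value(bc):.4g}")

Under the same reading, a "Whitening: ... metric=M vs. limit=L" line compares a measured whitening metric against its currently scheduled limit (the whitening_limit entries above are themselves ScheduledFloat values), and the optim.py lines report quartiles of recent gradient norms alongside the derived clipping threshold and the percentage of batches clipped.
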
2023-10-11 09:33:09,419 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=663973.3333333334, ans=0.0 2023-10-11 09:33:15,920 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=663973.3333333334, ans=0.0 2023-10-11 09:33:15,973 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=663973.3333333334, ans=0.0 2023-10-11 09:33:25,212 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.92 vs. limit=12.0 2023-10-11 09:33:25,975 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.55 vs. limit=6.0 2023-10-11 09:33:32,985 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=664066.6666666666, ans=0.125 2023-10-11 09:33:49,384 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=664113.3333333334, ans=0.1 2023-10-11 09:33:54,596 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=664160.0, ans=0.0 2023-10-11 09:33:59,593 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=664160.0, ans=10.0 2023-10-11 09:34:08,054 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=664206.6666666666, ans=0.5 2023-10-11 09:34:09,191 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=664206.6666666666, ans=0.1 2023-10-11 09:34:09,703 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.423e+02 1.688e+02 1.867e+02 2.091e+02 3.041e+02, threshold=3.735e+02, percent-clipped=0.0 2023-10-11 09:34:59,069 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=664393.3333333334, ans=0.2 2023-10-11 09:35:09,680 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=664440.0, ans=0.0 2023-10-11 09:35:22,661 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=664486.6666666666, ans=0.0 2023-10-11 09:35:29,569 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=664533.3333333334, ans=0.1 2023-10-11 09:35:34,452 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=664533.3333333334, ans=0.0 2023-10-11 09:35:43,593 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=664580.0, ans=0.125 2023-10-11 09:36:05,325 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.465e+02 1.770e+02 2.003e+02 2.348e+02 3.357e+02, threshold=4.007e+02, percent-clipped=0.0 2023-10-11 09:36:07,489 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=664673.3333333334, ans=10.0 2023-10-11 09:36:17,444 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=664720.0, ans=0.0 2023-10-11 09:36:29,591 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=664766.6666666666, ans=0.2 2023-10-11 09:36:45,067 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=664813.3333333334, ans=0.1 2023-10-11 09:36:46,507 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=664813.3333333334, ans=0.0 2023-10-11 09:37:03,366 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=664906.6666666666, ans=0.0 2023-10-11 09:37:09,394 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=664953.3333333334, ans=0.125 2023-10-11 09:37:11,621 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.80 vs. limit=22.5 2023-10-11 09:37:54,888 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.375e+02 1.660e+02 1.851e+02 2.129e+02 3.305e+02, threshold=3.703e+02, percent-clipped=0.0 2023-10-11 09:37:58,602 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=665140.0, ans=0.125 2023-10-11 09:38:12,754 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=665233.3333333334, ans=0.0 2023-10-11 09:38:13,290 INFO [train.py:1031] (3/4) Epoch 11, batch 6000, loss[loss=0.2111, simple_loss=0.2992, pruned_loss=0.06151, over 16842.00 frames. ], tot_loss[loss=0.2048, simple_loss=0.2926, pruned_loss=0.05846, over 31202944.79 frames. ], batch size: 175, lr: 3.25e-03, grad_scale: 32.0 2023-10-11 09:38:13,708 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=665233.3333333334, ans=0.1 2023-10-11 09:38:36,779 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=665326.6666666666, ans=0.0 2023-10-11 09:38:44,098 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=665326.6666666666, ans=0.035 2023-10-11 09:38:58,708 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=665420.0, ans=0.0 2023-10-11 09:39:10,686 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=665466.6666666666, ans=0.0 2023-10-11 09:39:11,812 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.03 vs. 
limit=15.0 2023-10-11 09:39:15,616 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=665466.6666666666, ans=0.125 2023-10-11 09:39:31,041 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=665560.0, ans=0.1 2023-10-11 09:39:40,102 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=665560.0, ans=0.125 2023-10-11 09:39:46,151 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.391e+02 1.735e+02 1.878e+02 2.158e+02 3.132e+02, threshold=3.756e+02, percent-clipped=0.0 2023-10-11 09:39:58,898 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=665653.3333333334, ans=0.04949747468305833 2023-10-11 09:39:58,991 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 09:40:08,807 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.76 vs. limit=22.5 2023-10-11 09:40:44,184 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=665840.0, ans=0.0 2023-10-11 09:41:02,198 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=665933.3333333334, ans=0.0 2023-10-11 09:41:09,195 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=665980.0, ans=0.125 2023-10-11 09:41:35,756 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.378e+02 1.742e+02 1.917e+02 2.156e+02 2.972e+02, threshold=3.833e+02, percent-clipped=0.0 2023-10-11 09:41:42,220 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=666120.0, ans=0.125 2023-10-11 09:41:54,375 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.84 vs. limit=15.0 2023-10-11 09:42:19,108 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=666260.0, ans=0.0 2023-10-11 09:42:25,916 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.76 vs. 
limit=12.0 2023-10-11 09:42:41,402 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=666353.3333333334, ans=0.125 2023-10-11 09:42:50,509 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=666400.0, ans=0.125 2023-10-11 09:42:55,851 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=666400.0, ans=0.5 2023-10-11 09:43:30,297 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.339e+02 1.702e+02 1.919e+02 2.149e+02 3.434e+02, threshold=3.839e+02, percent-clipped=0.0 2023-10-11 09:43:45,011 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=666586.6666666666, ans=0.1 2023-10-11 09:44:06,341 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=666680.0, ans=0.0 2023-10-11 09:44:10,664 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=666726.6666666666, ans=0.2 2023-10-11 09:44:12,734 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=666726.6666666666, ans=0.1 2023-10-11 09:44:13,955 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.34 vs. limit=15.0 2023-10-11 09:44:16,166 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=666726.6666666666, ans=0.125 2023-10-11 09:44:21,791 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=666773.3333333334, ans=0.2 2023-10-11 09:44:29,372 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.97 vs. limit=15.0 2023-10-11 09:44:48,438 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=666866.6666666666, ans=0.2 2023-10-11 09:45:12,255 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=666960.0, ans=0.125 2023-10-11 09:45:19,638 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=666960.0, ans=0.0 2023-10-11 09:45:26,058 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=667006.6666666666, ans=0.015 2023-10-11 09:45:30,218 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.379e+02 1.696e+02 1.908e+02 2.183e+02 3.331e+02, threshold=3.815e+02, percent-clipped=0.0 2023-10-11 09:45:47,706 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=667100.0, ans=0.0 2023-10-11 09:46:33,571 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.74 vs. limit=12.0 2023-10-11 09:46:33,578 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.33 vs. 
limit=22.5
2023-10-11 09:46:34,756 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.75 vs. limit=6.0
2023-10-11 09:46:51,339 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=667380.0, ans=0.125
2023-10-11 09:47:08,144 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=667426.6666666666, ans=0.1
2023-10-11 09:47:24,264 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.447e+02 1.716e+02 1.900e+02 2.198e+02 3.623e+02, threshold=3.801e+02, percent-clipped=0.0
2023-10-11 09:47:42,488 INFO [train.py:1031] (3/4) Epoch 11, batch 6500, loss[loss=0.2256, simple_loss=0.3179, pruned_loss=0.06659, over 16849.00 frames. ], tot_loss[loss=0.205, simple_loss=0.293, pruned_loss=0.05851, over 31558282.58 frames. ], batch size: 146, lr: 3.25e-03, grad_scale: 32.0
2023-10-11 09:47:49,599 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=667566.6666666666, ans=0.1
2023-10-11 09:47:58,298 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.41 vs. limit=10.0
2023-10-11 09:47:59,906 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=667613.3333333334, ans=0.125
2023-10-11 09:48:17,042 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=667660.0, ans=0.125
2023-10-11 09:48:28,680 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=667706.6666666666, ans=0.125
2023-10-11 09:48:56,544 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=667800.0, ans=0.125
2023-10-11 09:49:04,056 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-11 09:49:14,826 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=667893.3333333334, ans=0.2
2023-10-11 09:49:25,157 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=667940.0, ans=0.2
2023-10-11 09:49:28,688 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=667940.0, ans=0.0
2023-10-11 09:49:31,247 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.382e+02 1.696e+02 1.869e+02 2.144e+02 2.746e+02, threshold=3.739e+02, percent-clipped=0.0
2023-10-11 09:49:41,160 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=667986.6666666666, ans=0.125
2023-10-11 09:49:42,097 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=667986.6666666666, ans=0.05
2023-10-11 09:49:49,054 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=668033.3333333334, ans=0.1
2023-10-11 09:50:14,155 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=668126.6666666666, ans=0.0
2023-10-11 09:50:14,181 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=668126.6666666666, ans=0.0
2023-10-11 09:50:15,158 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=668126.6666666666, ans=0.0
2023-10-11 09:50:21,700 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.84 vs. limit=15.0
2023-10-11 09:50:23,283 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=668173.3333333334, ans=0.125
2023-10-11 09:50:32,330 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=668220.0, ans=0.125
2023-10-11 09:50:41,113 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=668266.6666666666, ans=0.125
2023-10-11 09:50:49,365 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=668266.6666666666, ans=0.125
2023-10-11 09:51:20,505 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.379e+02 1.701e+02 1.831e+02 2.005e+02 2.995e+02, threshold=3.663e+02, percent-clipped=0.0
2023-10-11 09:51:27,419 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=668453.3333333334, ans=0.125
2023-10-11 09:51:29,438 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=668453.3333333334, ans=0.125
2023-10-11 09:51:38,240 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=668500.0, ans=0.015
2023-10-11 09:51:47,727 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=668546.6666666666, ans=10.0
2023-10-11 09:51:51,351 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=668546.6666666666, ans=0.0
2023-10-11 09:52:05,272 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.07 vs. limit=15.0
2023-10-11 09:52:08,662 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.09 vs. limit=22.5
2023-10-11 09:52:35,875 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=668733.3333333334, ans=0.125
2023-10-11 09:52:55,108 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-11 09:53:02,674 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=668826.6666666666, ans=0.1
2023-10-11 09:53:04,534 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=668826.6666666666, ans=0.125
2023-10-11 09:53:16,937 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.297e+02 1.743e+02 1.915e+02 2.188e+02 3.013e+02, threshold=3.830e+02, percent-clipped=0.0
2023-10-11 09:53:39,407 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=668966.6666666666, ans=0.0
2023-10-11 09:53:41,111 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.46 vs. limit=15.0
2023-10-11 09:53:49,057 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=668966.6666666666, ans=0.0
2023-10-11 09:54:16,969 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=669060.0, ans=0.1
2023-10-11 09:54:28,640 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-11 09:54:45,486 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=669200.0, ans=0.125
2023-10-11 09:54:54,945 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=669246.6666666666, ans=0.0
2023-10-11 09:54:57,991 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=669246.6666666666, ans=0.125
2023-10-11 09:55:12,352 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=669293.3333333334, ans=0.0
2023-10-11 09:55:23,934 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.347e+02 1.633e+02 1.765e+02 2.073e+02 2.917e+02, threshold=3.530e+02, percent-clipped=0.0
2023-10-11 09:55:25,620 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=669340.0, ans=0.0
2023-10-11 09:55:34,047 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=669386.6666666666, ans=0.1
2023-10-11 09:55:39,362 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=669386.6666666666, ans=0.125
2023-10-11 09:55:52,936 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=669480.0, ans=0.0
2023-10-11 09:56:16,236 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.80 vs. limit=15.0
2023-10-11 09:56:18,986 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=669573.3333333334, ans=0.0
2023-10-11 09:56:29,866 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=669620.0, ans=0.0
2023-10-11 09:56:30,718 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=669620.0, ans=0.09899494936611666
2023-10-11 09:56:32,620 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=669620.0, ans=0.125
2023-10-11 09:56:32,656 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=669620.0, ans=0.125
2023-10-11 09:57:02,662 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.31 vs. limit=6.0
2023-10-11 09:57:14,185 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.415e+02 1.745e+02 2.058e+02 2.383e+02 3.136e+02, threshold=4.116e+02, percent-clipped=0.0
2023-10-11 09:57:20,089 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=12.13 vs. limit=22.5
2023-10-11 09:57:21,475 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=669853.3333333334, ans=0.125
2023-10-11 09:57:22,562 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=669853.3333333334, ans=0.125
2023-10-11 09:57:28,102 INFO [train.py:1031] (3/4) Epoch 11, batch 7000, loss[loss=0.2216, simple_loss=0.3051, pruned_loss=0.06908, over 16811.00 frames. ], tot_loss[loss=0.2048, simple_loss=0.2931, pruned_loss=0.05827, over 31818351.86 frames. ], batch size: 188, lr: 3.24e-03, grad_scale: 16.0
2023-10-11 09:57:37,070 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=669900.0, ans=0.125
2023-10-11 09:57:55,143 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=669993.3333333334, ans=0.125
2023-10-11 09:58:00,784 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=669993.3333333334, ans=0.2
2023-10-11 09:58:13,687 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=670086.6666666666, ans=0.0
2023-10-11 09:58:17,483 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=670086.6666666666, ans=0.0
2023-10-11 09:58:41,573 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.59 vs. limit=22.5
2023-10-11 09:58:52,860 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.34 vs. limit=15.0
2023-10-11 09:59:03,032 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.421e+02 1.661e+02 1.822e+02 2.011e+02 2.684e+02, threshold=3.644e+02, percent-clipped=0.0
2023-10-11 09:59:11,279 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=2.542e-03
2023-10-11 09:59:11,493 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.15 vs. limit=15.0
2023-10-11 09:59:22,354 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.42 vs. limit=15.0
2023-10-11 09:59:27,907 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.88 vs. limit=22.5
2023-10-11 09:59:50,863 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.48 vs. limit=10.0
2023-10-11 10:00:24,712 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=670646.6666666666, ans=0.125
2023-10-11 10:00:33,875 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=670693.3333333334, ans=0.07
2023-10-11 10:00:44,011 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=670740.0, ans=0.125
2023-10-11 10:00:49,627 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=670740.0, ans=0.09899494936611666
2023-10-11 10:00:51,483 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.395e+02 1.700e+02 1.824e+02 1.999e+02 2.782e+02, threshold=3.649e+02, percent-clipped=0.0
2023-10-11 10:00:58,963 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=670786.6666666666, ans=0.125
2023-10-11 10:01:47,478 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=670926.6666666666, ans=0.125
2023-10-11 10:02:04,597 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.05 vs. limit=6.0
2023-10-11 10:02:08,808 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=671020.0, ans=0.125
2023-10-11 10:02:09,077 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.71 vs. limit=22.5
2023-10-11 10:02:12,266 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.57 vs. limit=15.0
2023-10-11 10:02:22,175 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=671066.6666666666, ans=0.2
2023-10-11 10:02:35,757 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=671113.3333333334, ans=0.2
2023-10-11 10:02:57,646 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.31 vs. limit=12.0
2023-10-11 10:03:02,228 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.429e+02 1.704e+02 1.836e+02 2.086e+02 3.425e+02, threshold=3.672e+02, percent-clipped=0.0
2023-10-11 10:03:22,164 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=671300.0, ans=0.0
2023-10-11 10:03:25,393 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=671300.0, ans=0.2
2023-10-11 10:03:34,155 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=671346.6666666666, ans=0.04949747468305833
2023-10-11 10:03:38,168 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=671393.3333333334, ans=0.125
2023-10-11 10:03:47,038 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.20 vs. limit=15.0
2023-10-11 10:03:56,805 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=671440.0, ans=0.125
2023-10-11 10:04:09,781 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=671486.6666666666, ans=0.0
2023-10-11 10:04:13,580 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=671486.6666666666, ans=0.07
2023-10-11 10:04:26,953 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=671533.3333333334, ans=0.125
2023-10-11 10:04:33,483 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=671580.0, ans=0.1
2023-10-11 10:04:36,770 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=671580.0, ans=0.125
2023-10-11 10:04:40,624 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=671626.6666666666, ans=0.1
2023-10-11 10:04:55,644 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=671673.3333333334, ans=0.0
2023-10-11 10:04:59,817 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.383e+02 1.705e+02 1.901e+02 2.252e+02 3.323e+02, threshold=3.803e+02, percent-clipped=0.0
2023-10-11 10:05:01,698 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=671720.0, ans=0.0
2023-10-11 10:05:09,315 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.82 vs. limit=22.5
2023-10-11 10:05:47,974 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.31 vs. limit=12.0
2023-10-11 10:05:48,378 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=671906.6666666666, ans=0.1
2023-10-11 10:05:52,671 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=671906.6666666666, ans=0.04949747468305833
2023-10-11 10:05:55,432 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=671906.6666666666, ans=0.1
2023-10-11 10:06:02,091 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=671953.3333333334, ans=0.0
2023-10-11 10:06:04,002 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=671953.3333333334, ans=0.125
2023-10-11 10:06:12,761 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=672000.0, ans=0.09899494936611666
2023-10-11 10:06:23,606 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=672046.6666666666, ans=0.125
2023-10-11 10:06:44,371 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.90 vs. limit=15.0
2023-10-11 10:06:51,434 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.451e+02 1.752e+02 2.003e+02 2.365e+02 3.601e+02, threshold=4.006e+02, percent-clipped=0.0
2023-10-11 10:07:06,833 INFO [train.py:1031] (3/4) Epoch 11, batch 7500, loss[loss=0.1936, simple_loss=0.2839, pruned_loss=0.05168, over 16864.00 frames. ], tot_loss[loss=0.2048, simple_loss=0.293, pruned_loss=0.05832, over 32048410.20 frames. ], batch size: 146, lr: 3.24e-03, grad_scale: 16.0
2023-10-11 10:07:07,121 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=672233.3333333334, ans=0.125
2023-10-11 10:07:09,027 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=672233.3333333334, ans=0.0
2023-10-11 10:07:09,832 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=672233.3333333334, ans=0.5
2023-10-11 10:07:22,682 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=672280.0, ans=0.1
2023-10-11 10:07:26,607 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=672280.0, ans=0.1
2023-10-11 10:07:58,612 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=672420.0, ans=0.2
2023-10-11 10:08:00,182 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=672420.0, ans=0.0
2023-10-11 10:08:07,453 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=10.62 vs. limit=15.0
2023-10-11 10:08:25,306 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=672560.0, ans=0.09899494936611666
2023-10-11 10:08:26,356 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=672560.0, ans=0.125
2023-10-11 10:08:27,391 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=672560.0, ans=0.1
2023-10-11 10:08:36,452 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=672606.6666666666, ans=0.125
2023-10-11 10:08:41,573 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=672606.6666666666, ans=0.125
2023-10-11 10:08:45,354 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.333e+02 1.734e+02 1.910e+02 2.143e+02 2.869e+02, threshold=3.820e+02, percent-clipped=0.0
2023-10-11 10:09:01,581 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=672700.0, ans=0.125
2023-10-11 10:09:26,325 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=672793.3333333334, ans=0.125
2023-10-11 10:10:04,957 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=672933.3333333334, ans=0.125
2023-10-11 10:10:09,235 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=672933.3333333334, ans=0.0
2023-10-11 10:10:21,993 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=672980.0, ans=0.2
2023-10-11 10:10:48,793 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.328e+02 1.632e+02 1.838e+02 2.146e+02 3.121e+02, threshold=3.676e+02, percent-clipped=0.0
2023-10-11 10:11:07,059 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff3.min_abs, batch_count=673166.6666666666, ans=0.2
2023-10-11 10:11:09,798 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=673166.6666666666, ans=0.1
2023-10-11 10:11:14,903 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.34 vs. limit=15.0
2023-10-11 10:11:26,432 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=673260.0, ans=0.125
2023-10-11 10:11:31,719 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=673260.0, ans=0.0
2023-10-11 10:11:33,342 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=673260.0, ans=0.0
2023-10-11 10:11:59,018 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=673400.0, ans=0.125
2023-10-11 10:11:59,938 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=673400.0, ans=0.125
2023-10-11 10:12:08,956 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=673446.6666666666, ans=0.125
2023-10-11 10:12:16,572 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=673493.3333333334, ans=0.125
2023-10-11 10:12:22,801 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=673493.3333333334, ans=0.0
2023-10-11 10:12:35,347 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.396e+02 1.684e+02 1.863e+02 2.093e+02 3.310e+02, threshold=3.727e+02, percent-clipped=0.0
2023-10-11 10:12:37,643 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.92 vs. limit=15.0
2023-10-11 10:12:42,188 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.87 vs. limit=15.0
2023-10-11 10:12:48,686 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=673633.3333333334, ans=0.1
2023-10-11 10:13:05,727 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.13 vs. limit=22.5
2023-10-11 10:13:07,118 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=673680.0, ans=0.0
2023-10-11 10:13:17,113 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=673726.6666666666, ans=0.0
2023-10-11 10:13:30,555 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=673773.3333333334, ans=0.2
2023-10-11 10:13:52,556 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.36 vs. limit=15.0
2023-10-11 10:14:17,429 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=673960.0, ans=0.015
2023-10-11 10:14:24,482 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.85 vs. limit=15.0
2023-10-11 10:14:30,829 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.314e+02 1.695e+02 1.922e+02 2.072e+02 2.924e+02, threshold=3.845e+02, percent-clipped=0.0
2023-10-11 10:15:05,687 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=674146.6666666666, ans=0.0
2023-10-11 10:15:13,217 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=674193.3333333334, ans=0.0
2023-10-11 10:15:19,809 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=674240.0, ans=0.0
2023-10-11 10:15:27,274 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=674240.0, ans=0.125
2023-10-11 10:15:43,983 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=674333.3333333334, ans=0.125
2023-10-11 10:15:45,446 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=674333.3333333334, ans=0.125
2023-10-11 10:16:01,818 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=11.40 vs. limit=22.5
2023-10-11 10:16:28,764 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.384e+02 1.591e+02 1.735e+02 1.970e+02 2.940e+02, threshold=3.470e+02, percent-clipped=0.0
2023-10-11 10:16:38,097 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff3.min_abs, batch_count=674520.0, ans=0.2
2023-10-11 10:16:39,888 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=674566.6666666666, ans=0.2
2023-10-11 10:16:40,546 INFO [train.py:1031] (3/4) Epoch 11, batch 8000, loss[loss=0.1922, simple_loss=0.2881, pruned_loss=0.04813, over 16819.00 frames. ], tot_loss[loss=0.2036, simple_loss=0.2923, pruned_loss=0.05752, over 32247054.49 frames. ], batch size: 146, lr: 3.23e-03, grad_scale: 32.0
2023-10-11 10:16:40,967 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=674566.6666666666, ans=0.125
2023-10-11 10:16:58,783 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=674613.3333333334, ans=0.0
2023-10-11 10:17:02,697 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=674660.0, ans=0.125
2023-10-11 10:17:12,683 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.05 vs. limit=22.5
2023-10-11 10:17:25,547 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=674753.3333333334, ans=15.0
2023-10-11 10:17:35,200 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.11 vs. limit=15.0
2023-10-11 10:17:36,289 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=674800.0, ans=0.125
2023-10-11 10:18:13,947 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.266e+02 1.642e+02 1.794e+02 2.027e+02 3.386e+02, threshold=3.588e+02, percent-clipped=0.0
2023-10-11 10:18:18,823 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=674986.6666666666, ans=0.0
2023-10-11 10:18:28,627 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=675033.3333333334, ans=0.1
2023-10-11 10:18:34,272 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=675033.3333333334, ans=0.2
2023-10-11 10:18:45,071 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=675080.0, ans=0.125
2023-10-11 10:18:45,424 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=12.08 vs. limit=22.5
2023-10-11 10:18:50,607 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=675126.6666666666, ans=0.125
2023-10-11 10:18:52,410 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=675126.6666666666, ans=0.0
2023-10-11 10:19:02,613 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=675173.3333333334, ans=0.125
2023-10-11 10:19:06,469 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=675173.3333333334, ans=0.125
2023-10-11 10:19:16,623 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=675220.0, ans=0.2
2023-10-11 10:19:18,723 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=675220.0, ans=0.125
2023-10-11 10:20:22,098 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.431e+02 1.697e+02 1.814e+02 2.042e+02 2.908e+02, threshold=3.629e+02, percent-clipped=0.0
2023-10-11 10:20:36,486 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=675500.0, ans=0.0
2023-10-11 10:20:37,558 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=675500.0, ans=0.0
2023-10-11 10:20:54,459 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=675546.6666666666, ans=0.125
2023-10-11 10:21:18,210 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.20 vs. limit=22.5
2023-10-11 10:21:23,192 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.81 vs. limit=15.0
2023-10-11 10:21:25,058 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.33 vs. limit=6.0
2023-10-11 10:21:41,145 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=675733.3333333334, ans=0.125
2023-10-11 10:21:48,355 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=675780.0, ans=0.125
2023-10-11 10:22:00,558 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=675826.6666666666, ans=0.125
2023-10-11 10:22:00,608 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=675826.6666666666, ans=0.125
2023-10-11 10:22:10,734 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.78 vs. limit=15.0
2023-10-11 10:22:14,657 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.314e+02 1.687e+02 1.833e+02 2.058e+02 3.511e+02, threshold=3.666e+02, percent-clipped=0.0
2023-10-11 10:22:26,968 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=675920.0, ans=0.125
2023-10-11 10:22:28,745 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=675920.0, ans=0.125
2023-10-11 10:22:55,818 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=676060.0, ans=0.125
2023-10-11 10:23:04,998 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=676106.6666666666, ans=0.025
2023-10-11 10:23:09,949 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=676106.6666666666, ans=0.0
2023-10-11 10:23:20,604 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.32 vs. limit=15.0
2023-10-11 10:23:34,964 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=676200.0, ans=0.125
2023-10-11 10:23:50,088 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.93 vs. limit=15.0
2023-10-11 10:23:59,102 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=676340.0, ans=0.0
2023-10-11 10:24:05,280 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=676340.0, ans=0.0
2023-10-11 10:24:06,846 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.419e+02 1.681e+02 1.924e+02 2.224e+02 4.002e+02, threshold=3.848e+02, percent-clipped=3.0
2023-10-11 10:24:52,385 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=676526.6666666666, ans=0.1
2023-10-11 10:25:02,210 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=676573.3333333334, ans=0.0
2023-10-11 10:25:06,701 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.24 vs. limit=15.0
2023-10-11 10:25:18,533 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=676620.0, ans=0.1
2023-10-11 10:25:20,562 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=676666.6666666666, ans=0.0
2023-10-11 10:25:33,246 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.43 vs. limit=22.5
2023-10-11 10:25:34,057 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=676713.3333333334, ans=0.2
2023-10-11 10:25:40,874 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=676760.0, ans=0.125
2023-10-11 10:25:46,408 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.99 vs. limit=22.5
2023-10-11 10:25:48,854 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=676760.0, ans=0.125
2023-10-11 10:26:02,107 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.404e+02 1.669e+02 1.852e+02 2.064e+02 2.874e+02, threshold=3.704e+02, percent-clipped=0.0
2023-10-11 10:26:09,306 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.19 vs. limit=6.0
2023-10-11 10:26:10,363 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=676853.3333333334, ans=0.1
2023-10-11 10:26:19,511 INFO [train.py:1031] (3/4) Epoch 11, batch 8500, loss[loss=0.1734, simple_loss=0.2701, pruned_loss=0.03834, over 16831.00 frames. ], tot_loss[loss=0.2036, simple_loss=0.2925, pruned_loss=0.05737, over 32384436.80 frames. ], batch size: 98, lr: 3.23e-03, grad_scale: 32.0
2023-10-11 10:26:27,546 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=676900.0, ans=0.125
2023-10-11 10:26:32,984 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=676946.6666666666, ans=0.0
2023-10-11 10:26:36,066 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=676946.6666666666, ans=0.2
2023-10-11 10:26:40,351 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=676993.3333333334, ans=0.0
2023-10-11 10:26:58,905 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=677040.0, ans=0.125
2023-10-11 10:27:26,605 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=677180.0, ans=0.0
2023-10-11 10:27:59,198 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.441e+02 1.725e+02 1.888e+02 2.091e+02 2.656e+02, threshold=3.776e+02, percent-clipped=0.0
2023-10-11 10:28:07,413 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=677320.0, ans=0.125
2023-10-11 10:28:17,553 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=677366.6666666666, ans=0.125
2023-10-11 10:28:18,976 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=677366.6666666666, ans=0.125
2023-10-11 10:28:25,062 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=677366.6666666666, ans=0.125
2023-10-11 10:28:35,648 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=677413.3333333334, ans=0.0
2023-10-11 10:28:46,525 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=677460.0, ans=0.125
2023-10-11 10:28:52,657 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=677506.6666666666, ans=0.0
2023-10-11 10:28:58,360 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=677506.6666666666, ans=0.2
2023-10-11 10:29:10,011 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=677553.3333333334, ans=0.2
2023-10-11 10:29:15,967 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.92 vs. limit=22.5
2023-10-11 10:29:32,549 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=677646.6666666666, ans=0.0
2023-10-11 10:29:40,770 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.97 vs. limit=15.0
2023-10-11 10:29:42,175 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.14 vs. limit=6.0
2023-10-11 10:29:54,558 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.44 vs. limit=22.5
2023-10-11 10:30:03,220 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.312e+02 1.607e+02 1.790e+02 2.006e+02 2.754e+02, threshold=3.579e+02, percent-clipped=0.0
2023-10-11 10:30:04,719 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=677786.6666666666, ans=0.0
2023-10-11 10:30:08,613 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=677786.6666666666, ans=0.1
2023-10-11 10:30:13,091 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=677786.6666666666, ans=0.125
2023-10-11 10:30:14,061 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=677786.6666666666, ans=0.125
2023-10-11 10:30:27,802 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=677833.3333333334, ans=10.0
2023-10-11 10:30:45,139 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=677926.6666666666, ans=0.2
2023-10-11 10:30:52,843 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=677973.3333333334, ans=0.0
2023-10-11 10:31:30,167 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=678113.3333333334, ans=0.125
2023-10-11 10:31:43,352 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=678160.0, ans=0.125
2023-10-11 10:31:46,502 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.19 vs. limit=12.0
2023-10-11 10:32:02,499 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.312e+02 1.546e+02 1.712e+02 1.966e+02 3.960e+02, threshold=3.425e+02, percent-clipped=1.0
2023-10-11 10:32:05,823 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=678253.3333333334, ans=0.1
2023-10-11 10:32:09,753 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=678253.3333333334, ans=0.0
2023-10-11 10:32:27,533 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=678346.6666666666, ans=0.0
2023-10-11 10:32:28,608 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=678346.6666666666, ans=0.125
2023-10-11 10:32:38,965 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=678393.3333333334, ans=0.1
2023-10-11 10:33:05,552 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=678486.6666666666, ans=0.0
2023-10-11 10:33:08,763 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=678533.3333333334, ans=0.0
2023-10-11 10:33:23,182 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=678580.0, ans=0.5
2023-10-11 10:33:25,124 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.62 vs. limit=10.0
2023-10-11 10:33:28,885 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=678626.6666666666, ans=0.0
2023-10-11 10:33:44,093 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=678673.3333333334, ans=0.125
2023-10-11 10:33:44,128 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=678673.3333333334, ans=0.125
2023-10-11 10:33:48,406 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=678673.3333333334, ans=0.2
2023-10-11 10:33:49,008 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.346e+02 1.624e+02 1.784e+02 2.011e+02 2.558e+02, threshold=3.568e+02, percent-clipped=0.0
2023-10-11 10:33:52,279 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=678720.0, ans=0.125
2023-10-11 10:34:20,277 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=678813.3333333334, ans=0.0
2023-10-11 10:34:22,255 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=678813.3333333334, ans=0.2
2023-10-11 10:34:40,036 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=678906.6666666666, ans=0.0
2023-10-11 10:34:50,510 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=678953.3333333334, ans=0.035
2023-10-11 10:34:55,600 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=678953.3333333334, ans=0.125
2023-10-11 10:35:04,006 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=679000.0, ans=0.0
2023-10-11 10:35:40,850 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.357e+02 1.726e+02 1.944e+02 2.100e+02 2.887e+02, threshold=3.888e+02, percent-clipped=0.0
2023-10-11 10:35:50,295 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.82 vs. limit=15.0
2023-10-11 10:35:53,883 INFO [train.py:1031] (3/4) Epoch 11, batch 9000, loss[loss=0.201, simple_loss=0.2915, pruned_loss=0.05525, over 16591.00 frames. ], tot_loss[loss=0.203, simple_loss=0.2919, pruned_loss=0.05711, over 32499042.27 frames. ], batch size: 56, lr: 3.22e-03, grad_scale: 32.0
2023-10-11 10:36:02,658 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.96 vs. limit=15.0
2023-10-11 10:36:45,261 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=679420.0, ans=0.125
2023-10-11 10:36:54,918 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.20 vs. limit=12.0
2023-10-11 10:37:03,859 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.93 vs. limit=22.5
2023-10-11 10:37:28,602 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.381e+02 1.595e+02 1.750e+02 2.007e+02 2.919e+02, threshold=3.500e+02, percent-clipped=0.0
2023-10-11 10:37:35,477 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.87 vs. limit=22.5
2023-10-11 10:37:51,464 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=679700.0, ans=0.0
2023-10-11 10:38:12,868 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=679793.3333333334, ans=0.09899494936611666
2023-10-11 10:38:36,409 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=679933.3333333334, ans=0.0
2023-10-11 10:38:56,590 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=680026.6666666666, ans=0.1
2023-10-11 10:39:08,320 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-11 10:39:11,769 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=680073.3333333334, ans=0.0
2023-10-11 10:39:15,739 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.496e+02 1.731e+02 1.942e+02 2.109e+02 3.154e+02, threshold=3.884e+02, percent-clipped=0.0
2023-10-11 10:39:22,438 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.36 vs. limit=15.0
2023-10-11 10:39:26,024 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=680120.0, ans=0.0
2023-10-11 10:39:35,469 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=680166.6666666666, ans=0.125
2023-10-11 10:39:39,888 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=680213.3333333334, ans=0.2
2023-10-11 10:39:44,165 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=680213.3333333334, ans=0.2
2023-10-11 10:39:46,375 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.43 vs. limit=6.0
2023-10-11 10:40:19,478 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=680400.0, ans=0.2
2023-10-11 10:40:19,712 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.53 vs. limit=10.0
2023-10-11 10:40:21,242 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.72 vs. limit=15.0
2023-10-11 10:40:35,465 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.96 vs. limit=22.5
2023-10-11 10:41:01,323 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.385e+02 1.729e+02 1.867e+02 2.056e+02 3.247e+02, threshold=3.733e+02, percent-clipped=0.0
2023-10-11 10:41:13,056 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.96 vs. limit=15.0
2023-10-11 10:41:13,629 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=680633.3333333334, ans=0.125
2023-10-11 10:41:25,001 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=680680.0, ans=0.0
2023-10-11 10:41:33,802 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.36 vs. limit=6.0
2023-10-11 10:41:50,209 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=680773.3333333334, ans=0.125
2023-10-11 10:41:55,548 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=680773.3333333334, ans=0.0
2023-10-11 10:41:57,081 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.42 vs. limit=15.0
2023-10-11 10:42:04,302 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=680820.0, ans=0.1
2023-10-11 10:42:42,138 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=680960.0, ans=0.125
2023-10-11 10:42:49,473 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=680960.0, ans=0.125
2023-10-11 10:42:51,592 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=680960.0, ans=0.0
2023-10-11 10:43:03,380 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.362e+02 1.724e+02 1.913e+02 2.140e+02 2.952e+02, threshold=3.827e+02, percent-clipped=0.0
2023-10-11 10:43:07,990 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.32 vs. limit=15.0
2023-10-11 10:43:13,208 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=681053.3333333334, ans=0.5
2023-10-11 10:43:18,285 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.15 vs. limit=10.0
2023-10-11 10:43:35,735 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=681146.6666666666, ans=0.0
2023-10-11 10:43:51,139 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=681240.0, ans=0.125
2023-10-11 10:44:05,402 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=681286.6666666666, ans=0.125
2023-10-11 10:44:06,774 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=681286.6666666666, ans=0.0
2023-10-11 10:44:09,961 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=681286.6666666666, ans=0.0
2023-10-11 10:44:09,965 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=681286.6666666666, ans=0.07
2023-10-11 10:44:10,676 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=681286.6666666666, ans=0.1
2023-10-11 10:44:30,727 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=681380.0, ans=0.04949747468305833
2023-10-11 10:44:35,399 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=681380.0, ans=0.125
2023-10-11 10:44:52,071 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.90 vs. limit=22.5
2023-10-11 10:45:04,487 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.358e+02 1.758e+02 1.909e+02 2.214e+02 3.477e+02, threshold=3.818e+02, percent-clipped=0.0
2023-10-11 10:45:05,739 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=681520.0, ans=0.2
2023-10-11 10:45:16,547 INFO [train.py:1031] (3/4) Epoch 11, batch 9500, loss[loss=0.2517, simple_loss=0.3229, pruned_loss=0.09022, over 15776.00 frames. ], tot_loss[loss=0.2039, simple_loss=0.2928, pruned_loss=0.05751, over 32586383.37 frames. ], batch size: 350, lr: 3.21e-03, grad_scale: 32.0
2023-10-11 10:45:42,834 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=681660.0, ans=0.0
2023-10-11 10:45:49,339 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=681660.0, ans=0.0
2023-10-11 10:46:02,369 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=681753.3333333334, ans=0.0
2023-10-11 10:46:05,596 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=681753.3333333334, ans=0.125
2023-10-11 10:46:09,093 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.76 vs. limit=6.0
2023-10-11 10:46:15,043 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=681800.0, ans=0.125
2023-10-11 10:46:23,472 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00
2023-10-11 10:46:27,236 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2023-10-11 10:46:48,877 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=681940.0, ans=0.125
2023-10-11 10:46:56,806 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.450e+02 1.727e+02 1.912e+02 2.111e+02 3.049e+02, threshold=3.824e+02, percent-clipped=0.0
2023-10-11 10:46:57,078 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=681986.6666666666, ans=0.125
2023-10-11 10:46:58,032 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=681986.6666666666, ans=0.125
2023-10-11 10:47:04,171 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=681986.6666666666, ans=0.0
2023-10-11 10:47:58,694 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=14.44 vs. limit=15.0
2023-10-11 10:48:07,313 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=682266.6666666666, ans=15.0
2023-10-11 10:48:24,928 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2023-10-11 10:48:28,389 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=682360.0, ans=0.0
2023-10-11 10:48:47,276 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=682406.6666666666, ans=0.1
2023-10-11 10:48:49,949 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.337e+02 1.657e+02 1.883e+02 2.135e+02 3.290e+02, threshold=3.766e+02, percent-clipped=0.0
2023-10-11 10:48:54,639 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.51 vs. limit=15.0
2023-10-11 10:49:33,552 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=682640.0, ans=0.125
2023-10-11 10:49:35,822 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=682640.0, ans=0.0
2023-10-11 10:50:02,280 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=682733.3333333334, ans=0.125
2023-10-11 10:50:07,656 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=682780.0, ans=0.125
2023-10-11 10:50:27,229 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=682826.6666666666, ans=0.2
2023-10-11 10:50:32,869 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=682873.3333333334, ans=0.0
2023-10-11 10:50:39,774 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=682873.3333333334, ans=0.125
2023-10-11 10:50:41,361 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.363e+02 1.613e+02 1.788e+02 2.044e+02 2.619e+02, threshold=3.575e+02, percent-clipped=0.0
2023-10-11 10:51:05,857 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=683013.3333333334, ans=0.125
2023-10-11 10:51:09,151 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.94 vs. limit=22.5
2023-10-11 10:51:12,882 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=683060.0, ans=0.125
2023-10-11 10:51:41,508 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=683153.3333333334, ans=0.0
2023-10-11 10:51:42,233 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=683153.3333333334, ans=0.125
2023-10-11 10:51:42,301 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=683153.3333333334, ans=0.125
2023-10-11 10:51:44,099 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=683153.3333333334, ans=0.125
2023-10-11 10:51:46,437 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=683200.0, ans=0.125
2023-10-11 10:51:58,998 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=683246.6666666666, ans=0.0
2023-10-11 10:52:34,159 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.386e+02 1.671e+02 1.881e+02 2.100e+02 2.917e+02, threshold=3.763e+02, percent-clipped=0.0
2023-10-11 10:52:36,828 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=683386.6666666666, ans=0.0
2023-10-11 10:52:42,301 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=683386.6666666666, ans=0.0
2023-10-11 10:52:42,343 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.max_positive, batch_count=683386.6666666666, ans=0.95
2023-10-11 10:53:03,602 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=683480.0, ans=0.125
2023-10-11 10:53:20,014 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=683573.3333333334, ans=0.2
2023-10-11 10:53:28,248 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=683620.0, ans=0.125
2023-10-11 10:54:00,160 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=683760.0, ans=0.0
2023-10-11 10:54:05,706 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=683760.0, ans=0.125
2023-10-11 10:54:15,538 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=683806.6666666666, ans=0.0
2023-10-11 10:54:20,525 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.399e+02 1.687e+02 1.856e+02 2.075e+02 3.186e+02, threshold=3.712e+02, percent-clipped=0.0
2023-10-11 10:54:23,743 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=683853.3333333334, ans=0.125
2023-10-11 10:54:25,853 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=683853.3333333334, ans=0.5
2023-10-11 10:54:26,723 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=683853.3333333334, ans=0.1
2023-10-11 10:54:28,415 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=683853.3333333334, ans=0.125
2023-10-11 10:54:31,109 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.80 vs. limit=15.0
2023-10-11 10:54:33,191 INFO [train.py:1031] (3/4) Epoch 11, batch 10000, loss[loss=0.2192, simple_loss=0.306, pruned_loss=0.0662, over 16819.00 frames. ], tot_loss[loss=0.2033, simple_loss=0.2921, pruned_loss=0.05722, over 32660323.86 frames. ], batch size: 146, lr: 3.21e-03, grad_scale: 32.0
2023-10-11 10:54:42,981 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.52 vs. limit=22.5
2023-10-11 10:54:50,558 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-11 10:54:55,374 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=683993.3333333334, ans=0.125
2023-10-11 10:54:55,669 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.74 vs. limit=15.0
2023-10-11 10:54:58,971 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=683993.3333333334, ans=0.125
2023-10-11 10:55:05,615 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=684040.0, ans=0.0
2023-10-11 10:55:08,498 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=684040.0, ans=0.2
2023-10-11 10:55:11,054 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=684040.0, ans=10.0
2023-10-11 10:55:22,184 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=684086.6666666666, ans=0.125
2023-10-11 10:55:32,642 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=684133.3333333334, ans=0.5
2023-10-11 10:55:42,085 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.62 vs. limit=22.5
2023-10-11 10:56:02,514 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.84 vs. limit=10.0
2023-10-11 10:56:11,833 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.21 vs. limit=15.0
2023-10-11 10:56:14,494 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.382e+02 1.767e+02 1.968e+02 2.230e+02 3.025e+02, threshold=3.935e+02, percent-clipped=0.0
2023-10-11 10:56:34,645 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-11 10:56:52,131 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=1.99 vs.
limit=15.0 2023-10-11 10:57:00,310 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=684460.0, ans=0.125 2023-10-11 10:57:13,831 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=684553.3333333334, ans=0.07 2023-10-11 10:57:20,268 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=684553.3333333334, ans=0.04949747468305833 2023-10-11 10:57:36,516 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=684646.6666666666, ans=0.0 2023-10-11 10:57:57,538 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.46 vs. limit=15.0 2023-10-11 10:58:03,228 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=684740.0, ans=0.1 2023-10-11 10:58:04,804 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.389e+02 1.810e+02 2.105e+02 2.494e+02 3.949e+02, threshold=4.211e+02, percent-clipped=1.0 2023-10-11 10:58:19,259 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=684833.3333333334, ans=0.0 2023-10-11 10:58:34,870 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=684880.0, ans=0.0 2023-10-11 10:58:36,215 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=684880.0, ans=0.125 2023-10-11 10:58:40,881 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 10:58:40,948 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=684880.0, ans=0.125 2023-10-11 10:59:07,578 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=684973.3333333334, ans=0.0 2023-10-11 10:59:25,956 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=685066.6666666666, ans=0.125 2023-10-11 10:59:28,784 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=685066.6666666666, ans=0.0 2023-10-11 10:59:32,749 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=685113.3333333334, ans=0.125 2023-10-11 10:59:39,361 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=685113.3333333334, ans=0.0 2023-10-11 10:59:49,119 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=685160.0, ans=10.0 2023-10-11 10:59:49,198 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=685160.0, ans=0.0 2023-10-11 11:00:07,603 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.53 vs. 
limit=15.0 2023-10-11 11:00:07,973 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.320e+02 1.625e+02 1.795e+02 2.091e+02 2.970e+02, threshold=3.591e+02, percent-clipped=0.0 2023-10-11 11:00:12,608 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=685253.3333333334, ans=0.1 2023-10-11 11:00:22,867 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=685300.0, ans=0.125 2023-10-11 11:00:24,215 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=685300.0, ans=0.2 2023-10-11 11:00:30,629 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=685300.0, ans=0.125 2023-10-11 11:00:32,724 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=685300.0, ans=0.125 2023-10-11 11:00:42,381 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.08 vs. limit=22.5 2023-10-11 11:00:56,933 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn1.whiten.whitening_limit, batch_count=685440.0, ans=22.5 2023-10-11 11:01:11,014 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=685486.6666666666, ans=0.0 2023-10-11 11:01:11,814 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=685486.6666666666, ans=0.125 2023-10-11 11:01:23,343 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.18 vs. limit=15.0 2023-10-11 11:01:25,146 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=685533.3333333334, ans=0.0 2023-10-11 11:01:26,071 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=685533.3333333334, ans=0.125 2023-10-11 11:01:33,279 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.37 vs. limit=15.0 2023-10-11 11:01:39,234 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=685626.6666666666, ans=0.125 2023-10-11 11:01:42,523 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=685626.6666666666, ans=0.125 2023-10-11 11:01:47,364 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=685626.6666666666, ans=0.0 2023-10-11 11:01:58,904 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.386e+02 1.695e+02 1.860e+02 2.112e+02 3.160e+02, threshold=3.721e+02, percent-clipped=0.0 2023-10-11 11:02:02,658 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.64 vs. 
limit=15.0 2023-10-11 11:02:13,557 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=685766.6666666666, ans=0.1 2023-10-11 11:02:22,514 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.67 vs. limit=15.0 2023-10-11 11:02:28,174 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.92 vs. limit=15.0 2023-10-11 11:02:28,879 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=685813.3333333334, ans=0.07 2023-10-11 11:02:30,509 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=685813.3333333334, ans=0.125 2023-10-11 11:02:55,273 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=685906.6666666666, ans=0.1 2023-10-11 11:03:05,578 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=685953.3333333334, ans=0.0 2023-10-11 11:03:30,362 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.88 vs. limit=15.0 2023-10-11 11:03:32,208 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.71 vs. limit=15.0 2023-10-11 11:03:33,187 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.43 vs. limit=15.0 2023-10-11 11:03:49,319 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=686140.0, ans=0.1 2023-10-11 11:03:55,023 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.25 vs. limit=15.0 2023-10-11 11:03:56,139 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.260e+02 1.652e+02 1.811e+02 2.036e+02 3.092e+02, threshold=3.622e+02, percent-clipped=0.0 2023-10-11 11:04:07,728 INFO [train.py:1031] (3/4) Epoch 11, batch 10500, loss[loss=0.1902, simple_loss=0.2852, pruned_loss=0.04763, over 16972.00 frames. ], tot_loss[loss=0.2033, simple_loss=0.2922, pruned_loss=0.05718, over 32698637.36 frames. ], batch size: 77, lr: 3.20e-03, grad_scale: 32.0 2023-10-11 11:04:14,865 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 11:04:22,575 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=686280.0, ans=0.1 2023-10-11 11:05:12,194 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.45 vs. 
limit=15.0 2023-10-11 11:05:16,530 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=686513.3333333334, ans=0.125 2023-10-11 11:05:35,982 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=686560.0, ans=0.0 2023-10-11 11:05:36,291 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.04 vs. limit=22.5 2023-10-11 11:05:37,548 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.87 vs. limit=15.0 2023-10-11 11:05:46,433 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=686606.6666666666, ans=0.0 2023-10-11 11:05:48,911 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 11:05:51,568 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.423e+02 1.715e+02 1.897e+02 2.272e+02 3.484e+02, threshold=3.793e+02, percent-clipped=0.0 2023-10-11 11:06:08,190 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=686700.0, ans=0.0 2023-10-11 11:06:21,300 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.55 vs. limit=15.0 2023-10-11 11:06:34,634 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=686793.3333333334, ans=0.125 2023-10-11 11:06:55,956 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=686840.0, ans=0.125 2023-10-11 11:06:59,029 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=686886.6666666666, ans=0.1 2023-10-11 11:07:17,280 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=686933.3333333334, ans=0.1 2023-10-11 11:07:33,609 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=687026.6666666666, ans=0.0 2023-10-11 11:07:39,003 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 11:07:39,084 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=687026.6666666666, ans=0.0 2023-10-11 11:07:56,604 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.358e+02 1.681e+02 1.798e+02 2.037e+02 2.693e+02, threshold=3.596e+02, percent-clipped=0.0 2023-10-11 11:08:00,498 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=687120.0, ans=0.0 2023-10-11 11:08:00,544 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=687120.0, ans=0.125 2023-10-11 11:08:11,934 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=687166.6666666666, ans=0.0 2023-10-11 11:08:13,735 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=687166.6666666666, 
ans=0.2 2023-10-11 11:08:17,510 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=687166.6666666666, ans=0.07 2023-10-11 11:08:22,501 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=687213.3333333334, ans=0.025 2023-10-11 11:08:37,498 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.04 vs. limit=15.0 2023-10-11 11:08:45,072 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.40 vs. limit=15.0 2023-10-11 11:08:53,177 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=687306.6666666666, ans=0.04949747468305833 2023-10-11 11:09:02,780 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=687353.3333333334, ans=0.125 2023-10-11 11:09:02,994 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=3.93 vs. limit=12.0 2023-10-11 11:09:06,536 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.71 vs. limit=15.0 2023-10-11 11:09:16,120 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=687446.6666666666, ans=0.1 2023-10-11 11:09:25,424 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=687446.6666666666, ans=0.125 2023-10-11 11:09:32,930 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=687493.3333333334, ans=0.125 2023-10-11 11:09:50,397 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.329e+02 1.729e+02 1.941e+02 2.214e+02 2.848e+02, threshold=3.881e+02, percent-clipped=0.0 2023-10-11 11:10:17,374 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=687680.0, ans=0.2 2023-10-11 11:10:17,617 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.63 vs. limit=15.0 2023-10-11 11:10:23,162 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=687726.6666666666, ans=0.0 2023-10-11 11:10:23,559 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.61 vs. 
limit=12.0 2023-10-11 11:10:39,265 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=687773.3333333334, ans=0.125 2023-10-11 11:10:42,625 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=687773.3333333334, ans=0.0 2023-10-11 11:10:46,251 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 11:10:49,753 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=687820.0, ans=0.0 2023-10-11 11:10:59,110 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=687866.6666666666, ans=6.0 2023-10-11 11:11:00,187 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=687866.6666666666, ans=0.0 2023-10-11 11:11:05,576 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=687913.3333333334, ans=0.07 2023-10-11 11:11:16,938 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=687960.0, ans=0.0 2023-10-11 11:11:18,012 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=687960.0, ans=0.125 2023-10-11 11:11:29,924 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=688006.6666666666, ans=0.125 2023-10-11 11:11:29,975 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=688006.6666666666, ans=0.125 2023-10-11 11:11:30,032 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=688006.6666666666, ans=0.0 2023-10-11 11:11:35,482 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=688006.6666666666, ans=0.125 2023-10-11 11:11:38,879 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.399e+02 1.662e+02 1.873e+02 2.031e+02 2.688e+02, threshold=3.747e+02, percent-clipped=0.0 2023-10-11 11:11:52,722 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=688100.0, ans=0.1 2023-10-11 11:12:20,510 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=688193.3333333334, ans=0.95 2023-10-11 11:12:22,226 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 11:12:28,532 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=688240.0, ans=0.1 2023-10-11 11:12:32,658 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=688240.0, ans=0.0 2023-10-11 11:13:06,139 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.20 vs. 
limit=15.0 2023-10-11 11:13:06,848 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=688426.6666666666, ans=0.0 2023-10-11 11:13:11,328 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.11 vs. limit=22.5 2023-10-11 11:13:27,334 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.415e+02 1.652e+02 1.849e+02 2.074e+02 2.989e+02, threshold=3.698e+02, percent-clipped=0.0 2023-10-11 11:13:35,598 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.02 vs. limit=12.0 2023-10-11 11:13:40,392 INFO [train.py:1031] (3/4) Epoch 11, batch 11000, loss[loss=0.2108, simple_loss=0.265, pruned_loss=0.07828, over 12203.00 frames. ], tot_loss[loss=0.2034, simple_loss=0.2922, pruned_loss=0.05727, over 32714262.24 frames. ], batch size: 440, lr: 3.20e-03, grad_scale: 32.0 2023-10-11 11:13:47,881 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=688566.6666666666, ans=0.1 2023-10-11 11:14:02,229 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.11 vs. limit=6.0 2023-10-11 11:14:13,856 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=688706.6666666666, ans=0.125 2023-10-11 11:14:14,604 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=688706.6666666666, ans=0.125 2023-10-11 11:14:14,998 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.94 vs. limit=15.0 2023-10-11 11:14:19,396 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.60 vs. limit=15.0 2023-10-11 11:14:20,488 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.24 vs. limit=15.0 2023-10-11 11:14:36,706 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=688800.0, ans=0.0 2023-10-11 11:14:40,175 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=688800.0, ans=0.125 2023-10-11 11:14:56,151 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=688846.6666666666, ans=0.1 2023-10-11 11:15:02,895 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=688893.3333333334, ans=0.0 2023-10-11 11:15:03,172 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.26 vs. 
limit=15.0 2023-10-11 11:15:06,799 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=688893.3333333334, ans=0.1 2023-10-11 11:15:23,117 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.346e+02 1.777e+02 1.963e+02 2.195e+02 3.096e+02, threshold=3.925e+02, percent-clipped=0.0 2023-10-11 11:15:47,844 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=689033.3333333334, ans=0.125 2023-10-11 11:15:54,027 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=689080.0, ans=0.2 2023-10-11 11:16:11,408 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=689126.6666666666, ans=0.2 2023-10-11 11:16:11,450 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=689126.6666666666, ans=0.0 2023-10-11 11:16:17,589 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=689173.3333333334, ans=0.1 2023-10-11 11:16:40,919 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=689266.6666666666, ans=0.125 2023-10-11 11:16:41,362 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.23 vs. limit=22.5 2023-10-11 11:16:46,831 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=689266.6666666666, ans=0.0 2023-10-11 11:16:53,908 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=689313.3333333334, ans=0.125 2023-10-11 11:16:56,697 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=689313.3333333334, ans=0.1 2023-10-11 11:17:04,508 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=689360.0, ans=0.0 2023-10-11 11:17:11,760 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 11:17:17,669 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.22 vs. limit=10.0 2023-10-11 11:17:23,347 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=689406.6666666666, ans=0.125 2023-10-11 11:17:24,443 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.311e+02 1.604e+02 1.777e+02 1.987e+02 3.348e+02, threshold=3.553e+02, percent-clipped=0.0 2023-10-11 11:17:26,771 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=689453.3333333334, ans=0.0 2023-10-11 11:17:31,626 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.40 vs. 
limit=15.0 2023-10-11 11:17:32,994 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=689453.3333333334, ans=0.125 2023-10-11 11:17:39,451 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=689500.0, ans=10.0 2023-10-11 11:17:41,461 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=689500.0, ans=0.04949747468305833 2023-10-11 11:17:42,534 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=689500.0, ans=0.125 2023-10-11 11:18:01,953 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.35 vs. limit=15.0 2023-10-11 11:18:25,094 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=689686.6666666666, ans=0.125 2023-10-11 11:18:46,334 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=689780.0, ans=0.035 2023-10-11 11:19:16,119 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=689873.3333333334, ans=0.025 2023-10-11 11:19:16,177 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=689873.3333333334, ans=0.2 2023-10-11 11:19:19,562 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.377e+02 1.634e+02 1.766e+02 1.938e+02 2.673e+02, threshold=3.533e+02, percent-clipped=0.0 2023-10-11 11:19:32,533 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=689920.0, ans=0.125 2023-10-11 11:19:34,486 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=689966.6666666666, ans=0.125 2023-10-11 11:19:36,345 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.56 vs. limit=15.0 2023-10-11 11:19:41,396 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=689966.6666666666, ans=0.0 2023-10-11 11:19:43,361 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=689966.6666666666, ans=0.2 2023-10-11 11:19:59,147 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=690060.0, ans=0.0 2023-10-11 11:20:04,708 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.88 vs. limit=15.0 2023-10-11 11:20:05,423 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.54 vs. limit=10.0 2023-10-11 11:20:07,919 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=690106.6666666666, ans=0.125 2023-10-11 11:20:17,841 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.32 vs. 
limit=15.0 2023-10-11 11:20:25,575 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=690153.3333333334, ans=0.0 2023-10-11 11:20:25,617 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=690153.3333333334, ans=0.1 2023-10-11 11:20:34,713 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.10 vs. limit=15.0 2023-10-11 11:20:43,244 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.92 vs. limit=15.0 2023-10-11 11:21:14,228 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.349e+02 1.664e+02 1.841e+02 2.018e+02 2.858e+02, threshold=3.681e+02, percent-clipped=0.0 2023-10-11 11:21:20,376 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=690386.6666666666, ans=0.0 2023-10-11 11:21:25,196 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.71 vs. limit=15.0 2023-10-11 11:21:36,502 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 11:21:46,558 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.86 vs. limit=15.0 2023-10-11 11:22:07,347 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=690573.3333333334, ans=0.125 2023-10-11 11:22:18,934 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=690620.0, ans=0.1 2023-10-11 11:22:34,481 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=2.97 vs. limit=12.0 2023-10-11 11:22:42,142 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=690760.0, ans=0.2 2023-10-11 11:22:45,316 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=690760.0, ans=0.0 2023-10-11 11:22:51,368 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=690760.0, ans=0.0 2023-10-11 11:23:04,499 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.387e+02 1.753e+02 1.926e+02 2.155e+02 2.473e+02, threshold=3.852e+02, percent-clipped=0.0 2023-10-11 11:23:11,323 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.33 vs. limit=15.0 2023-10-11 11:23:15,396 INFO [train.py:1031] (3/4) Epoch 11, batch 11500, loss[loss=0.198, simple_loss=0.2937, pruned_loss=0.05112, over 16582.00 frames. ], tot_loss[loss=0.2031, simple_loss=0.2919, pruned_loss=0.05714, over 32736416.63 frames. 
], batch size: 66, lr: 3.19e-03, grad_scale: 32.0 2023-10-11 11:23:15,768 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=690900.0, ans=0.125 2023-10-11 11:23:39,742 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=690993.3333333334, ans=0.1 2023-10-11 11:23:44,254 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=690993.3333333334, ans=0.125 2023-10-11 11:23:52,286 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=691040.0, ans=6.0 2023-10-11 11:24:06,462 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=691086.6666666666, ans=0.0 2023-10-11 11:24:19,269 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.08 vs. limit=15.0 2023-10-11 11:24:38,572 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=691226.6666666666, ans=0.125 2023-10-11 11:25:00,850 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.295e+02 1.654e+02 1.866e+02 2.027e+02 2.889e+02, threshold=3.733e+02, percent-clipped=0.0 2023-10-11 11:25:06,874 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.66 vs. limit=22.5 2023-10-11 11:25:10,939 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=691320.0, ans=0.2 2023-10-11 11:25:13,344 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.48 vs. limit=15.0 2023-10-11 11:25:20,289 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 11:25:26,622 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=691413.3333333334, ans=0.125 2023-10-11 11:25:42,973 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=691460.0, ans=0.1 2023-10-11 11:26:24,914 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=691646.6666666666, ans=0.1 2023-10-11 11:26:54,799 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.452e+02 1.728e+02 1.884e+02 2.043e+02 2.720e+02, threshold=3.767e+02, percent-clipped=0.0 2023-10-11 11:27:05,101 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.79 vs. limit=6.0 2023-10-11 11:27:08,407 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.02 vs. 
limit=15.0 2023-10-11 11:27:13,405 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=691833.3333333334, ans=0.04949747468305833 2023-10-11 11:27:17,662 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=691880.0, ans=0.125 2023-10-11 11:27:19,744 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=691880.0, ans=0.125 2023-10-11 11:27:53,051 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=692020.0, ans=0.125 2023-10-11 11:28:06,646 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=692066.6666666666, ans=0.1 2023-10-11 11:28:17,100 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=692113.3333333334, ans=0.0 2023-10-11 11:28:25,622 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.34 vs. limit=15.0 2023-10-11 11:28:48,560 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=692206.6666666666, ans=0.0 2023-10-11 11:28:53,535 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=692206.6666666666, ans=0.0 2023-10-11 11:28:57,559 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.357e+02 1.693e+02 1.881e+02 2.177e+02 3.153e+02, threshold=3.761e+02, percent-clipped=0.0 2023-10-11 11:28:59,187 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=4.75 vs. limit=12.0 2023-10-11 11:29:19,060 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=692346.6666666666, ans=0.125 2023-10-11 11:29:52,185 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=692440.0, ans=0.07 2023-10-11 11:30:00,784 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=692486.6666666666, ans=0.0 2023-10-11 11:30:03,553 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.60 vs. limit=15.0 2023-10-11 11:30:11,727 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.43 vs. 
limit=22.5 2023-10-11 11:30:34,212 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=692626.6666666666, ans=0.95 2023-10-11 11:30:34,246 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=692626.6666666666, ans=0.125 2023-10-11 11:30:55,115 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.414e+02 1.618e+02 1.758e+02 1.943e+02 2.614e+02, threshold=3.515e+02, percent-clipped=0.0 2023-10-11 11:31:09,263 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=692766.6666666666, ans=0.1 2023-10-11 11:31:10,083 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=692766.6666666666, ans=0.1 2023-10-11 11:31:17,622 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.47 vs. limit=15.0 2023-10-11 11:31:18,507 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=692813.3333333334, ans=0.0 2023-10-11 11:31:21,240 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=692813.3333333334, ans=0.0 2023-10-11 11:31:48,149 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=692953.3333333334, ans=0.125 2023-10-11 11:32:09,949 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=693000.0, ans=0.1 2023-10-11 11:32:37,395 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=693140.0, ans=0.125 2023-10-11 11:32:46,664 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.365e+02 1.673e+02 1.812e+02 2.109e+02 2.831e+02, threshold=3.623e+02, percent-clipped=0.0 2023-10-11 11:32:56,970 INFO [train.py:1031] (3/4) Epoch 11, batch 12000, loss[loss=0.1897, simple_loss=0.2824, pruned_loss=0.04855, over 16825.00 frames. ], tot_loss[loss=0.2029, simple_loss=0.292, pruned_loss=0.05695, over 32752482.99 frames. ], batch size: 98, lr: 3.19e-03, grad_scale: 32.0 2023-10-11 11:33:00,344 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=693233.3333333334, ans=0.125 2023-10-11 11:33:04,458 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=693233.3333333334, ans=0.2 2023-10-11 11:33:17,638 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.00 vs. 
limit=15.0 2023-10-11 11:33:20,661 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=693326.6666666666, ans=10.0 2023-10-11 11:33:42,914 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=693420.0, ans=0.125 2023-10-11 11:33:44,782 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=693420.0, ans=0.2 2023-10-11 11:33:53,709 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=693420.0, ans=0.0 2023-10-11 11:33:57,895 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=693466.6666666666, ans=0.125 2023-10-11 11:34:01,235 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=693466.6666666666, ans=0.125 2023-10-11 11:34:07,916 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=693513.3333333334, ans=0.0 2023-10-11 11:34:19,068 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=693560.0, ans=0.125 2023-10-11 11:34:28,462 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.66 vs. limit=22.5 2023-10-11 11:34:30,353 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.53 vs. limit=15.0 2023-10-11 11:34:37,920 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.58 vs. limit=15.0 2023-10-11 11:34:38,667 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=693606.6666666666, ans=0.125 2023-10-11 11:34:38,689 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=693606.6666666666, ans=0.125 2023-10-11 11:34:41,069 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.342e+02 1.693e+02 1.835e+02 2.193e+02 3.345e+02, threshold=3.671e+02, percent-clipped=0.0 2023-10-11 11:34:43,747 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=693653.3333333334, ans=0.0 2023-10-11 11:35:02,988 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=693746.6666666666, ans=0.0 2023-10-11 11:35:13,368 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=693793.3333333334, ans=0.2 2023-10-11 11:35:19,888 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=693793.3333333334, ans=0.2 2023-10-11 11:35:23,419 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=693840.0, ans=0.125 2023-10-11 11:35:38,974 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.34 vs. 
limit=6.0 2023-10-11 11:35:39,700 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=693886.6666666666, ans=0.0 2023-10-11 11:35:59,370 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.67 vs. limit=15.0 2023-10-11 11:35:59,724 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=693980.0, ans=0.125 2023-10-11 11:36:00,884 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.81 vs. limit=15.0 2023-10-11 11:36:04,931 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=694026.6666666666, ans=0.0 2023-10-11 11:36:10,116 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.72 vs. limit=15.0 2023-10-11 11:36:25,970 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.418e+02 1.647e+02 1.757e+02 1.953e+02 2.933e+02, threshold=3.514e+02, percent-clipped=0.0 2023-10-11 11:36:45,538 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=694213.3333333334, ans=0.1 2023-10-11 11:36:45,646 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=694213.3333333334, ans=0.125 2023-10-11 11:36:48,272 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=694213.3333333334, ans=0.0 2023-10-11 11:36:51,236 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.72 vs. limit=22.5 2023-10-11 11:37:00,981 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=694260.0, ans=0.0 2023-10-11 11:37:39,040 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.10 vs. 
limit=15.0 2023-10-11 11:38:12,799 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.474e+02 1.687e+02 1.908e+02 2.116e+02 2.786e+02, threshold=3.816e+02, percent-clipped=0.0 2023-10-11 11:38:34,866 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=694680.0, ans=0.0 2023-10-11 11:38:37,725 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=694680.0, ans=0.125 2023-10-11 11:39:00,657 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=694773.3333333334, ans=0.0 2023-10-11 11:39:42,269 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=694913.3333333334, ans=0.125 2023-10-11 11:39:42,383 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=694913.3333333334, ans=0.125 2023-10-11 11:39:45,820 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=694960.0, ans=0.125 2023-10-11 11:39:49,623 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=694960.0, ans=0.125 2023-10-11 11:39:53,591 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=11.69 vs. limit=15.0 2023-10-11 11:39:55,127 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=695006.6666666666, ans=0.015 2023-10-11 11:39:56,127 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=695006.6666666666, ans=0.2 2023-10-11 11:40:07,794 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.52 vs. limit=6.0 2023-10-11 11:40:08,139 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.342e+02 1.753e+02 1.944e+02 2.242e+02 3.448e+02, threshold=3.888e+02, percent-clipped=0.0 2023-10-11 11:40:57,492 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.96 vs. 
limit=6.0 2023-10-11 11:40:58,041 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=695240.0, ans=0.125 2023-10-11 11:40:59,934 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=695240.0, ans=0.125 2023-10-11 11:41:02,113 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=695286.6666666666, ans=0.125 2023-10-11 11:41:21,046 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=695333.3333333334, ans=0.125 2023-10-11 11:41:22,095 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=695333.3333333334, ans=0.1 2023-10-11 11:41:36,344 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=695380.0, ans=0.125 2023-10-11 11:41:37,840 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=695426.6666666666, ans=0.0 2023-10-11 11:41:40,119 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=695426.6666666666, ans=0.2 2023-10-11 11:42:02,736 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.497e+02 1.707e+02 1.866e+02 2.136e+02 2.912e+02, threshold=3.731e+02, percent-clipped=0.0 2023-10-11 11:42:11,899 INFO [train.py:1031] (3/4) Epoch 11, batch 12500, loss[loss=0.2103, simple_loss=0.2961, pruned_loss=0.0623, over 16612.00 frames. ], tot_loss[loss=0.2029, simple_loss=0.2917, pruned_loss=0.05704, over 32754384.63 frames. ], batch size: 219, lr: 3.18e-03, grad_scale: 16.0 2023-10-11 11:42:16,191 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=695566.6666666666, ans=0.0 2023-10-11 11:42:29,822 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=695613.3333333334, ans=0.05 2023-10-11 11:42:48,993 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=695706.6666666666, ans=0.1 2023-10-11 11:42:57,562 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=695753.3333333334, ans=0.125 2023-10-11 11:43:00,280 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=695753.3333333334, ans=0.125 2023-10-11 11:43:14,256 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=695846.6666666666, ans=0.125 2023-10-11 11:43:17,717 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=695846.6666666666, ans=0.1 2023-10-11 11:43:37,799 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=695940.0, ans=0.0 2023-10-11 11:43:42,636 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.73 vs. 
limit=6.0 2023-10-11 11:43:43,596 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=695940.0, ans=0.1 2023-10-11 11:43:48,901 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.392e+02 1.705e+02 1.875e+02 2.066e+02 3.692e+02, threshold=3.749e+02, percent-clipped=0.0 2023-10-11 11:43:50,928 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=695986.6666666666, ans=0.125 2023-10-11 11:43:51,777 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=695986.6666666666, ans=0.125 2023-10-11 11:44:12,975 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=696080.0, ans=0.125 2023-10-11 11:44:24,943 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.20 vs. limit=22.5 2023-10-11 11:44:28,424 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.39 vs. limit=15.0 2023-10-11 11:44:36,817 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 11:44:37,055 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.75 vs. limit=15.0 2023-10-11 11:44:42,991 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.25 vs. limit=10.0 2023-10-11 11:45:01,355 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=696266.6666666666, ans=0.0 2023-10-11 11:45:04,120 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.81 vs. limit=15.0 2023-10-11 11:45:04,199 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.74 vs. limit=15.0 2023-10-11 11:45:26,265 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.87 vs. 
limit=6.0 2023-10-11 11:45:26,925 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=696360.0, ans=0.125 2023-10-11 11:45:39,379 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=696406.6666666666, ans=0.0 2023-10-11 11:45:48,900 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.443e+02 1.671e+02 1.811e+02 1.977e+02 2.636e+02, threshold=3.621e+02, percent-clipped=0.0 2023-10-11 11:46:08,360 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=696546.6666666666, ans=0.0 2023-10-11 11:46:12,206 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=696546.6666666666, ans=0.125 2023-10-11 11:46:25,393 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=696593.3333333334, ans=0.0 2023-10-11 11:46:35,834 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=696640.0, ans=0.0 2023-10-11 11:46:48,029 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=696686.6666666666, ans=0.1 2023-10-11 11:46:58,309 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=696733.3333333334, ans=0.125 2023-10-11 11:47:00,878 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=696733.3333333334, ans=0.125 2023-10-11 11:47:03,905 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=696733.3333333334, ans=0.125 2023-10-11 11:47:25,983 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=696873.3333333334, ans=0.125 2023-10-11 11:47:34,264 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=696873.3333333334, ans=0.125 2023-10-11 11:47:38,576 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=696920.0, ans=0.125 2023-10-11 11:47:39,260 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.412e+02 1.753e+02 1.969e+02 2.182e+02 3.072e+02, threshold=3.939e+02, percent-clipped=0.0 2023-10-11 11:47:50,232 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=696966.6666666666, ans=0.125 2023-10-11 11:48:15,844 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=697060.0, ans=0.125 2023-10-11 11:48:25,732 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=697106.6666666666, ans=0.125 2023-10-11 11:48:35,788 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=697153.3333333334, ans=0.125 2023-10-11 11:48:38,533 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=697153.3333333334, ans=0.125 2023-10-11 11:48:38,997 INFO [scaling.py:979] (3/4) Whitening: 
name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.63 vs. limit=15.0 2023-10-11 11:48:40,225 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=697200.0, ans=0.1 2023-10-11 11:48:42,978 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=697200.0, ans=0.1 2023-10-11 11:48:56,688 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.76 vs. limit=22.5 2023-10-11 11:49:00,797 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=697246.6666666666, ans=0.1 2023-10-11 11:49:02,934 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=697246.6666666666, ans=0.125 2023-10-11 11:49:17,936 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=697340.0, ans=0.2 2023-10-11 11:49:18,055 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=697340.0, ans=15.0 2023-10-11 11:49:21,379 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=697340.0, ans=0.07 2023-10-11 11:49:31,460 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.428e+02 1.670e+02 1.834e+02 2.087e+02 3.451e+02, threshold=3.668e+02, percent-clipped=0.0 2023-10-11 11:49:34,533 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=697386.6666666666, ans=0.1 2023-10-11 11:49:51,332 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=697480.0, ans=0.0 2023-10-11 11:49:56,657 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.81 vs. limit=15.0 2023-10-11 11:49:57,318 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 11:50:04,904 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.47 vs. limit=15.0 2023-10-11 11:50:19,309 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=697573.3333333334, ans=0.125 2023-10-11 11:50:20,460 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.09 vs. limit=15.0 2023-10-11 11:50:22,114 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=697620.0, ans=0.0 2023-10-11 11:50:24,494 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=697620.0, ans=0.125 2023-10-11 11:50:57,073 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.75 vs. 
limit=15.0 2023-10-11 11:51:13,483 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=697806.6666666666, ans=0.1 2023-10-11 11:51:18,484 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.411e+02 1.741e+02 1.907e+02 2.098e+02 3.619e+02, threshold=3.814e+02, percent-clipped=0.0 2023-10-11 11:51:22,561 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=697853.3333333334, ans=0.125 2023-10-11 11:51:26,936 INFO [train.py:1031] (3/4) Epoch 11, batch 13000, loss[loss=0.1844, simple_loss=0.2829, pruned_loss=0.04297, over 16769.00 frames. ], tot_loss[loss=0.2033, simple_loss=0.2922, pruned_loss=0.05718, over 32728625.65 frames. ], batch size: 81, lr: 3.18e-03, grad_scale: 32.0 2023-10-11 11:51:42,288 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=697946.6666666666, ans=0.125 2023-10-11 11:51:43,188 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=697946.6666666666, ans=0.125 2023-10-11 11:52:12,421 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.94 vs. limit=15.0 2023-10-11 11:52:13,238 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=698040.0, ans=0.125 2023-10-11 11:52:51,105 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.13 vs. limit=15.0 2023-10-11 11:53:01,186 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=698226.6666666666, ans=0.1 2023-10-11 11:53:03,430 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=698226.6666666666, ans=0.125 2023-10-11 11:53:03,554 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=698226.6666666666, ans=0.125 2023-10-11 11:53:03,766 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.70 vs. limit=15.0 2023-10-11 11:53:06,200 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.54 vs. limit=22.5 2023-10-11 11:53:20,289 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=7.15 vs. 
limit=15.0 2023-10-11 11:53:22,185 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.324e+02 1.683e+02 1.884e+02 2.147e+02 2.915e+02, threshold=3.768e+02, percent-clipped=0.0 2023-10-11 11:53:27,430 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=698320.0, ans=0.125 2023-10-11 11:53:41,606 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=698413.3333333334, ans=0.125 2023-10-11 11:53:53,698 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=698460.0, ans=0.1 2023-10-11 11:54:01,597 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=698506.6666666666, ans=0.1 2023-10-11 11:54:15,028 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=698553.3333333334, ans=0.5 2023-10-11 11:54:15,995 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=698553.3333333334, ans=0.0 2023-10-11 11:54:17,594 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=698553.3333333334, ans=0.0 2023-10-11 11:54:20,592 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.34 vs. limit=6.0 2023-10-11 11:54:34,024 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=698600.0, ans=0.2 2023-10-11 11:54:35,050 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 11:54:39,773 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 11:54:40,659 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=698646.6666666666, ans=0.125 2023-10-11 11:54:47,093 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.99 vs. limit=15.0 2023-10-11 11:55:18,789 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.310e+02 1.598e+02 1.748e+02 1.943e+02 2.768e+02, threshold=3.496e+02, percent-clipped=0.0 2023-10-11 11:55:51,279 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten.whitening_limit, batch_count=698926.6666666666, ans=15.0 2023-10-11 11:55:51,898 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=698926.6666666666, ans=0.125 2023-10-11 11:56:03,843 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.59 vs. 
limit=22.5 2023-10-11 11:56:05,695 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=698973.3333333334, ans=0.125 2023-10-11 11:56:17,236 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=699020.0, ans=0.0 2023-10-11 11:56:17,266 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=699020.0, ans=0.0 2023-10-11 11:56:19,231 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=699020.0, ans=0.1 2023-10-11 11:56:46,679 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=699160.0, ans=0.125 2023-10-11 11:56:47,012 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.19 vs. limit=15.0 2023-10-11 11:57:06,629 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=699206.6666666666, ans=0.125 2023-10-11 11:57:10,610 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.322e+02 1.665e+02 1.851e+02 2.207e+02 3.448e+02, threshold=3.701e+02, percent-clipped=0.0 2023-10-11 11:57:12,984 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.77 vs. limit=12.0 2023-10-11 11:57:30,584 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=699300.0, ans=0.125 2023-10-11 11:57:44,713 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=699393.3333333334, ans=0.1 2023-10-11 11:57:51,953 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=699393.3333333334, ans=0.2 2023-10-11 11:58:02,230 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=699440.0, ans=0.1 2023-10-11 11:58:25,341 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=699580.0, ans=0.0 2023-10-11 11:58:36,928 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=699626.6666666666, ans=0.125 2023-10-11 11:59:00,564 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.407e+02 1.696e+02 1.847e+02 2.051e+02 2.903e+02, threshold=3.694e+02, percent-clipped=0.0 2023-10-11 11:59:04,644 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=699720.0, ans=0.125 2023-10-11 11:59:06,597 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=699720.0, ans=0.1 2023-10-11 11:59:08,532 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=699766.6666666666, ans=0.0 2023-10-11 11:59:35,368 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=699860.0, ans=0.125 2023-10-11 11:59:58,298 INFO [scaling.py:199] (3/4) 
ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=699953.3333333334, ans=0.025 2023-10-11 12:00:26,010 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=700046.6666666666, ans=0.125 2023-10-11 12:00:38,654 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=700140.0, ans=0.1 2023-10-11 12:00:42,185 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.59 vs. limit=15.0 2023-10-11 12:00:42,915 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=700140.0, ans=0.125 2023-10-11 12:00:52,279 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.359e+02 1.691e+02 1.846e+02 2.124e+02 3.601e+02, threshold=3.693e+02, percent-clipped=0.0 2023-10-11 12:00:55,138 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 12:01:00,734 INFO [train.py:1031] (3/4) Epoch 11, batch 13500, loss[loss=0.1963, simple_loss=0.2858, pruned_loss=0.05339, over 16920.00 frames. ], tot_loss[loss=0.2027, simple_loss=0.2916, pruned_loss=0.05696, over 32719249.99 frames. ], batch size: 130, lr: 3.17e-03, grad_scale: 32.0 2023-10-11 12:01:30,184 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=700326.6666666666, ans=0.125 2023-10-11 12:01:34,906 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=700326.6666666666, ans=0.0 2023-10-11 12:01:52,730 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 12:01:59,755 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=700466.6666666666, ans=0.0 2023-10-11 12:02:05,019 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.min_positive, batch_count=700466.6666666666, ans=0.05 2023-10-11 12:02:12,213 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=700513.3333333334, ans=0.125 2023-10-11 12:02:17,126 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=700513.3333333334, ans=0.2 2023-10-11 12:02:19,694 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=700513.3333333334, ans=0.0 2023-10-11 12:02:33,682 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=700606.6666666666, ans=15.0 2023-10-11 12:02:44,823 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=700653.3333333334, ans=0.0 2023-10-11 12:02:47,575 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.294e+02 1.767e+02 1.991e+02 2.377e+02 3.541e+02, threshold=3.981e+02, percent-clipped=0.0 2023-10-11 12:02:52,776 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=700653.3333333334, ans=0.125 
2023-10-11 12:02:53,093 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.02 vs. limit=10.0 2023-10-11 12:03:01,439 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=700700.0, ans=0.1 2023-10-11 12:03:14,374 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.69 vs. limit=10.0 2023-10-11 12:03:21,169 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=700793.3333333334, ans=0.125 2023-10-11 12:03:21,385 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.23 vs. limit=15.0 2023-10-11 12:03:34,675 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.40 vs. limit=22.5 2023-10-11 12:04:16,812 INFO [train.py:1031] (3/4) Epoch 12, batch 0, loss[loss=0.1701, simple_loss=0.2633, pruned_loss=0.03842, over 15997.00 frames. ], tot_loss[loss=0.1701, simple_loss=0.2633, pruned_loss=0.03842, over 15997.00 frames. ], batch size: 43, lr: 3.02e-03, grad_scale: 32.0 2023-10-11 12:04:16,813 INFO [train.py:1054] (3/4) Computing validation loss 2023-10-11 12:04:24,789 INFO [train.py:1063] (3/4) Epoch 12, validation: loss=0.2194, simple_loss=0.3063, pruned_loss=0.06626, over 1020973.00 frames. 2023-10-11 12:04:24,791 INFO [train.py:1064] (3/4) Maximum memory allocated so far is 16891MB 2023-10-11 12:04:39,836 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=701003.3333333334, ans=0.07 2023-10-11 12:04:54,548 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=701050.0, ans=0.125 2023-10-11 12:05:09,744 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.514e+02 1.817e+02 2.063e+02 2.337e+02 3.932e+02, threshold=4.127e+02, percent-clipped=0.0 2023-10-11 12:05:10,053 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=701096.6666666666, ans=0.125 2023-10-11 12:05:11,780 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=701096.6666666666, ans=0.125 2023-10-11 12:05:25,673 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=701190.0, ans=0.125 2023-10-11 12:05:42,825 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=701236.6666666666, ans=0.0 2023-10-11 12:06:00,304 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=701330.0, ans=0.0 2023-10-11 12:06:16,782 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=701376.6666666666, ans=0.5 2023-10-11 12:06:53,503 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=701563.3333333334, ans=0.125 2023-10-11 12:07:02,654 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.358e+02 1.661e+02 1.853e+02 2.086e+02 
3.214e+02, threshold=3.707e+02, percent-clipped=0.0 2023-10-11 12:07:07,814 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=701610.0, ans=0.0 2023-10-11 12:07:08,639 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=701610.0, ans=0.125 2023-10-11 12:07:08,729 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=701610.0, ans=0.1 2023-10-11 12:07:42,687 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=701750.0, ans=10.0 2023-10-11 12:07:49,852 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=701796.6666666666, ans=0.2 2023-10-11 12:07:50,652 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 12:07:51,695 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=701796.6666666666, ans=0.2 2023-10-11 12:07:54,357 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.69 vs. limit=6.0 2023-10-11 12:07:57,097 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=701796.6666666666, ans=0.0 2023-10-11 12:08:17,079 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=701890.0, ans=0.2 2023-10-11 12:08:21,585 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 12:08:22,490 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=701936.6666666666, ans=0.0 2023-10-11 12:08:48,703 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.433e+02 1.725e+02 1.904e+02 2.193e+02 3.110e+02, threshold=3.808e+02, percent-clipped=0.0 2023-10-11 12:08:58,184 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.61 vs. limit=10.0 2023-10-11 12:09:09,729 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=702123.3333333334, ans=0.125 2023-10-11 12:09:29,283 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.18 vs. 
limit=15.0 2023-10-11 12:09:38,839 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=702263.3333333334, ans=0.125 2023-10-11 12:09:46,813 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=702263.3333333334, ans=0.1 2023-10-11 12:09:47,905 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=702263.3333333334, ans=0.5 2023-10-11 12:10:00,071 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=702310.0, ans=0.125 2023-10-11 12:10:04,436 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=702356.6666666666, ans=0.125 2023-10-11 12:10:08,417 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.39 vs. limit=22.5 2023-10-11 12:10:24,900 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.21 vs. limit=15.0 2023-10-11 12:10:43,055 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.353e+02 1.710e+02 1.854e+02 2.151e+02 3.128e+02, threshold=3.708e+02, percent-clipped=0.0 2023-10-11 12:10:46,136 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=702543.3333333334, ans=0.0 2023-10-11 12:10:58,827 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=702590.0, ans=0.1 2023-10-11 12:11:04,055 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.79 vs. limit=15.0 2023-10-11 12:11:08,109 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.01 vs. limit=22.5 2023-10-11 12:11:16,956 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.80 vs. limit=15.0 2023-10-11 12:11:31,243 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.49 vs. limit=15.0 2023-10-11 12:11:47,521 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=702776.6666666666, ans=0.2 2023-10-11 12:12:03,439 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 12:12:22,444 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.41 vs. 
limit=15.0 2023-10-11 12:12:30,640 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.445e+02 1.689e+02 1.827e+02 2.030e+02 2.887e+02, threshold=3.654e+02, percent-clipped=0.0 2023-10-11 12:12:49,686 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 12:12:56,335 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=703056.6666666666, ans=0.125 2023-10-11 12:13:13,204 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=703150.0, ans=0.125 2023-10-11 12:13:43,747 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.85 vs. limit=15.0 2023-10-11 12:13:44,078 INFO [train.py:1031] (3/4) Epoch 12, batch 500, loss[loss=0.1729, simple_loss=0.2691, pruned_loss=0.03839, over 16890.00 frames. ], tot_loss[loss=0.2026, simple_loss=0.2917, pruned_loss=0.05671, over 7308322.51 frames. ], batch size: 87, lr: 3.02e-03, grad_scale: 32.0 2023-10-11 12:13:50,659 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=703290.0, ans=0.05 2023-10-11 12:14:01,988 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 12:14:04,048 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=703336.6666666666, ans=0.2 2023-10-11 12:14:12,864 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 12:14:26,561 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.26 vs. limit=15.0 2023-10-11 12:14:26,982 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.475e+02 1.707e+02 1.954e+02 2.239e+02 3.077e+02, threshold=3.908e+02, percent-clipped=0.0 2023-10-11 12:14:28,129 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=703476.6666666666, ans=0.125 2023-10-11 12:14:33,704 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.03 vs. limit=15.0 2023-10-11 12:14:38,148 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=703476.6666666666, ans=0.1 2023-10-11 12:14:39,341 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.09 vs. limit=15.0 2023-10-11 12:14:55,337 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=703570.0, ans=0.125 2023-10-11 12:15:12,371 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=703616.6666666666, ans=0.2 2023-10-11 12:15:12,819 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.74 vs. 
limit=15.0 2023-10-11 12:15:24,151 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=703663.3333333334, ans=0.125 2023-10-11 12:15:36,416 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.14 vs. limit=10.0 2023-10-11 12:15:56,018 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=703803.3333333334, ans=0.09899494936611666 2023-10-11 12:15:57,446 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=703803.3333333334, ans=0.025 2023-10-11 12:16:17,729 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=703896.6666666666, ans=0.125 2023-10-11 12:16:21,820 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.379e+02 1.727e+02 1.857e+02 2.055e+02 2.712e+02, threshold=3.713e+02, percent-clipped=0.0 2023-10-11 12:16:22,987 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=703943.3333333334, ans=0.0 2023-10-11 12:16:41,004 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.19 vs. limit=15.0 2023-10-11 12:16:50,107 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=704036.6666666666, ans=0.1 2023-10-11 12:16:53,815 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=704083.3333333334, ans=0.2 2023-10-11 12:16:58,379 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=704083.3333333334, ans=0.0 2023-10-11 12:17:00,027 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=704083.3333333334, ans=15.0 2023-10-11 12:17:18,536 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=704176.6666666666, ans=0.0 2023-10-11 12:17:24,849 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=704223.3333333334, ans=0.5 2023-10-11 12:17:28,547 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=704223.3333333334, ans=0.1 2023-10-11 12:17:36,277 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=704270.0, ans=0.0 2023-10-11 12:17:37,270 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 12:17:43,991 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=704270.0, ans=0.125 2023-10-11 12:18:00,316 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=704363.3333333334, ans=0.125 2023-10-11 12:18:02,650 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=704363.3333333334, ans=0.0 2023-10-11 12:18:09,609 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.446e+02 1.775e+02 1.993e+02 2.220e+02 
2.926e+02, threshold=3.986e+02, percent-clipped=0.0 2023-10-11 12:18:22,012 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=704410.0, ans=0.2 2023-10-11 12:18:44,426 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=704503.3333333334, ans=0.125 2023-10-11 12:18:47,653 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=704550.0, ans=0.0 2023-10-11 12:18:53,319 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=11.71 vs. limit=22.5 2023-10-11 12:18:53,994 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=704550.0, ans=0.09899494936611666 2023-10-11 12:19:01,860 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=704596.6666666666, ans=0.0 2023-10-11 12:19:01,900 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=704596.6666666666, ans=0.1 2023-10-11 12:19:08,554 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.42 vs. limit=15.0 2023-10-11 12:19:16,235 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=704643.3333333334, ans=0.1 2023-10-11 12:19:18,095 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=704690.0, ans=0.125 2023-10-11 12:19:23,371 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=704690.0, ans=0.0 2023-10-11 12:19:24,380 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=704690.0, ans=0.125 2023-10-11 12:19:45,720 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.58 vs. limit=22.5 2023-10-11 12:19:53,711 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.15 vs. limit=12.0 2023-10-11 12:19:56,758 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=704830.0, ans=0.0 2023-10-11 12:20:06,187 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=704830.0, ans=0.1 2023-10-11 12:20:10,109 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.263e+02 1.682e+02 1.809e+02 2.020e+02 2.684e+02, threshold=3.618e+02, percent-clipped=0.0 2023-10-11 12:20:27,560 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.70 vs. 
limit=15.0 2023-10-11 12:20:41,869 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 12:20:45,810 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=705016.6666666666, ans=0.125 2023-10-11 12:20:51,365 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=705016.6666666666, ans=0.125 2023-10-11 12:20:54,891 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.98 vs. limit=15.0 2023-10-11 12:20:55,657 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=705063.3333333334, ans=0.0 2023-10-11 12:20:55,749 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=705063.3333333334, ans=0.125 2023-10-11 12:21:17,536 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=705156.6666666666, ans=0.04949747468305833 2023-10-11 12:21:21,152 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=705156.6666666666, ans=0.05 2023-10-11 12:21:22,163 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=705156.6666666666, ans=0.0 2023-10-11 12:21:24,064 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=705156.6666666666, ans=0.125 2023-10-11 12:21:38,810 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=705250.0, ans=0.1 2023-10-11 12:21:53,140 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=705296.6666666666, ans=0.1 2023-10-11 12:22:01,798 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.418e+02 1.666e+02 1.897e+02 2.120e+02 3.296e+02, threshold=3.793e+02, percent-clipped=0.0 2023-10-11 12:22:02,959 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=705343.3333333334, ans=0.125 2023-10-11 12:22:28,841 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.71 vs. limit=15.0 2023-10-11 12:22:35,920 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.27 vs. limit=15.0 2023-10-11 12:22:43,609 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=4.80 vs. 
limit=15.0 2023-10-11 12:22:45,244 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=705483.3333333334, ans=0.125 2023-10-11 12:22:52,495 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=705530.0, ans=0.0 2023-10-11 12:22:54,627 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=705530.0, ans=0.1 2023-10-11 12:22:54,776 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=705530.0, ans=0.125 2023-10-11 12:23:09,783 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=705576.6666666666, ans=0.125 2023-10-11 12:23:12,005 INFO [train.py:1031] (3/4) Epoch 12, batch 1000, loss[loss=0.2031, simple_loss=0.2915, pruned_loss=0.05737, over 16631.00 frames. ], tot_loss[loss=0.2026, simple_loss=0.2919, pruned_loss=0.0567, over 12970531.42 frames. ], batch size: 56, lr: 3.01e-03, grad_scale: 32.0 2023-10-11 12:23:12,395 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 12:23:19,653 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=705623.3333333334, ans=0.125 2023-10-11 12:23:26,030 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=705670.0, ans=0.125 2023-10-11 12:23:27,775 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=705670.0, ans=0.0 2023-10-11 12:23:39,090 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=705716.6666666666, ans=0.2 2023-10-11 12:23:52,191 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=705763.3333333334, ans=0.0 2023-10-11 12:23:52,240 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=705763.3333333334, ans=0.125 2023-10-11 12:23:52,742 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.410e+02 1.682e+02 1.814e+02 2.051e+02 2.981e+02, threshold=3.629e+02, percent-clipped=0.0 2023-10-11 12:23:59,444 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.67 vs. 
limit=15.0 2023-10-11 12:24:05,196 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=705856.6666666666, ans=0.0 2023-10-11 12:24:36,988 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=705996.6666666666, ans=0.2 2023-10-11 12:24:44,579 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=706043.3333333334, ans=0.2 2023-10-11 12:24:56,223 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=706090.0, ans=0.125 2023-10-11 12:24:57,192 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=706090.0, ans=0.0 2023-10-11 12:25:01,523 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=706090.0, ans=0.0 2023-10-11 12:25:03,998 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=706090.0, ans=0.0 2023-10-11 12:25:06,006 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.07 vs. limit=22.5 2023-10-11 12:25:34,018 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=706183.3333333334, ans=0.05 2023-10-11 12:25:34,098 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=706183.3333333334, ans=0.2 2023-10-11 12:25:45,817 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=706230.0, ans=0.125 2023-10-11 12:25:46,313 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.395e+02 1.766e+02 1.940e+02 2.109e+02 2.953e+02, threshold=3.880e+02, percent-clipped=0.0 2023-10-11 12:25:49,440 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=706276.6666666666, ans=0.0 2023-10-11 12:25:52,382 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.52 vs. 
limit=22.5 2023-10-11 12:26:15,874 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=706370.0, ans=0.125 2023-10-11 12:26:20,716 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=706370.0, ans=0.125 2023-10-11 12:26:34,178 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=706416.6666666666, ans=0.125 2023-10-11 12:26:42,235 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=706463.3333333334, ans=0.1 2023-10-11 12:26:49,148 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=706510.0, ans=0.125 2023-10-11 12:27:02,579 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=706556.6666666666, ans=0.0 2023-10-11 12:27:05,489 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=706556.6666666666, ans=0.1 2023-10-11 12:27:08,330 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=706556.6666666666, ans=0.125 2023-10-11 12:27:09,286 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=706556.6666666666, ans=0.0 2023-10-11 12:27:10,289 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=706603.3333333334, ans=0.0 2023-10-11 12:27:18,702 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.12 vs. limit=15.0 2023-10-11 12:27:38,378 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.46 vs. limit=12.0 2023-10-11 12:27:40,700 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=706696.6666666666, ans=0.125 2023-10-11 12:27:47,677 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.347e+02 1.603e+02 1.795e+02 2.057e+02 2.830e+02, threshold=3.591e+02, percent-clipped=0.0 2023-10-11 12:27:52,853 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=706743.3333333334, ans=0.0 2023-10-11 12:28:11,491 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=706836.6666666666, ans=0.125 2023-10-11 12:28:16,827 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=706836.6666666666, ans=0.125 2023-10-11 12:28:23,229 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=10.20 vs. 
limit=22.5 2023-10-11 12:28:32,374 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=706930.0, ans=0.015 2023-10-11 12:28:59,079 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=707023.3333333334, ans=0.125 2023-10-11 12:29:34,012 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.259e+02 1.602e+02 1.772e+02 1.976e+02 2.706e+02, threshold=3.544e+02, percent-clipped=0.0 2023-10-11 12:29:45,765 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=707256.6666666666, ans=0.0 2023-10-11 12:30:26,982 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.26 vs. limit=22.5 2023-10-11 12:30:49,116 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.02 vs. limit=6.0 2023-10-11 12:30:57,356 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=707536.6666666666, ans=0.125 2023-10-11 12:30:59,917 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=707536.6666666666, ans=0.0 2023-10-11 12:30:59,953 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=707536.6666666666, ans=0.0 2023-10-11 12:31:02,817 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=707583.3333333334, ans=0.1 2023-10-11 12:31:10,429 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.79 vs. limit=10.0 2023-10-11 12:31:19,888 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=707630.0, ans=0.125 2023-10-11 12:31:25,859 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.382e+02 1.665e+02 1.867e+02 2.077e+02 3.015e+02, threshold=3.733e+02, percent-clipped=0.0 2023-10-11 12:31:38,444 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=707723.3333333334, ans=0.125 2023-10-11 12:31:42,651 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.35 vs. limit=6.0 2023-10-11 12:31:48,207 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.32 vs. limit=15.0 2023-10-11 12:32:04,601 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=707816.6666666666, ans=0.0 2023-10-11 12:32:41,897 INFO [train.py:1031] (3/4) Epoch 12, batch 1500, loss[loss=0.1879, simple_loss=0.278, pruned_loss=0.04888, over 16871.00 frames. ], tot_loss[loss=0.2011, simple_loss=0.2905, pruned_loss=0.05588, over 17389348.04 frames. ], batch size: 110, lr: 3.01e-03, grad_scale: 32.0 2023-10-11 12:32:56,424 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.13 vs. 
2023-10-11 12:32:57,602 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.35 vs. limit=10.0
2023-10-11 12:33:20,767 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=708096.6666666666, ans=0.035
2023-10-11 12:33:20,858 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=708096.6666666666, ans=0.125
2023-10-11 12:33:25,985 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.308e+02 1.649e+02 1.784e+02 2.034e+02 2.580e+02, threshold=3.568e+02, percent-clipped=0.0
2023-10-11 12:33:26,348 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=708143.3333333334, ans=0.0
2023-10-11 12:33:44,941 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=708190.0, ans=0.09899494936611666
2023-10-11 12:33:50,384 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=708236.6666666666, ans=0.125
2023-10-11 12:33:53,094 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=708236.6666666666, ans=0.0
2023-10-11 12:33:54,067 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=708236.6666666666, ans=0.1
2023-10-11 12:34:00,870 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=708283.3333333334, ans=0.125
2023-10-11 12:34:45,531 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.53 vs. limit=10.0
2023-10-11 12:34:54,211 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=708470.0, ans=0.125
2023-10-11 12:34:57,674 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=708516.6666666666, ans=0.125
2023-10-11 12:34:59,583 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=708516.6666666666, ans=0.125
2023-10-11 12:35:08,145 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-11 12:35:24,221 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.407e+02 1.654e+02 1.810e+02 2.078e+02 2.616e+02, threshold=3.619e+02, percent-clipped=0.0
2023-10-11 12:35:37,841 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=708656.6666666666, ans=10.0
2023-10-11 12:35:42,536 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=5.32 vs. limit=12.0
2023-10-11 12:35:46,588 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=708703.3333333334, ans=0.1
2023-10-11 12:36:04,978 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.58 vs. limit=10.0
2023-10-11 12:36:13,040 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=708796.6666666666, ans=0.125
2023-10-11 12:36:17,287 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer_ff3.min_abs, batch_count=708796.6666666666, ans=0.2
2023-10-11 12:36:19,925 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.88 vs. limit=15.0
2023-10-11 12:36:31,091 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=708890.0, ans=0.0
2023-10-11 12:36:33,535 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=708890.0, ans=0.125
2023-10-11 12:36:36,282 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=708890.0, ans=0.04949747468305833
2023-10-11 12:36:47,068 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=708936.6666666666, ans=0.125
2023-10-11 12:36:47,962 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=708936.6666666666, ans=0.125
2023-10-11 12:37:04,916 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=709030.0, ans=0.0
2023-10-11 12:37:06,079 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.51 vs. limit=22.5
2023-10-11 12:37:12,574 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.319e+02 1.679e+02 1.871e+02 2.086e+02 2.821e+02, threshold=3.742e+02, percent-clipped=0.0
2023-10-11 12:37:14,004 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=709076.6666666666, ans=0.125
2023-10-11 12:37:24,605 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=709123.3333333334, ans=0.125
2023-10-11 12:37:25,333 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=709123.3333333334, ans=0.0
2023-10-11 12:37:29,801 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.12 vs. limit=15.0
2023-10-11 12:37:33,609 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.59 vs. limit=22.5
2023-10-11 12:37:36,116 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.25 vs. limit=15.0
2023-10-11 12:38:04,357 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=709263.3333333334, ans=0.05
2023-10-11 12:38:32,693 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=709356.6666666666, ans=0.125
2023-10-11 12:38:38,274 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=709403.3333333334, ans=0.1
2023-10-11 12:38:50,684 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=709450.0, ans=0.125
2023-10-11 12:39:09,701 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.368e+02 1.636e+02 1.801e+02 2.020e+02 3.521e+02, threshold=3.602e+02, percent-clipped=0.0
2023-10-11 12:39:10,007 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=709543.3333333334, ans=0.0
2023-10-11 12:39:15,142 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=709543.3333333334, ans=0.2
2023-10-11 12:39:15,259 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=709543.3333333334, ans=0.1
2023-10-11 12:39:35,267 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=709636.6666666666, ans=0.125
2023-10-11 12:39:54,321 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=709683.3333333334, ans=0.0
2023-10-11 12:39:59,244 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.07 vs. limit=15.0
2023-10-11 12:40:17,122 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=709823.3333333334, ans=0.1
2023-10-11 12:40:32,069 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=709870.0, ans=0.0
2023-10-11 12:40:43,907 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=709916.6666666666, ans=0.0
2023-10-11 12:40:44,742 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=709916.6666666666, ans=0.1
2023-10-11 12:40:52,786 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=709963.3333333334, ans=0.125
2023-10-11 12:41:03,212 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.336e+02 1.751e+02 1.939e+02 2.153e+02 2.870e+02, threshold=3.878e+02, percent-clipped=0.0
2023-10-11 12:41:31,999 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.29 vs. limit=22.5
2023-10-11 12:41:46,148 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=710103.3333333334, ans=0.125
2023-10-11 12:42:06,436 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=710196.6666666666, ans=0.125
2023-10-11 12:42:10,635 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=3.99 vs. limit=12.0
2023-10-11 12:42:13,578 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=710243.3333333334, ans=0.0
2023-10-11 12:42:25,833 INFO [train.py:1031] (3/4) Epoch 12, batch 2000, loss[loss=0.2072, simple_loss=0.2987, pruned_loss=0.05784, over 16879.00 frames. ], tot_loss[loss=0.2014, simple_loss=0.2908, pruned_loss=0.05602, over 20791750.87 frames. ], batch size: 110, lr: 3.00e-03, grad_scale: 32.0
2023-10-11 12:42:40,025 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=710336.6666666666, ans=0.1
2023-10-11 12:42:44,769 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=710336.6666666666, ans=0.1
2023-10-11 12:43:08,439 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=710430.0, ans=0.2
2023-10-11 12:43:20,136 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.22 vs. limit=15.0
2023-10-11 12:43:20,335 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.389e+02 1.700e+02 1.863e+02 2.117e+02 2.937e+02, threshold=3.725e+02, percent-clipped=0.0
2023-10-11 12:43:35,481 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.20 vs. limit=22.5
2023-10-11 12:43:41,632 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.10 vs. limit=15.0
2023-10-11 12:43:50,795 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=710570.0, ans=0.125
2023-10-11 12:43:53,241 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=710570.0, ans=0.0
2023-10-11 12:43:56,999 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.74 vs. limit=22.5
2023-10-11 12:45:05,711 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.60 vs. limit=22.5
2023-10-11 12:45:18,294 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=710850.0, ans=0.125
2023-10-11 12:45:33,758 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=710896.6666666666, ans=0.0
2023-10-11 12:45:35,518 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.356e+02 1.681e+02 1.916e+02 2.186e+02 2.783e+02, threshold=3.832e+02, percent-clipped=0.0
2023-10-11 12:45:43,077 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=710943.3333333334, ans=15.0
2023-10-11 12:45:50,360 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=710990.0, ans=0.0
2023-10-11 12:45:53,821 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=710990.0, ans=0.025
2023-10-11 12:46:34,925 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=711130.0, ans=0.125
2023-10-11 12:46:52,964 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.55 vs. limit=22.5
2023-10-11 12:47:08,231 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=711270.0, ans=0.1
2023-10-11 12:47:09,065 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=711270.0, ans=0.5
2023-10-11 12:47:20,253 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=711363.3333333334, ans=0.125
2023-10-11 12:47:29,141 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=711363.3333333334, ans=0.1
2023-10-11 12:47:31,428 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.384e+02 1.825e+02 1.995e+02 2.262e+02 3.131e+02, threshold=3.989e+02, percent-clipped=0.0
2023-10-11 12:47:36,675 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.66 vs. limit=15.0
2023-10-11 12:47:45,980 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-11 12:47:47,632 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=711456.6666666666, ans=0.0
2023-10-11 12:47:51,740 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=711456.6666666666, ans=0.0
2023-10-11 12:48:02,825 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=711550.0, ans=0.125
2023-10-11 12:48:11,057 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=711550.0, ans=0.2
2023-10-11 12:48:12,623 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=711550.0, ans=0.0
2023-10-11 12:48:19,107 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=711596.6666666666, ans=0.125
2023-10-11 12:48:27,413 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.08 vs. limit=15.0
2023-10-11 12:48:31,965 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=711643.3333333334, ans=0.0
2023-10-11 12:48:35,768 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=711690.0, ans=0.2
2023-10-11 12:48:37,574 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=711690.0, ans=0.125
2023-10-11 12:48:54,167 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=711736.6666666666, ans=0.0
2023-10-11 12:49:03,896 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-11 12:49:16,584 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=711830.0, ans=0.125
2023-10-11 12:49:18,421 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.414e+02 1.760e+02 1.937e+02 2.187e+02 3.216e+02, threshold=3.873e+02, percent-clipped=0.0
2023-10-11 12:49:25,931 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=711876.6666666666, ans=0.2
2023-10-11 12:49:51,729 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=711970.0, ans=0.025
2023-10-11 12:50:03,347 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-11 12:50:04,188 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=712063.3333333334, ans=0.125
2023-10-11 12:50:06,221 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.23 vs. limit=15.0
2023-10-11 12:50:24,100 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=712110.0, ans=0.0
2023-10-11 12:50:46,612 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=712250.0, ans=0.125
2023-10-11 12:50:55,945 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=712250.0, ans=0.2
2023-10-11 12:51:07,202 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.408e+02 1.697e+02 1.854e+02 2.019e+02 2.574e+02, threshold=3.707e+02, percent-clipped=0.0
2023-10-11 12:51:18,755 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=712390.0, ans=0.125
2023-10-11 12:51:28,708 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=712390.0, ans=0.125
2023-10-11 12:52:06,796 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.26 vs. limit=15.0
2023-10-11 12:52:13,096 INFO [train.py:1031] (3/4) Epoch 12, batch 2500, loss[loss=0.1824, simple_loss=0.286, pruned_loss=0.03939, over 16888.00 frames. ], tot_loss[loss=0.202, simple_loss=0.2912, pruned_loss=0.05639, over 23438074.10 frames. ], batch size: 104, lr: 3.00e-03, grad_scale: 32.0
2023-10-11 12:52:13,369 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=712623.3333333334, ans=0.125
2023-10-11 12:52:22,221 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=712670.0, ans=0.0
2023-10-11 12:52:29,212 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-11 12:52:33,818 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=712716.6666666666, ans=0.07
2023-10-11 12:52:40,192 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=712716.6666666666, ans=0.125
2023-10-11 12:52:49,354 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=712763.3333333334, ans=0.125
2023-10-11 12:52:53,977 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.489e+02 1.764e+02 1.957e+02 2.259e+02 3.026e+02, threshold=3.914e+02, percent-clipped=0.0
2023-10-11 12:53:05,537 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=712856.6666666666, ans=0.0
2023-10-11 12:53:36,217 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=712950.0, ans=0.05
2023-10-11 12:53:56,983 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=713043.3333333334, ans=0.1
2023-10-11 12:53:59,214 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=713043.3333333334, ans=0.125
2023-10-11 12:54:30,732 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-11 12:54:36,633 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=713230.0, ans=0.125
2023-10-11 12:54:39,608 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.08 vs. limit=10.0
2023-10-11 12:54:40,188 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=713230.0, ans=0.125
2023-10-11 12:54:46,556 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.345e+02 1.664e+02 1.833e+02 2.136e+02 2.919e+02, threshold=3.666e+02, percent-clipped=0.0
2023-10-11 12:54:54,224 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-11 12:55:23,804 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=713416.6666666666, ans=0.0
2023-10-11 12:55:23,931 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=713416.6666666666, ans=0.2
2023-10-11 12:55:30,566 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=713463.3333333334, ans=0.125
2023-10-11 12:55:37,152 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=713463.3333333334, ans=0.07
2023-10-11 12:55:44,108 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=713510.0, ans=0.125
2023-10-11 12:55:45,626 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff2.min_abs, batch_count=713510.0, ans=0.1
2023-10-11 12:56:09,966 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=713603.3333333334, ans=0.125
2023-10-11 12:56:39,986 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=713696.6666666666, ans=0.09899494936611666
2023-10-11 12:56:40,891 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=713696.6666666666, ans=0.2
2023-10-11 12:56:42,499 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.397e+02 1.692e+02 1.877e+02 2.135e+02 2.987e+02, threshold=3.754e+02, percent-clipped=0.0
2023-10-11 12:57:11,698 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=713836.6666666666, ans=0.125
2023-10-11 12:57:21,529 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=713883.3333333334, ans=0.1
2023-10-11 12:57:39,596 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=713930.0, ans=0.0
2023-10-11 12:57:39,628 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=713930.0, ans=0.125
2023-10-11 12:58:07,905 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=714023.3333333334, ans=0.0
2023-10-11 12:58:09,908 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=714070.0, ans=0.125
2023-10-11 12:58:13,349 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=714070.0, ans=0.0
2023-10-11 12:58:31,655 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=714163.3333333334, ans=0.125
2023-10-11 12:58:33,551 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=714163.3333333334, ans=0.0
2023-10-11 12:58:47,650 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.330e+02 1.684e+02 1.904e+02 2.108e+02 2.908e+02, threshold=3.807e+02, percent-clipped=0.0
2023-10-11 12:59:11,521 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=714303.3333333334, ans=0.125
2023-10-11 12:59:18,210 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=714303.3333333334, ans=0.1
2023-10-11 12:59:18,327 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=714303.3333333334, ans=0.125
2023-10-11 12:59:24,271 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=714350.0, ans=0.0
2023-10-11 12:59:28,629 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=714350.0, ans=0.1
2023-10-11 12:59:35,354 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=714350.0, ans=0.125
2023-10-11 13:00:02,499 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.53 vs. limit=10.0
2023-10-11 13:00:04,968 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.61 vs. limit=15.0
2023-10-11 13:00:10,032 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=714490.0, ans=0.09899494936611666
2023-10-11 13:00:53,019 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.413e+02 1.676e+02 1.838e+02 2.055e+02 2.871e+02, threshold=3.676e+02, percent-clipped=0.0
2023-10-11 13:00:54,483 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=714676.6666666666, ans=0.2
2023-10-11 13:01:01,698 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=714676.6666666666, ans=0.0
2023-10-11 13:01:02,483 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=714676.6666666666, ans=0.125
2023-10-11 13:01:10,881 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-11 13:01:21,181 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.98 vs. limit=15.0
2023-10-11 13:01:58,182 INFO [train.py:1031] (3/4) Epoch 12, batch 3000, loss[loss=0.1898, simple_loss=0.2831, pruned_loss=0.04821, over 16903.00 frames. ], tot_loss[loss=0.2015, simple_loss=0.2903, pruned_loss=0.0564, over 25490784.38 frames. ], batch size: 77, lr: 2.99e-03, grad_scale: 32.0
2023-10-11 13:02:19,415 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=715050.0, ans=0.125
2023-10-11 13:02:40,515 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=715143.3333333334, ans=0.2
2023-10-11 13:02:41,802 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.380e+02 1.650e+02 1.873e+02 2.068e+02 3.296e+02, threshold=3.745e+02, percent-clipped=0.0
2023-10-11 13:03:13,835 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.62 vs. limit=15.0
2023-10-11 13:03:39,399 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=715376.6666666666, ans=0.125
2023-10-11 13:04:17,299 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=715516.6666666666, ans=0.125
2023-10-11 13:04:24,995 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=715516.6666666666, ans=0.0
2023-10-11 13:04:39,776 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.410e+02 1.853e+02 2.050e+02 2.371e+02 3.733e+02, threshold=4.100e+02, percent-clipped=0.0
2023-10-11 13:05:00,854 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=715656.6666666666, ans=0.1
2023-10-11 13:05:37,756 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.12 vs. limit=10.0
2023-10-11 13:05:41,214 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=715843.3333333334, ans=0.125
2023-10-11 13:05:41,264 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=715843.3333333334, ans=0.2
2023-10-11 13:05:46,776 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.47 vs. limit=10.0
2023-10-11 13:05:53,444 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=715890.0, ans=0.125
2023-10-11 13:05:55,015 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=715890.0, ans=0.125
2023-10-11 13:05:55,097 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.21 vs. limit=15.0
2023-10-11 13:06:21,846 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff2.min_abs, batch_count=715983.3333333334, ans=0.1
2023-10-11 13:06:33,172 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=716030.0, ans=0.0
2023-10-11 13:06:40,034 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.260e+02 1.684e+02 1.873e+02 2.026e+02 3.332e+02, threshold=3.746e+02, percent-clipped=0.0
2023-10-11 13:06:56,947 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=716123.3333333334, ans=0.0
2023-10-11 13:07:22,058 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=716216.6666666666, ans=0.125
2023-10-11 13:07:28,535 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=8.96 vs. limit=12.0
2023-10-11 13:07:30,310 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=716263.3333333334, ans=0.2
2023-10-11 13:07:30,426 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=716263.3333333334, ans=0.0
2023-10-11 13:07:33,824 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=716263.3333333334, ans=0.125
2023-10-11 13:07:35,283 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=716263.3333333334, ans=15.0
2023-10-11 13:07:39,764 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=716310.0, ans=0.125
2023-10-11 13:07:45,791 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=716310.0, ans=0.125
2023-10-11 13:07:45,858 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=716310.0, ans=0.0
2023-10-11 13:07:46,814 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=716310.0, ans=0.1
2023-10-11 13:07:54,413 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=716356.6666666666, ans=0.125
2023-10-11 13:08:15,022 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.85 vs. limit=15.0
limit=15.0 2023-10-11 13:08:23,050 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=716450.0, ans=0.0 2023-10-11 13:08:36,710 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.324e+02 1.700e+02 1.946e+02 2.132e+02 2.838e+02, threshold=3.892e+02, percent-clipped=0.0 2023-10-11 13:08:40,783 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=716543.3333333334, ans=0.1 2023-10-11 13:08:44,520 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=716590.0, ans=0.0 2023-10-11 13:08:46,031 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=716590.0, ans=0.1 2023-10-11 13:08:47,845 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=716590.0, ans=0.125 2023-10-11 13:08:49,739 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=716590.0, ans=0.0 2023-10-11 13:09:03,803 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-11 13:09:26,333 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=716730.0, ans=0.125 2023-10-11 13:09:27,767 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=716730.0, ans=0.0 2023-10-11 13:09:33,917 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.66 vs. limit=15.0 2023-10-11 13:09:42,187 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=716776.6666666666, ans=0.125 2023-10-11 13:09:50,353 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=716823.3333333334, ans=0.0 2023-10-11 13:10:27,568 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.26 vs. limit=22.5 2023-10-11 13:10:28,886 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.343e+02 1.724e+02 1.891e+02 2.086e+02 2.802e+02, threshold=3.782e+02, percent-clipped=0.0 2023-10-11 13:10:34,857 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.31 vs. limit=15.0 2023-10-11 13:11:34,840 INFO [train.py:1031] (3/4) Epoch 12, batch 3500, loss[loss=0.2249, simple_loss=0.3112, pruned_loss=0.06927, over 16930.00 frames. ], tot_loss[loss=0.2018, simple_loss=0.2904, pruned_loss=0.05661, over 27116248.87 frames. 
2023-10-11 13:11:54,464 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=717336.6666666666, ans=0.125
2023-10-11 13:11:59,289 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=717383.3333333334, ans=0.2
2023-10-11 13:12:00,229 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=717383.3333333334, ans=0.125
2023-10-11 13:12:01,037 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=717383.3333333334, ans=0.125
2023-10-11 13:12:01,919 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=717383.3333333334, ans=0.5
2023-10-11 13:12:12,789 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=717430.0, ans=0.1
2023-10-11 13:12:18,974 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.495e+02 1.685e+02 1.808e+02 1.987e+02 2.523e+02, threshold=3.616e+02, percent-clipped=0.0
2023-10-11 13:12:19,456 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.05 vs. limit=15.0
2023-10-11 13:12:49,064 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=717570.0, ans=0.125
2023-10-11 13:13:10,017 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=717663.3333333334, ans=0.0
2023-10-11 13:13:15,852 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=717663.3333333334, ans=0.125
2023-10-11 13:13:29,219 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=717710.0, ans=0.2
2023-10-11 13:13:48,645 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=717803.3333333334, ans=0.125
2023-10-11 13:14:08,844 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.74 vs. limit=22.5
2023-10-11 13:14:09,891 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=717896.6666666666, ans=0.1
2023-10-11 13:14:17,974 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.398e+02 1.685e+02 1.910e+02 2.125e+02 3.000e+02, threshold=3.820e+02, percent-clipped=0.0
2023-10-11 13:14:42,179 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=718036.6666666666, ans=0.125
2023-10-11 13:15:04,745 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.36 vs. limit=15.0
2023-10-11 13:15:16,897 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=718176.6666666666, ans=0.0
2023-10-11 13:15:36,653 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.18 vs. limit=15.0
2023-10-11 13:15:51,608 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=718316.6666666666, ans=0.2
2023-10-11 13:16:16,002 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.416e+02 1.657e+02 1.878e+02 2.181e+02 2.770e+02, threshold=3.756e+02, percent-clipped=0.0
2023-10-11 13:16:24,361 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=718456.6666666666, ans=0.05
2023-10-11 13:16:29,405 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=718456.6666666666, ans=0.125
2023-10-11 13:16:46,319 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=718503.3333333334, ans=0.1
2023-10-11 13:16:54,620 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=718550.0, ans=0.2
2023-10-11 13:17:10,516 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=718596.6666666666, ans=0.1
2023-10-11 13:17:13,332 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=718643.3333333334, ans=0.125
2023-10-11 13:17:34,358 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=718690.0, ans=0.0
2023-10-11 13:17:50,344 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=718783.3333333334, ans=0.125
2023-10-11 13:17:54,727 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=14.01 vs. limit=15.0
2023-10-11 13:18:07,118 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=718830.0, ans=0.1
2023-10-11 13:18:13,019 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.361e+02 1.726e+02 1.839e+02 2.012e+02 2.587e+02, threshold=3.677e+02, percent-clipped=0.0
2023-10-11 13:18:23,024 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=718923.3333333334, ans=15.0
2023-10-11 13:18:24,415 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=718923.3333333334, ans=0.05
2023-10-11 13:18:34,349 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=718970.0, ans=0.125
2023-10-11 13:18:38,257 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=718970.0, ans=0.1
2023-10-11 13:18:43,369 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=719016.6666666666, ans=0.0
2023-10-11 13:19:18,669 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=719156.6666666666, ans=0.0
2023-10-11 13:19:24,484 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=719156.6666666666, ans=0.125
2023-10-11 13:19:26,202 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=719156.6666666666, ans=0.125
2023-10-11 13:19:43,042 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=719250.0, ans=0.025
2023-10-11 13:19:53,406 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=719296.6666666666, ans=0.125
2023-10-11 13:19:58,913 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=719296.6666666666, ans=0.125
2023-10-11 13:20:02,790 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.420e+02 1.674e+02 1.932e+02 2.186e+02 2.864e+02, threshold=3.863e+02, percent-clipped=0.0
2023-10-11 13:20:17,748 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=719390.0, ans=0.125
2023-10-11 13:20:45,037 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.17 vs. limit=15.0
2023-10-11 13:20:46,670 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=719530.0, ans=0.2
2023-10-11 13:21:00,272 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=719576.6666666666, ans=0.1
2023-10-11 13:21:09,402 INFO [train.py:1031] (3/4) Epoch 12, batch 4000, loss[loss=0.1806, simple_loss=0.267, pruned_loss=0.04708, over 16492.00 frames. ], tot_loss[loss=0.2019, simple_loss=0.2901, pruned_loss=0.05681, over 28354331.76 frames. ], batch size: 50, lr: 2.98e-03, grad_scale: 32.0
2023-10-11 13:21:15,170 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=719623.3333333334, ans=0.0
2023-10-11 13:21:17,923 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=719623.3333333334, ans=0.1
2023-10-11 13:21:24,036 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=719670.0, ans=0.1
2023-10-11 13:21:29,755 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.96 vs. limit=10.0
2023-10-11 13:21:30,864 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.31 vs. limit=15.0
2023-10-11 13:21:34,208 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=719670.0, ans=0.0
2023-10-11 13:21:34,230 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=719670.0, ans=0.125
2023-10-11 13:21:35,243 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=719716.6666666666, ans=0.0
2023-10-11 13:21:44,022 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.97 vs. limit=10.0
2023-10-11 13:21:51,742 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=719763.3333333334, ans=0.0
2023-10-11 13:21:54,667 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=719763.3333333334, ans=0.0
2023-10-11 13:22:00,917 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.426e+02 1.714e+02 1.907e+02 2.159e+02 2.863e+02, threshold=3.814e+02, percent-clipped=0.0
2023-10-11 13:22:09,053 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=719810.0, ans=0.1
2023-10-11 13:22:24,682 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=719903.3333333334, ans=0.2
2023-10-11 13:22:29,177 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=719903.3333333334, ans=0.125
2023-10-11 13:22:34,250 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=719950.0, ans=0.125
2023-10-11 13:22:57,640 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=720043.3333333334, ans=0.125
2023-10-11 13:22:59,430 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=720043.3333333334, ans=0.0
2023-10-11 13:23:14,307 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=720090.0, ans=0.95
2023-10-11 13:23:14,458 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=720090.0, ans=0.0
2023-10-11 13:23:34,214 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=720183.3333333334, ans=0.125
2023-10-11 13:23:41,573 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=720183.3333333334, ans=0.0
2023-10-11 13:23:57,363 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.478e+02 1.739e+02 1.919e+02 2.150e+02 2.720e+02, threshold=3.838e+02, percent-clipped=0.0
2023-10-11 13:24:40,559 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.86 vs. limit=15.0
2023-10-11 13:25:12,187 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=720510.0, ans=0.035
2023-10-11 13:25:12,310 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=720510.0, ans=0.125
2023-10-11 13:25:21,074 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=720556.6666666666, ans=0.125
2023-10-11 13:25:37,573 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=720603.3333333334, ans=0.0
2023-10-11 13:25:41,561 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=720650.0, ans=0.0
2023-10-11 13:25:46,058 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=720650.0, ans=0.1
2023-10-11 13:26:03,602 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=720696.6666666666, ans=0.04949747468305833
2023-10-11 13:26:09,169 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.391e+02 1.654e+02 1.805e+02 2.012e+02 2.655e+02, threshold=3.609e+02, percent-clipped=0.0
2023-10-11 13:26:20,973 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=720790.0, ans=0.1
2023-10-11 13:26:30,646 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=720836.6666666666, ans=0.2
2023-10-11 13:26:47,296 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten.whitening_limit, batch_count=720883.3333333334, ans=22.5
2023-10-11 13:26:55,625 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.71 vs. limit=22.5
2023-10-11 13:27:07,255 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=720976.6666666666, ans=0.125
2023-10-11 13:27:26,553 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.19 vs. limit=15.0
2023-10-11 13:27:51,652 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=721163.3333333334, ans=0.0
2023-10-11 13:27:58,755 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.359e+02 1.724e+02 1.835e+02 2.048e+02 3.224e+02, threshold=3.670e+02, percent-clipped=0.0
2023-10-11 13:28:00,192 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.88 vs. limit=15.0
2023-10-11 13:28:05,218 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=721210.0, ans=0.1
2023-10-11 13:28:35,969 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=721350.0, ans=0.125
2023-10-11 13:28:36,936 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=721350.0, ans=0.125
2023-10-11 13:28:59,316 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.96 vs. limit=10.0
2023-10-11 13:29:00,075 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=721443.3333333334, ans=0.125
2023-10-11 13:29:02,326 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=721443.3333333334, ans=15.0
2023-10-11 13:29:09,130 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=721490.0, ans=0.0
2023-10-11 13:29:44,854 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=721630.0, ans=0.0
2023-10-11 13:29:58,543 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.415e+02 1.740e+02 1.928e+02 2.135e+02 3.122e+02, threshold=3.856e+02, percent-clipped=0.0
2023-10-11 13:30:33,678 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=721816.6666666666, ans=0.125
2023-10-11 13:30:38,329 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=721816.6666666666, ans=0.125
2023-10-11 13:31:07,442 INFO [train.py:1031] (3/4) Epoch 12, batch 4500, loss[loss=0.1876, simple_loss=0.2827, pruned_loss=0.04629, over 16888.00 frames. ], tot_loss[loss=0.2017, simple_loss=0.2904, pruned_loss=0.05654, over 29353887.32 frames. ], batch size: 130, lr: 2.98e-03, grad_scale: 32.0
2023-10-11 13:31:44,781 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=722096.6666666666, ans=0.125
2023-10-11 13:31:52,213 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.376e+02 1.657e+02 1.796e+02 2.027e+02 2.731e+02, threshold=3.592e+02, percent-clipped=0.0
2023-10-11 13:32:06,126 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=722190.0, ans=0.125
2023-10-11 13:32:21,671 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=722283.3333333334, ans=0.1
2023-10-11 13:32:24,885 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=722283.3333333334, ans=0.125
2023-10-11 13:32:25,996 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.07 vs. limit=15.0
2023-10-11 13:32:34,909 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_na.min_abs, batch_count=722330.0, ans=0.02
2023-10-11 13:32:42,011 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=722376.6666666666, ans=0.1
2023-10-11 13:32:55,001 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=722423.3333333334, ans=0.125
2023-10-11 13:33:08,934 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=722470.0, ans=0.125
2023-10-11 13:33:10,934 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.36 vs. limit=6.0
2023-10-11 13:33:24,805 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.95 vs. limit=15.0
2023-10-11 13:33:27,048 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=722563.3333333334, ans=0.0
2023-10-11 13:33:36,914 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.422e+02 1.712e+02 1.900e+02 2.088e+02 2.791e+02, threshold=3.801e+02, percent-clipped=0.0
2023-10-11 13:33:37,197 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=722610.0, ans=0.1
2023-10-11 13:34:09,535 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.53 vs. limit=10.0
limit=10.0 2023-10-11 13:34:17,357 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=722750.0, ans=0.09899494936611666 2023-10-11 13:34:22,644 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=722796.6666666666, ans=0.2 2023-10-11 13:34:27,977 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=722796.6666666666, ans=0.07 2023-10-11 13:34:37,581 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=722843.3333333334, ans=0.1 2023-10-11 13:34:38,629 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=722843.3333333334, ans=0.125 2023-10-11 13:34:43,246 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=722890.0, ans=0.125 2023-10-11 13:35:03,768 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.40 vs. limit=22.5 2023-10-11 13:35:26,080 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.491e+02 1.721e+02 1.856e+02 1.989e+02 3.133e+02, threshold=3.712e+02, percent-clipped=0.0 2023-10-11 13:35:41,314 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.35 vs. limit=15.0 2023-10-11 13:35:43,331 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=723123.3333333334, ans=0.125 2023-10-11 13:35:47,856 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=723170.0, ans=0.125 2023-10-11 13:36:08,379 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.44 vs. limit=15.0 2023-10-11 13:36:17,315 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=723310.0, ans=0.125 2023-10-11 13:36:28,292 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=723356.6666666666, ans=0.125 2023-10-11 13:36:34,703 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=723356.6666666666, ans=0.0 2023-10-11 13:36:57,860 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.78 vs. limit=10.0 2023-10-11 13:36:58,430 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=723450.0, ans=0.125 2023-10-11 13:37:08,942 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=723496.6666666666, ans=0.0 2023-10-11 13:37:21,497 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.85 vs. 
limit=12.0 2023-10-11 13:37:22,088 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.403e+02 1.674e+02 1.834e+02 2.184e+02 3.536e+02, threshold=3.667e+02, percent-clipped=0.0 2023-10-11 13:37:25,694 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.79 vs. limit=22.5 2023-10-11 13:37:42,714 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.36 vs. limit=15.0 2023-10-11 13:37:44,716 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.57 vs. limit=22.5 2023-10-11 13:37:50,249 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=723636.6666666666, ans=0.125 2023-10-11 13:38:35,734 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=723823.3333333334, ans=0.125 2023-10-11 13:38:37,650 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=723823.3333333334, ans=0.0 2023-10-11 13:38:55,784 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=723916.6666666666, ans=0.05 2023-10-11 13:39:11,320 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=723963.3333333334, ans=0.0 2023-10-11 13:39:17,604 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=724010.0, ans=0.125 2023-10-11 13:39:20,916 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.310e+02 1.668e+02 1.843e+02 2.122e+02 2.826e+02, threshold=3.687e+02, percent-clipped=0.0 2023-10-11 13:39:53,635 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=724150.0, ans=0.025 2023-10-11 13:39:54,634 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=724150.0, ans=0.5 2023-10-11 13:40:00,015 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.84 vs. limit=12.0 2023-10-11 13:40:21,673 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=724243.3333333334, ans=0.125 2023-10-11 13:40:23,649 INFO [train.py:1031] (3/4) Epoch 12, batch 5000, loss[loss=0.2192, simple_loss=0.3087, pruned_loss=0.06482, over 16822.00 frames. ], tot_loss[loss=0.2017, simple_loss=0.2902, pruned_loss=0.05656, over 30128124.23 frames. ], batch size: 188, lr: 2.97e-03, grad_scale: 32.0 2023-10-11 13:40:45,177 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=724336.6666666666, ans=0.125 2023-10-11 13:40:54,819 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.78 vs. 
limit=22.5 2023-10-11 13:41:11,434 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.368e+02 1.712e+02 1.907e+02 2.232e+02 3.440e+02, threshold=3.814e+02, percent-clipped=0.0 2023-10-11 13:41:17,823 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=724476.6666666666, ans=0.0 2023-10-11 13:41:29,781 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=724523.3333333334, ans=0.0 2023-10-11 13:41:40,277 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=724570.0, ans=0.0 2023-10-11 13:41:50,334 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=724616.6666666666, ans=0.0 2023-10-11 13:42:02,355 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=724663.3333333334, ans=0.1 2023-10-11 13:42:31,846 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=724803.3333333334, ans=0.1 2023-10-11 13:42:40,646 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=724850.0, ans=0.2 2023-10-11 13:42:47,134 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=724850.0, ans=0.1 2023-10-11 13:42:47,157 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=724850.0, ans=0.125 2023-10-11 13:43:05,059 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.333e+02 1.712e+02 1.870e+02 2.128e+02 2.757e+02, threshold=3.739e+02, percent-clipped=0.0 2023-10-11 13:43:24,833 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 13:43:26,577 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=725036.6666666666, ans=0.1 2023-10-11 13:43:58,091 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=725176.6666666666, ans=0.2 2023-10-11 13:44:13,499 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=725223.3333333334, ans=0.0 2023-10-11 13:44:15,624 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=725270.0, ans=0.125 2023-10-11 13:44:26,525 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=725270.0, ans=0.0 2023-10-11 13:44:44,317 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.75 vs. 
limit=10.0 2023-10-11 13:44:48,248 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=725410.0, ans=0.125 2023-10-11 13:44:50,688 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.368e+02 1.675e+02 1.851e+02 2.061e+02 2.575e+02, threshold=3.702e+02, percent-clipped=0.0 2023-10-11 13:45:15,066 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.90 vs. limit=15.0 2023-10-11 13:45:18,170 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.04 vs. limit=15.0 2023-10-11 13:45:31,083 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=725550.0, ans=0.0 2023-10-11 13:45:34,019 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=725550.0, ans=0.125 2023-10-11 13:45:40,762 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=725596.6666666666, ans=0.125 2023-10-11 13:45:43,221 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=725596.6666666666, ans=0.125 2023-10-11 13:46:06,077 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 13:46:25,765 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=725783.3333333334, ans=0.125 2023-10-11 13:46:52,491 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.368e+02 1.709e+02 1.890e+02 2.124e+02 3.310e+02, threshold=3.780e+02, percent-clipped=0.0 2023-10-11 13:46:52,837 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=725876.6666666666, ans=0.0 2023-10-11 13:46:54,814 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=725876.6666666666, ans=0.125 2023-10-11 13:47:21,753 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=725970.0, ans=0.125 2023-10-11 13:47:32,804 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=726016.6666666666, ans=0.125 2023-10-11 13:47:45,853 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.89 vs. limit=6.0 2023-10-11 13:47:58,868 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=726156.6666666666, ans=0.2 2023-10-11 13:48:12,679 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=726203.3333333334, ans=0.125 2023-10-11 13:48:35,831 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=726296.6666666666, ans=0.125 2023-10-11 13:48:39,325 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.37 vs. 
limit=22.5 2023-10-11 13:48:49,486 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.437e+02 1.716e+02 1.864e+02 2.153e+02 2.879e+02, threshold=3.729e+02, percent-clipped=0.0 2023-10-11 13:49:06,559 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=726390.0, ans=0.125 2023-10-11 13:49:12,619 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=3.75 vs. limit=15.0 2023-10-11 13:49:23,224 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=726483.3333333334, ans=0.1 2023-10-11 13:49:27,445 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=726483.3333333334, ans=0.125 2023-10-11 13:49:28,320 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=726483.3333333334, ans=0.0 2023-10-11 13:49:33,651 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.19 vs. limit=12.0 2023-10-11 13:49:36,637 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=726530.0, ans=0.125 2023-10-11 13:49:48,644 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=726576.6666666666, ans=0.125 2023-10-11 13:49:53,818 INFO [train.py:1031] (3/4) Epoch 12, batch 5500, loss[loss=0.1833, simple_loss=0.2761, pruned_loss=0.04522, over 16998.00 frames. ], tot_loss[loss=0.2014, simple_loss=0.29, pruned_loss=0.05644, over 30724833.06 frames. ], batch size: 93, lr: 2.97e-03, grad_scale: 32.0 2023-10-11 13:49:54,912 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=726623.3333333334, ans=0.2 2023-10-11 13:50:02,941 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=726623.3333333334, ans=0.125 2023-10-11 13:50:03,014 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=726623.3333333334, ans=0.125 2023-10-11 13:50:08,169 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=726670.0, ans=0.125 2023-10-11 13:50:09,930 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=726670.0, ans=0.035 2023-10-11 13:50:20,184 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.45 vs. 
limit=6.0 2023-10-11 13:50:38,724 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=726810.0, ans=0.125 2023-10-11 13:50:39,348 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.303e+02 1.627e+02 1.796e+02 2.033e+02 2.468e+02, threshold=3.593e+02, percent-clipped=0.0 2023-10-11 13:50:42,013 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=726810.0, ans=0.95 2023-10-11 13:50:42,878 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=726810.0, ans=0.1 2023-10-11 13:50:45,138 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn1.whiten.whitening_limit, batch_count=726810.0, ans=22.5 2023-10-11 13:50:48,280 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=726856.6666666666, ans=0.0 2023-10-11 13:50:49,163 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=726856.6666666666, ans=0.125 2023-10-11 13:50:49,588 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.68 vs. limit=15.0 2023-10-11 13:50:50,686 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=726856.6666666666, ans=0.125 2023-10-11 13:50:59,908 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=726903.3333333334, ans=0.125 2023-10-11 13:51:02,522 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=726903.3333333334, ans=0.1 2023-10-11 13:51:04,396 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 13:51:15,798 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=726950.0, ans=0.125 2023-10-11 13:51:21,245 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=726996.6666666666, ans=0.2 2023-10-11 13:51:27,019 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.39 vs. limit=15.0 2023-10-11 13:51:57,113 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.63 vs. limit=6.0 2023-10-11 13:52:04,297 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.41 vs. 
limit=10.0 2023-10-11 13:52:27,496 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=727230.0, ans=0.125 2023-10-11 13:52:32,204 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.509e+02 1.736e+02 1.937e+02 2.198e+02 4.352e+02, threshold=3.873e+02, percent-clipped=2.0 2023-10-11 13:52:41,596 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=727323.3333333334, ans=0.0 2023-10-11 13:52:43,008 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=9.95 vs. limit=22.5 2023-10-11 13:53:21,589 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.14 vs. limit=15.0 2023-10-11 13:53:28,016 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.64 vs. limit=6.0 2023-10-11 13:54:03,841 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 13:54:05,763 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=727650.0, ans=0.0 2023-10-11 13:54:07,815 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=727650.0, ans=0.1 2023-10-11 13:54:19,084 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 13:54:19,413 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.86 vs. limit=15.0 2023-10-11 13:54:27,671 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.381e+02 1.697e+02 1.846e+02 2.049e+02 3.130e+02, threshold=3.692e+02, percent-clipped=0.0 2023-10-11 13:55:14,046 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=727930.0, ans=0.1 2023-10-11 13:55:15,213 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=727930.0, ans=0.125 2023-10-11 13:55:21,200 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=727930.0, ans=0.1 2023-10-11 13:55:29,309 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=727976.6666666666, ans=0.125 2023-10-11 13:56:13,789 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=728163.3333333334, ans=0.125 2023-10-11 13:56:22,044 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 13:56:23,513 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.68 vs. 
limit=15.0 2023-10-11 13:56:23,861 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.477e+02 1.782e+02 1.966e+02 2.200e+02 3.503e+02, threshold=3.933e+02, percent-clipped=0.0 2023-10-11 13:56:29,726 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.30 vs. limit=22.5 2023-10-11 13:56:39,594 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=728256.6666666666, ans=0.125 2023-10-11 13:56:41,010 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.66 vs. limit=6.0 2023-10-11 13:56:45,147 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=728303.3333333334, ans=0.125 2023-10-11 13:56:48,327 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 13:56:51,058 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=728303.3333333334, ans=0.125 2023-10-11 13:57:14,953 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=728396.6666666666, ans=0.2 2023-10-11 13:57:17,215 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=728443.3333333334, ans=0.125 2023-10-11 13:57:34,248 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=728490.0, ans=0.2 2023-10-11 13:58:20,212 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.303e+02 1.595e+02 1.778e+02 1.991e+02 2.737e+02, threshold=3.556e+02, percent-clipped=0.0 2023-10-11 13:58:30,890 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=728723.3333333334, ans=0.1 2023-10-11 13:58:33,684 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=728723.3333333334, ans=0.1 2023-10-11 13:58:45,038 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=728770.0, ans=0.0 2023-10-11 13:58:52,552 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.89 vs. limit=12.0 2023-10-11 13:59:00,727 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=728863.3333333334, ans=0.125 2023-10-11 13:59:04,058 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=728863.3333333334, ans=0.125 2023-10-11 13:59:08,634 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=728863.3333333334, ans=0.125 2023-10-11 13:59:14,134 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten.whitening_limit, batch_count=728910.0, ans=15.0 2023-10-11 13:59:16,618 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.81 vs. 
limit=6.0 2023-10-11 13:59:22,967 INFO [train.py:1031] (3/4) Epoch 12, batch 6000, loss[loss=0.2368, simple_loss=0.31, pruned_loss=0.0818, over 16031.00 frames. ], tot_loss[loss=0.202, simple_loss=0.2904, pruned_loss=0.05674, over 31183852.53 frames. ], batch size: 296, lr: 2.97e-03, grad_scale: 32.0 2023-10-11 13:59:40,921 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=729003.3333333334, ans=0.125 2023-10-11 13:59:49,566 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.23 vs. limit=6.0 2023-10-11 13:59:52,582 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=729050.0, ans=0.0 2023-10-11 14:00:12,146 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.430e+02 1.865e+02 2.197e+02 2.522e+02 3.719e+02, threshold=4.393e+02, percent-clipped=1.0 2023-10-11 14:00:58,822 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.08 vs. limit=15.0 2023-10-11 14:01:05,189 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=729376.6666666666, ans=0.0 2023-10-11 14:01:24,647 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=729423.3333333334, ans=0.0 2023-10-11 14:01:55,847 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=729563.3333333334, ans=0.0 2023-10-11 14:02:04,152 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.315e+02 1.709e+02 1.866e+02 2.170e+02 3.352e+02, threshold=3.732e+02, percent-clipped=0.0 2023-10-11 14:02:07,523 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.45 vs. limit=15.0 2023-10-11 14:02:18,635 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=729656.6666666666, ans=0.05 2023-10-11 14:02:45,544 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 14:02:49,302 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=729796.6666666666, ans=0.2 2023-10-11 14:03:35,407 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=729983.3333333334, ans=0.1 2023-10-11 14:03:57,546 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.487e+02 1.754e+02 2.000e+02 2.274e+02 3.192e+02, threshold=4.001e+02, percent-clipped=0.0 2023-10-11 14:04:22,305 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=730170.0, ans=0.125 2023-10-11 14:04:35,458 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=730216.6666666666, ans=0.2 2023-10-11 14:04:36,835 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.69 vs. 
limit=6.0 2023-10-11 14:05:13,889 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=730403.3333333334, ans=0.1 2023-10-11 14:05:15,842 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=730403.3333333334, ans=0.04949747468305833 2023-10-11 14:05:17,973 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=730403.3333333334, ans=0.0 2023-10-11 14:05:52,574 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.470e+02 1.833e+02 1.980e+02 2.227e+02 3.223e+02, threshold=3.959e+02, percent-clipped=0.0 2023-10-11 14:06:01,185 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.34 vs. limit=15.0 2023-10-11 14:06:04,503 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=730590.0, ans=0.125 2023-10-11 14:06:20,244 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=730636.6666666666, ans=0.125 2023-10-11 14:06:27,368 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=730636.6666666666, ans=0.0 2023-10-11 14:06:36,321 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=730683.3333333334, ans=0.0 2023-10-11 14:06:39,277 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=730683.3333333334, ans=0.1 2023-10-11 14:06:40,088 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=730683.3333333334, ans=0.125 2023-10-11 14:06:42,948 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=730730.0, ans=0.1 2023-10-11 14:06:49,776 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.94 vs. 
limit=15.0 2023-10-11 14:06:51,191 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=730730.0, ans=0.0 2023-10-11 14:07:08,440 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=730823.3333333334, ans=0.125 2023-10-11 14:07:19,008 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=730870.0, ans=0.0 2023-10-11 14:07:23,244 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=730870.0, ans=0.1 2023-10-11 14:07:31,994 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=730916.6666666666, ans=0.0 2023-10-11 14:07:34,401 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=730916.6666666666, ans=0.125 2023-10-11 14:07:52,809 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.359e+02 1.607e+02 1.747e+02 1.965e+02 2.568e+02, threshold=3.494e+02, percent-clipped=0.0 2023-10-11 14:08:11,477 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.55 vs. limit=15.0 2023-10-11 14:08:12,411 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.36 vs. limit=15.0 2023-10-11 14:08:56,490 INFO [train.py:1031] (3/4) Epoch 12, batch 6500, loss[loss=0.1955, simple_loss=0.2904, pruned_loss=0.05032, over 16896.00 frames. ], tot_loss[loss=0.2025, simple_loss=0.291, pruned_loss=0.05702, over 31513928.23 frames. ], batch size: 104, lr: 2.96e-03, grad_scale: 32.0 2023-10-11 14:09:10,437 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=731336.6666666666, ans=0.1 2023-10-11 14:09:14,826 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.48 vs. 
limit=15.0 2023-10-11 14:09:30,235 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=731383.3333333334, ans=0.0 2023-10-11 14:09:55,759 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.390e+02 1.783e+02 1.995e+02 2.190e+02 3.130e+02, threshold=3.991e+02, percent-clipped=0.0 2023-10-11 14:10:00,104 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=731476.6666666666, ans=0.2 2023-10-11 14:10:21,244 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=731570.0, ans=0.0 2023-10-11 14:10:22,336 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=731570.0, ans=0.0 2023-10-11 14:10:25,799 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=731616.6666666666, ans=0.125 2023-10-11 14:10:25,813 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=731616.6666666666, ans=0.125 2023-10-11 14:10:28,513 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=731616.6666666666, ans=0.1 2023-10-11 14:10:48,171 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.19 vs. limit=6.0 2023-10-11 14:10:58,847 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=731756.6666666666, ans=0.07 2023-10-11 14:11:18,265 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=731803.3333333334, ans=0.125 2023-10-11 14:11:19,190 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=731803.3333333334, ans=0.2 2023-10-11 14:11:37,767 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=731896.6666666666, ans=0.0 2023-10-11 14:11:39,658 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=731896.6666666666, ans=0.1 2023-10-11 14:11:47,357 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.431e+02 1.706e+02 1.869e+02 2.128e+02 3.077e+02, threshold=3.738e+02, percent-clipped=0.0 2023-10-11 14:11:52,130 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=731943.3333333334, ans=0.95 2023-10-11 14:12:01,133 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=731990.0, ans=0.0 2023-10-11 14:12:33,159 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=732130.0, ans=0.0 2023-10-11 14:12:45,121 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.11 vs. limit=15.0 2023-10-11 14:13:23,434 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.48 vs. 
limit=22.5 2023-10-11 14:13:26,517 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=732363.3333333334, ans=0.09899494936611666 2023-10-11 14:13:35,332 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.366e+02 1.703e+02 1.919e+02 2.239e+02 3.995e+02, threshold=3.839e+02, percent-clipped=1.0 2023-10-11 14:13:51,957 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=732456.6666666666, ans=0.0 2023-10-11 14:14:04,067 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=732503.3333333334, ans=0.1 2023-10-11 14:14:10,355 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=732550.0, ans=0.1 2023-10-11 14:14:18,216 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=732596.6666666666, ans=0.2 2023-10-11 14:14:19,336 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=732596.6666666666, ans=0.125 2023-10-11 14:14:31,085 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=732643.3333333334, ans=0.125 2023-10-11 14:15:04,248 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=732736.6666666666, ans=0.05 2023-10-11 14:15:09,634 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.61 vs. limit=15.0 2023-10-11 14:15:12,895 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=732736.6666666666, ans=0.125 2023-10-11 14:15:18,357 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=732783.3333333334, ans=0.2 2023-10-11 14:15:19,479 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.27 vs. limit=15.0 2023-10-11 14:15:25,726 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=732830.0, ans=0.0 2023-10-11 14:15:28,108 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.75 vs. limit=12.0 2023-10-11 14:15:42,532 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.261e+02 1.617e+02 1.881e+02 2.174e+02 3.473e+02, threshold=3.761e+02, percent-clipped=0.0 2023-10-11 14:15:42,855 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=732876.6666666666, ans=0.125 2023-10-11 14:15:46,255 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=732876.6666666666, ans=0.1 2023-10-11 14:15:54,182 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=732923.3333333334, ans=0.125 2023-10-11 14:16:09,083 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.98 vs. 
limit=15.0 2023-10-11 14:16:25,911 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=733063.3333333334, ans=0.125 2023-10-11 14:16:35,270 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=733110.0, ans=0.025 2023-10-11 14:16:54,533 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.73 vs. limit=12.0 2023-10-11 14:17:26,810 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=733296.6666666666, ans=0.125 2023-10-11 14:17:34,793 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.55 vs. limit=12.0 2023-10-11 14:17:37,180 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.400e+02 1.679e+02 1.827e+02 2.286e+02 3.883e+02, threshold=3.654e+02, percent-clipped=1.0 2023-10-11 14:18:05,519 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=733483.3333333334, ans=0.125 2023-10-11 14:18:08,245 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=733483.3333333334, ans=0.0 2023-10-11 14:18:35,383 INFO [train.py:1031] (3/4) Epoch 12, batch 7000, loss[loss=0.1766, simple_loss=0.2767, pruned_loss=0.0383, over 16878.00 frames. ], tot_loss[loss=0.2025, simple_loss=0.2914, pruned_loss=0.05681, over 31804069.52 frames. ], batch size: 104, lr: 2.96e-03, grad_scale: 32.0 2023-10-11 14:18:51,390 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=733670.0, ans=0.0 2023-10-11 14:19:06,528 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.33 vs. limit=15.0 2023-10-11 14:19:26,614 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.437e+02 1.756e+02 1.979e+02 2.194e+02 3.652e+02, threshold=3.958e+02, percent-clipped=1.0 2023-10-11 14:19:34,108 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=733856.6666666666, ans=0.125 2023-10-11 14:19:48,968 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=733903.3333333334, ans=0.125 2023-10-11 14:19:49,107 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=733903.3333333334, ans=0.0 2023-10-11 14:19:53,885 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.04 vs. 
limit=15.0 2023-10-11 14:19:57,507 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=733950.0, ans=0.2 2023-10-11 14:20:18,276 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=734043.3333333334, ans=0.125 2023-10-11 14:20:58,825 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=734230.0, ans=0.0 2023-10-11 14:21:10,724 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=734276.6666666666, ans=0.5 2023-10-11 14:21:14,226 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=734276.6666666666, ans=0.1 2023-10-11 14:21:15,502 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.411e+02 1.703e+02 1.906e+02 2.130e+02 2.959e+02, threshold=3.812e+02, percent-clipped=0.0 2023-10-11 14:21:17,437 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=734276.6666666666, ans=0.0 2023-10-11 14:21:22,297 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=734323.3333333334, ans=0.125 2023-10-11 14:21:58,331 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=734463.3333333334, ans=0.0 2023-10-11 14:22:00,588 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=734463.3333333334, ans=0.1 2023-10-11 14:22:06,767 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=734510.0, ans=0.125 2023-10-11 14:22:17,205 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=734556.6666666666, ans=0.0 2023-10-11 14:23:13,106 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=734743.3333333334, ans=0.1 2023-10-11 14:23:17,339 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.438e+02 1.732e+02 1.991e+02 2.166e+02 2.951e+02, threshold=3.983e+02, percent-clipped=0.0 2023-10-11 14:23:18,930 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=734743.3333333334, ans=0.0 2023-10-11 14:23:22,830 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=734743.3333333334, ans=0.1 2023-10-11 14:23:34,376 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=734836.6666666666, ans=0.0 2023-10-11 14:23:41,057 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=734836.6666666666, ans=0.1 2023-10-11 14:23:41,797 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=734836.6666666666, ans=0.125 2023-10-11 14:23:48,851 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=734883.3333333334, ans=0.1 2023-10-11 14:23:56,793 INFO [scaling.py:199] (3/4) 
ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=734883.3333333334, ans=0.0 2023-10-11 14:24:11,369 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=734976.6666666666, ans=0.1 2023-10-11 14:24:32,717 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.67 vs. limit=22.5 2023-10-11 14:24:36,251 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=735070.0, ans=0.07 2023-10-11 14:24:41,064 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.79 vs. limit=15.0 2023-10-11 14:25:13,486 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=735210.0, ans=0.125 2023-10-11 14:25:15,712 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.461e+02 1.630e+02 1.776e+02 2.076e+02 3.236e+02, threshold=3.552e+02, percent-clipped=0.0 2023-10-11 14:25:16,387 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=735210.0, ans=0.0 2023-10-11 14:25:25,023 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=735256.6666666666, ans=0.0 2023-10-11 14:25:27,931 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=735256.6666666666, ans=0.125 2023-10-11 14:25:35,171 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=735303.3333333334, ans=0.125 2023-10-11 14:25:43,751 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten.whitening_limit, batch_count=735303.3333333334, ans=15.0 2023-10-11 14:25:57,119 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=735396.6666666666, ans=0.125 2023-10-11 14:26:10,789 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=735443.3333333334, ans=0.1 2023-10-11 14:26:14,889 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=735443.3333333334, ans=0.125 2023-10-11 14:26:29,419 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.85 vs. limit=15.0 2023-10-11 14:26:46,736 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=735583.3333333334, ans=0.1 2023-10-11 14:26:50,155 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.08 vs. limit=15.0 2023-10-11 14:26:58,028 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.69 vs. 
limit=15.0 2023-10-11 14:27:08,464 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.420e+02 1.858e+02 2.131e+02 2.393e+02 3.396e+02, threshold=4.263e+02, percent-clipped=0.0 2023-10-11 14:27:35,649 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=735816.6666666666, ans=0.2 2023-10-11 14:27:45,341 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=735816.6666666666, ans=0.2 2023-10-11 14:28:03,188 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=735910.0, ans=0.0 2023-10-11 14:28:03,294 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=735910.0, ans=0.125 2023-10-11 14:28:06,965 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=735910.0, ans=0.2 2023-10-11 14:28:10,588 INFO [train.py:1031] (3/4) Epoch 12, batch 7500, loss[loss=0.1954, simple_loss=0.2651, pruned_loss=0.06286, over 12444.00 frames. ], tot_loss[loss=0.2025, simple_loss=0.2913, pruned_loss=0.05684, over 32040021.17 frames. ], batch size: 440, lr: 2.95e-03, grad_scale: 32.0 2023-10-11 14:28:19,957 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=735956.6666666666, ans=0.0 2023-10-11 14:28:43,611 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=736096.6666666666, ans=0.1 2023-10-11 14:28:48,338 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=736096.6666666666, ans=0.0 2023-10-11 14:28:50,466 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=736096.6666666666, ans=0.125 2023-10-11 14:28:56,878 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=736143.3333333334, ans=0.125 2023-10-11 14:28:59,348 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.290e+02 1.764e+02 1.964e+02 2.248e+02 3.194e+02, threshold=3.927e+02, percent-clipped=0.0 2023-10-11 14:29:08,067 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.35 vs. 
limit=15.0 2023-10-11 14:29:15,638 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=736190.0, ans=0.1 2023-10-11 14:29:26,014 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=736236.6666666666, ans=0.2 2023-10-11 14:29:28,856 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=736283.3333333334, ans=0.95 2023-10-11 14:29:43,698 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=736330.0, ans=0.125 2023-10-11 14:29:47,295 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=736330.0, ans=0.0 2023-10-11 14:29:58,712 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=736376.6666666666, ans=0.0 2023-10-11 14:30:03,679 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=736423.3333333334, ans=0.1 2023-10-11 14:30:17,586 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=736470.0, ans=0.125 2023-10-11 14:30:22,702 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=736470.0, ans=0.125 2023-10-11 14:30:39,265 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=736563.3333333334, ans=0.0 2023-10-11 14:30:39,368 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=736563.3333333334, ans=0.125 2023-10-11 14:30:52,722 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=736563.3333333334, ans=0.0 2023-10-11 14:30:52,842 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=736563.3333333334, ans=0.04949747468305833 2023-10-11 14:31:02,993 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.350e+02 1.689e+02 1.866e+02 2.163e+02 2.917e+02, threshold=3.732e+02, percent-clipped=0.0 2023-10-11 14:31:05,284 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.88 vs. limit=12.0 2023-10-11 14:31:16,472 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=736656.6666666666, ans=0.125 2023-10-11 14:31:27,565 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=736703.3333333334, ans=0.1 2023-10-11 14:31:28,782 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.75 vs. limit=6.0 2023-10-11 14:31:39,880 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 14:31:51,927 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.30 vs. 
limit=15.0 2023-10-11 14:31:57,505 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=736843.3333333334, ans=0.125 2023-10-11 14:32:18,058 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=736936.6666666666, ans=0.125 2023-10-11 14:32:47,895 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.02 vs. limit=15.0 2023-10-11 14:32:48,767 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=737076.6666666666, ans=0.07 2023-10-11 14:32:49,712 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=737076.6666666666, ans=0.125 2023-10-11 14:32:54,665 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.391e+02 1.666e+02 1.818e+02 2.033e+02 2.567e+02, threshold=3.636e+02, percent-clipped=0.0 2023-10-11 14:33:02,423 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=737123.3333333334, ans=0.0 2023-10-11 14:33:04,385 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.41 vs. limit=15.0 2023-10-11 14:33:11,473 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=737170.0, ans=0.0 2023-10-11 14:33:12,997 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.56 vs. limit=22.5 2023-10-11 14:33:20,573 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=737170.0, ans=0.125 2023-10-11 14:33:31,688 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=737216.6666666666, ans=0.0 2023-10-11 14:33:36,313 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=737263.3333333334, ans=0.2 2023-10-11 14:33:39,649 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=737263.3333333334, ans=0.2 2023-10-11 14:33:49,533 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=737310.0, ans=0.125 2023-10-11 14:33:54,289 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=737310.0, ans=0.2 2023-10-11 14:33:58,023 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=737356.6666666666, ans=0.0 2023-10-11 14:34:00,398 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=737356.6666666666, ans=0.125 2023-10-11 14:34:04,358 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=11.92 vs. 
limit=15.0 2023-10-11 14:34:12,279 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=737403.3333333334, ans=0.1 2023-10-11 14:34:33,120 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.75 vs. limit=15.0 2023-10-11 14:34:53,556 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.419e+02 1.740e+02 1.906e+02 2.061e+02 2.765e+02, threshold=3.813e+02, percent-clipped=0.0 2023-10-11 14:34:59,940 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=737590.0, ans=0.2 2023-10-11 14:35:02,557 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=737590.0, ans=0.1 2023-10-11 14:35:03,824 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=737590.0, ans=0.125 2023-10-11 14:35:18,033 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=737636.6666666666, ans=0.0 2023-10-11 14:35:40,589 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.96 vs. limit=12.0 2023-10-11 14:35:49,139 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.58 vs. limit=22.5 2023-10-11 14:35:50,815 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=737776.6666666666, ans=0.0 2023-10-11 14:35:55,698 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 14:36:33,454 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.04 vs. limit=15.0 2023-10-11 14:36:46,305 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.459e+02 1.635e+02 1.890e+02 2.195e+02 3.054e+02, threshold=3.779e+02, percent-clipped=0.0 2023-10-11 14:36:48,916 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=738010.0, ans=0.0 2023-10-11 14:36:52,074 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=738056.6666666666, ans=0.125 2023-10-11 14:37:02,893 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=738056.6666666666, ans=0.125 2023-10-11 14:37:17,677 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=738150.0, ans=0.0 2023-10-11 14:37:22,010 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=738150.0, ans=0.125 2023-10-11 14:37:46,297 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=738243.3333333334, ans=0.1 2023-10-11 14:37:51,590 INFO [train.py:1031] (3/4) Epoch 12, batch 8000, loss[loss=0.2268, simple_loss=0.2942, pruned_loss=0.07971, over 15632.00 frames. ], tot_loss[loss=0.2014, simple_loss=0.2905, pruned_loss=0.05616, over 32205205.25 frames. 
], batch size: 350, lr: 2.95e-03, grad_scale: 32.0 2023-10-11 14:38:12,400 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=738336.6666666666, ans=0.125 2023-10-11 14:38:29,352 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.07 vs. limit=6.0 2023-10-11 14:38:31,984 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 14:38:39,758 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.350e+02 1.619e+02 1.745e+02 2.050e+02 3.654e+02, threshold=3.490e+02, percent-clipped=0.0 2023-10-11 14:38:55,111 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.83 vs. limit=15.0 2023-10-11 14:38:56,760 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=738570.0, ans=0.125 2023-10-11 14:38:56,998 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.47 vs. limit=15.0 2023-10-11 14:39:18,839 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.22 vs. limit=22.5 2023-10-11 14:40:06,771 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=738850.0, ans=0.125 2023-10-11 14:40:23,129 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=738943.3333333334, ans=0.125 2023-10-11 14:40:24,998 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.416e+02 1.690e+02 1.907e+02 2.170e+02 3.150e+02, threshold=3.814e+02, percent-clipped=0.0 2023-10-11 14:40:33,725 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=738990.0, ans=0.1 2023-10-11 14:40:37,914 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.23 vs. 
limit=15.0 2023-10-11 14:40:39,329 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=738990.0, ans=0.125 2023-10-11 14:41:22,468 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=739130.0, ans=0.125 2023-10-11 14:41:43,835 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=739176.6666666666, ans=0.125 2023-10-11 14:42:17,989 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=739316.6666666666, ans=0.0 2023-10-11 14:42:25,926 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=739363.3333333334, ans=0.0 2023-10-11 14:42:39,737 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.328e+02 1.688e+02 1.880e+02 2.176e+02 3.288e+02, threshold=3.761e+02, percent-clipped=0.0 2023-10-11 14:42:40,031 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 14:43:00,868 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=739503.3333333334, ans=0.125 2023-10-11 14:43:04,312 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=739503.3333333334, ans=0.0 2023-10-11 14:43:13,509 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=6.49 vs. limit=15.0 2023-10-11 14:43:13,584 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.29 vs. limit=22.5 2023-10-11 14:43:20,330 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=739596.6666666666, ans=0.2 2023-10-11 14:43:38,737 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=739643.3333333334, ans=0.0 2023-10-11 14:43:38,924 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.73 vs. 
limit=15.0 2023-10-11 14:43:52,893 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=739690.0, ans=0.2 2023-10-11 14:43:55,479 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=739736.6666666666, ans=0.125 2023-10-11 14:44:22,902 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=739830.0, ans=0.125 2023-10-11 14:44:25,639 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=739830.0, ans=0.0 2023-10-11 14:44:32,997 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.440e+02 1.791e+02 2.007e+02 2.336e+02 3.042e+02, threshold=4.015e+02, percent-clipped=0.0 2023-10-11 14:44:33,394 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=739876.6666666666, ans=0.125 2023-10-11 14:44:53,656 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=739970.0, ans=0.125 2023-10-11 14:45:03,199 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=740016.6666666666, ans=0.125 2023-10-11 14:45:44,476 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.87 vs. limit=6.0 2023-10-11 14:45:50,239 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=740156.6666666666, ans=0.2 2023-10-11 14:45:51,112 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=740203.3333333334, ans=0.125 2023-10-11 14:46:02,779 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=740250.0, ans=0.0 2023-10-11 14:46:28,264 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 14:46:32,950 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.418e+02 1.682e+02 1.826e+02 2.033e+02 3.552e+02, threshold=3.652e+02, percent-clipped=0.0 2023-10-11 14:46:47,703 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=740390.0, ans=0.05 2023-10-11 14:46:48,712 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=740390.0, ans=0.125 2023-10-11 14:46:48,860 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=740390.0, ans=0.0 2023-10-11 14:47:05,360 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=740483.3333333334, ans=0.125 2023-10-11 14:47:31,891 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=740576.6666666666, ans=0.0 2023-10-11 14:47:37,901 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=740576.6666666666, ans=0.2 2023-10-11 14:47:39,721 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=740623.3333333334, ans=0.125 2023-10-11 
14:47:40,948 INFO [train.py:1031] (3/4) Epoch 12, batch 8500, loss[loss=0.2098, simple_loss=0.2984, pruned_loss=0.06062, over 16918.00 frames. ], tot_loss[loss=0.2016, simple_loss=0.2909, pruned_loss=0.05618, over 32353236.41 frames. ], batch size: 77, lr: 2.94e-03, grad_scale: 32.0 2023-10-11 14:48:26,699 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 14:48:30,138 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=740810.0, ans=0.125 2023-10-11 14:48:34,287 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.452e+02 1.746e+02 1.916e+02 2.116e+02 2.668e+02, threshold=3.832e+02, percent-clipped=0.0 2023-10-11 14:48:39,689 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=740856.6666666666, ans=0.125 2023-10-11 14:49:10,566 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=740950.0, ans=0.5 2023-10-11 14:49:18,051 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=740996.6666666666, ans=0.07 2023-10-11 14:49:18,984 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=740996.6666666666, ans=0.0 2023-10-11 14:49:25,588 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=740996.6666666666, ans=0.125 2023-10-11 14:49:54,723 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=741090.0, ans=0.0 2023-10-11 14:49:54,863 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.17 vs. limit=15.0 2023-10-11 14:50:00,893 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=741136.6666666666, ans=0.125 2023-10-11 14:50:01,824 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 14:50:15,639 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=741183.3333333334, ans=0.2 2023-10-11 14:50:28,189 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=741276.6666666666, ans=0.0 2023-10-11 14:50:28,318 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.36 vs. limit=10.0 2023-10-11 14:50:35,433 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.314e+02 1.665e+02 1.846e+02 2.084e+02 3.076e+02, threshold=3.692e+02, percent-clipped=0.0 2023-10-11 14:51:38,247 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=7.39 vs. limit=15.0 2023-10-11 14:51:39,205 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.68 vs. 
limit=22.5 2023-10-11 14:51:51,618 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=741556.6666666666, ans=0.125 2023-10-11 14:51:58,438 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=741603.3333333334, ans=0.0 2023-10-11 14:52:05,474 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=741603.3333333334, ans=0.125 2023-10-11 14:52:13,909 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=741650.0, ans=0.2 2023-10-11 14:52:17,497 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=741650.0, ans=0.0 2023-10-11 14:52:31,939 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=741743.3333333334, ans=0.0 2023-10-11 14:52:34,267 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=741743.3333333334, ans=0.125 2023-10-11 14:52:38,233 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.78 vs. limit=6.0 2023-10-11 14:52:38,614 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.334e+02 1.617e+02 1.789e+02 2.064e+02 3.014e+02, threshold=3.579e+02, percent-clipped=0.0 2023-10-11 14:52:41,000 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=741743.3333333334, ans=0.1 2023-10-11 14:52:44,077 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.03 vs. 
limit=15.0 2023-10-11 14:53:11,872 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=741883.3333333334, ans=0.0 2023-10-11 14:53:19,102 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=741883.3333333334, ans=0.125 2023-10-11 14:53:20,992 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=741930.0, ans=0.5 2023-10-11 14:54:32,776 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.350e+02 1.728e+02 1.948e+02 2.312e+02 3.393e+02, threshold=3.895e+02, percent-clipped=0.0 2023-10-11 14:54:33,983 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=742210.0, ans=0.125 2023-10-11 14:54:36,087 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=742210.0, ans=0.125 2023-10-11 14:54:36,142 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=742210.0, ans=0.2 2023-10-11 14:54:43,538 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=742256.6666666666, ans=0.125 2023-10-11 14:54:45,346 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=742256.6666666666, ans=0.125 2023-10-11 14:54:50,670 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=742303.3333333334, ans=0.5 2023-10-11 14:54:59,294 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.79 vs. limit=6.0 2023-10-11 14:55:37,894 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=742490.0, ans=0.0 2023-10-11 14:55:38,736 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=742490.0, ans=0.0 2023-10-11 14:56:00,093 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=742583.3333333334, ans=0.125 2023-10-11 14:56:12,034 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.16 vs. limit=10.0 2023-10-11 14:56:16,161 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=742676.6666666666, ans=0.2 2023-10-11 14:56:21,806 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.474e+02 1.680e+02 1.806e+02 2.017e+02 3.495e+02, threshold=3.612e+02, percent-clipped=0.0 2023-10-11 14:56:32,251 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=742723.3333333334, ans=0.125 2023-10-11 14:56:49,730 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.42 vs. 
limit=15.0 2023-10-11 14:56:50,561 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=742816.6666666666, ans=0.2 2023-10-11 14:56:53,348 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=742816.6666666666, ans=0.1 2023-10-11 14:56:54,433 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=742816.6666666666, ans=0.125 2023-10-11 14:57:04,842 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=742863.3333333334, ans=0.125 2023-10-11 14:57:09,094 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 14:57:22,743 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=742956.6666666666, ans=0.125 2023-10-11 14:57:23,302 INFO [train.py:1031] (3/4) Epoch 12, batch 9000, loss[loss=0.1768, simple_loss=0.2816, pruned_loss=0.03606, over 16887.00 frames. ], tot_loss[loss=0.2011, simple_loss=0.2903, pruned_loss=0.05592, over 32461115.27 frames. ], batch size: 104, lr: 2.94e-03, grad_scale: 32.0 2023-10-11 14:57:23,489 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=742956.6666666666, ans=0.1 2023-10-11 14:57:24,681 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=742956.6666666666, ans=0.125 2023-10-11 14:57:40,487 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=743003.3333333334, ans=0.125 2023-10-11 14:57:42,574 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=743003.3333333334, ans=0.1 2023-10-11 14:57:49,516 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=743050.0, ans=0.125 2023-10-11 14:57:58,204 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=743096.6666666666, ans=0.2 2023-10-11 14:58:06,914 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=743143.3333333334, ans=0.2 2023-10-11 14:58:12,674 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.329e+02 1.730e+02 1.950e+02 2.117e+02 2.843e+02, threshold=3.900e+02, percent-clipped=0.0 2023-10-11 14:58:16,153 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=743143.3333333334, ans=0.1 2023-10-11 14:58:27,757 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=743190.0, ans=0.125 2023-10-11 14:58:43,099 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=743283.3333333334, ans=0.1 2023-10-11 14:59:07,858 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=743376.6666666666, ans=0.1 2023-10-11 14:59:08,654 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=743376.6666666666, ans=0.07 
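A note on the ScheduledFloat records that dominate this log: each one reports a regularization hyperparameter (dropout probabilities, balancer probabilities, bypass scale/skip rates, whitening limits) whose value is looked up from a piecewise-linear schedule keyed on batch_count, with the value currently in effect printed as 'ans='. Below is a minimal sketch of that kind of schedule, assuming breakpoints supplied as (batch_count, value) pairs; the class name and the example breakpoints are illustrative only, not the actual ScheduledFloat in icefall's scaling.py.

    import bisect

    class PiecewiseLinearFloat:
        """Toy piecewise-linear schedule over batch_count (breakpoints sorted)."""
        def __init__(self, *points):
            self.xs = [x for x, _ in points]
            self.ys = [y for _, y in points]

        def value_at(self, batch_count: float) -> float:
            i = bisect.bisect_right(self.xs, batch_count)
            if i == 0:
                return self.ys[0]       # before the first breakpoint
            if i == len(self.xs):
                return self.ys[-1]      # past the last breakpoint
            x0, x1 = self.xs[i - 1], self.xs[i]
            y0, y1 = self.ys[i - 1], self.ys[i]
            return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

    # Hypothetical schedule: dropout 0.3 at the start, annealed to 0.1 by batch 20k.
    dropout_p = PiecewiseLinearFloat((0, 0.3), (20000, 0.1))
    print(dropout_p.value_at(743376.0))  # -> 0.1, far past the last breakpoint

By batch_count ~740k every schedule in this run would be well past its final breakpoint, which is consistent with the logged 'ans=' values sitting at fixed constants (0.0, 0.1, 0.125, 0.2, and so on).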
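The optim.py records ('Clipping_scale=2.0, grad-norm quartiles ... threshold=...') summarize adaptive gradient clipping. The five numbers appear to be min/25%/median/75%/max of recent overall gradient norms, and in every entry in this section the threshold is Clipping_scale times the median, up to display rounding (for instance 2.0 * 1.866e+02 = 3.732e+02); percent-clipped=0.0 says no batch in the window actually exceeded that threshold. A hedged reconstruction of that bookkeeping follows; the window size and function name are guesses, not ScaledAdam's real code.

    import torch

    def clip_by_median_norm(params, norm_history, clipping_scale=2.0, window=128):
        # Overall gradient norm for this batch.
        grads = [p.grad for p in params if p.grad is not None]
        norm = torch.sqrt(sum((g ** 2).sum() for g in grads)).item()
        norm_history.append(norm)
        recent = torch.tensor(norm_history[-window:])
        quartiles = torch.quantile(recent, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = clipping_scale * quartiles[2].item()  # threshold = scale * median
        if norm > threshold:                              # what percent-clipped counts
            for g in grads:
                g.mul_(threshold / norm)
        return quartiles.tolist(), threshold

    p = torch.nn.Parameter(torch.randn(10))
    p.grad = torch.randn(10)
    history = []
    print(clip_by_median_norm([p], history))  # quartiles collapse to one value at first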
2023-10-11 14:59:20,660 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=743423.3333333334, ans=10.0 2023-10-11 14:59:20,767 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.35 vs. limit=22.5 2023-10-11 14:59:26,977 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=743470.0, ans=0.125 2023-10-11 14:59:28,742 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=743470.0, ans=0.1 2023-10-11 14:59:34,082 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=743470.0, ans=0.125 2023-10-11 14:59:37,680 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.96 vs. limit=15.0 2023-10-11 15:00:02,855 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.279e+02 1.674e+02 1.926e+02 2.088e+02 3.271e+02, threshold=3.852e+02, percent-clipped=0.0 2023-10-11 15:00:05,695 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=743610.0, ans=0.1 2023-10-11 15:00:29,041 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=743750.0, ans=0.125 2023-10-11 15:00:29,937 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=743750.0, ans=0.0 2023-10-11 15:00:37,465 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=743796.6666666666, ans=0.125 2023-10-11 15:00:53,227 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-11 15:01:08,386 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=743936.6666666666, ans=0.2 2023-10-11 15:01:10,279 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.21 vs. 
limit=6.0 2023-10-11 15:01:10,872 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=743936.6666666666, ans=0.2 2023-10-11 15:01:13,217 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=743936.6666666666, ans=0.0 2023-10-11 15:01:13,318 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=743936.6666666666, ans=0.125 2023-10-11 15:01:44,516 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=744076.6666666666, ans=0.125 2023-10-11 15:01:46,119 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.347e+02 1.709e+02 1.889e+02 2.081e+02 3.061e+02, threshold=3.779e+02, percent-clipped=0.0 2023-10-11 15:01:48,319 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 15:01:54,365 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.43 vs. limit=15.0 2023-10-11 15:01:55,612 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=744123.3333333334, ans=0.125 2023-10-11 15:01:57,685 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.41 vs. limit=6.0 2023-10-11 15:02:35,387 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=744310.0, ans=0.0 2023-10-11 15:02:48,868 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=744356.6666666666, ans=0.125 2023-10-11 15:03:29,362 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=744496.6666666666, ans=0.2 2023-10-11 15:03:30,997 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=744496.6666666666, ans=0.0 2023-10-11 15:03:31,999 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=744496.6666666666, ans=0.125 2023-10-11 15:03:41,097 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.496e+02 1.816e+02 2.254e+02 2.473e+02 3.370e+02, threshold=4.509e+02, percent-clipped=0.0 2023-10-11 15:04:10,811 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=744636.6666666666, ans=0.125 2023-10-11 15:04:20,291 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=744683.3333333334, ans=0.125 2023-10-11 15:04:29,653 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=744730.0, ans=0.1 2023-10-11 15:04:35,626 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=744730.0, ans=0.125 2023-10-11 15:04:43,808 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=744776.6666666666, ans=0.1 2023-10-11 15:04:51,668 INFO [scaling.py:199] (3/4) 
ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=744823.3333333334, ans=0.125 2023-10-11 15:05:11,565 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.14 vs. limit=15.0 2023-10-11 15:05:13,233 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=744916.6666666666, ans=0.0 2023-10-11 15:05:27,250 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=744963.3333333334, ans=0.0 2023-10-11 15:05:39,073 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.20 vs. limit=22.5 2023-10-11 15:05:40,594 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.213e+02 1.697e+02 1.864e+02 2.071e+02 3.050e+02, threshold=3.728e+02, percent-clipped=0.0 2023-10-11 15:06:06,565 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 15:06:10,491 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=745150.0, ans=0.0 2023-10-11 15:06:32,075 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=745196.6666666666, ans=0.05 2023-10-11 15:06:34,041 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=745243.3333333334, ans=0.0 2023-10-11 15:06:45,847 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=745290.0, ans=0.07 2023-10-11 15:06:46,604 INFO [train.py:1031] (3/4) Epoch 12, batch 9500, loss[loss=0.2045, simple_loss=0.3015, pruned_loss=0.05379, over 16835.00 frames. ], tot_loss[loss=0.2014, simple_loss=0.2908, pruned_loss=0.05604, over 32539643.04 frames. ], batch size: 155, lr: 2.93e-03, grad_scale: 32.0 2023-10-11 15:06:46,928 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=745290.0, ans=0.0 2023-10-11 15:06:54,059 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.62 vs. limit=15.0 2023-10-11 15:07:05,175 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.70 vs. limit=15.0 2023-10-11 15:07:33,809 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=745476.6666666666, ans=0.125 2023-10-11 15:07:37,151 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.83 vs. 
limit=22.5 2023-10-11 15:07:38,599 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.417e+02 1.667e+02 1.825e+02 2.093e+02 2.768e+02, threshold=3.650e+02, percent-clipped=0.0 2023-10-11 15:07:44,734 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=745523.3333333334, ans=0.125 2023-10-11 15:08:03,690 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=745616.6666666666, ans=0.0 2023-10-11 15:08:08,602 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=745616.6666666666, ans=0.125 2023-10-11 15:08:11,284 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.47 vs. limit=22.5 2023-10-11 15:08:51,180 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.59 vs. limit=15.0 2023-10-11 15:08:51,751 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=745803.3333333334, ans=0.07 2023-10-11 15:09:09,186 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=745850.0, ans=0.125 2023-10-11 15:09:09,251 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=745850.0, ans=0.2 2023-10-11 15:09:10,184 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=745850.0, ans=0.125 2023-10-11 15:09:14,391 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=6.99 vs. limit=15.0 2023-10-11 15:09:23,246 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.16 vs. limit=15.0 2023-10-11 15:09:28,089 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.376e+02 1.753e+02 1.917e+02 2.218e+02 3.224e+02, threshold=3.834e+02, percent-clipped=0.0 2023-10-11 15:09:33,712 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=745990.0, ans=0.0 2023-10-11 15:09:33,967 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.14 vs. limit=6.0 2023-10-11 15:10:00,885 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=746083.3333333334, ans=0.0 2023-10-11 15:10:08,756 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=746083.3333333334, ans=0.1 2023-10-11 15:10:10,082 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.57 vs. 
limit=6.0 2023-10-11 15:10:23,712 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=746176.6666666666, ans=0.2 2023-10-11 15:10:36,441 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=746223.3333333334, ans=0.125 2023-10-11 15:10:38,423 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=746223.3333333334, ans=0.0 2023-10-11 15:11:02,454 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.28 vs. limit=15.0 2023-10-11 15:11:13,559 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=746363.3333333334, ans=0.2 2023-10-11 15:11:21,269 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.74 vs. limit=15.0 2023-10-11 15:11:23,074 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.443e+02 1.709e+02 1.859e+02 2.130e+02 3.270e+02, threshold=3.718e+02, percent-clipped=0.0 2023-10-11 15:11:23,437 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=746410.0, ans=0.0 2023-10-11 15:11:23,747 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.87 vs. limit=10.0 2023-10-11 15:11:49,278 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=746550.0, ans=0.125 2023-10-11 15:12:01,776 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=746596.6666666666, ans=0.0 2023-10-11 15:12:12,424 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=746596.6666666666, ans=0.1 2023-10-11 15:12:17,336 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=746643.3333333334, ans=15.0 2023-10-11 15:12:34,826 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.70 vs. limit=15.0 2023-10-11 15:12:44,895 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=746736.6666666666, ans=0.0 2023-10-11 15:13:09,917 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=746830.0, ans=0.125 2023-10-11 15:13:19,244 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.282e+02 1.717e+02 1.882e+02 2.097e+02 2.818e+02, threshold=3.764e+02, percent-clipped=0.0 2023-10-11 15:13:21,053 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=746876.6666666666, ans=0.125 2023-10-11 15:13:23,833 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.15 vs. limit=15.0 2023-10-11 15:13:53,733 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.54 vs. 
limit=22.5 2023-10-11 15:14:10,179 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=747110.0, ans=0.125 2023-10-11 15:14:26,001 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.46 vs. limit=15.0 2023-10-11 15:14:27,575 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=747156.6666666666, ans=0.0 2023-10-11 15:14:57,591 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=747296.6666666666, ans=0.0 2023-10-11 15:14:59,943 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=747296.6666666666, ans=0.125 2023-10-11 15:15:04,514 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=747343.3333333334, ans=0.125 2023-10-11 15:15:09,819 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.348e+02 1.614e+02 1.799e+02 1.919e+02 2.970e+02, threshold=3.599e+02, percent-clipped=0.0 2023-10-11 15:15:32,567 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=747436.6666666666, ans=0.0 2023-10-11 15:15:34,822 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.97 vs. limit=22.5 2023-10-11 15:15:38,414 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=747483.3333333334, ans=10.0 2023-10-11 15:15:48,407 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=747530.0, ans=0.1 2023-10-11 15:15:49,392 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=747530.0, ans=0.2 2023-10-11 15:16:03,828 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=747576.6666666666, ans=0.04949747468305833 2023-10-11 15:16:09,616 INFO [train.py:1031] (3/4) Epoch 12, batch 10000, loss[loss=0.197, simple_loss=0.2894, pruned_loss=0.05232, over 16948.00 frames. ], tot_loss[loss=0.2006, simple_loss=0.2898, pruned_loss=0.05566, over 32586660.24 frames. ], batch size: 138, lr: 2.93e-03, grad_scale: 32.0 2023-10-11 15:16:15,136 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.03 vs. 
limit=15.0 2023-10-11 15:16:33,052 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=747716.6666666666, ans=0.0 2023-10-11 15:16:33,180 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=747716.6666666666, ans=0.2 2023-10-11 15:16:36,201 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=747716.6666666666, ans=15.0 2023-10-11 15:16:44,486 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=747763.3333333334, ans=0.125 2023-10-11 15:16:57,717 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=747810.0, ans=0.125 2023-10-11 15:16:59,408 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.507e+02 1.720e+02 1.904e+02 2.162e+02 2.851e+02, threshold=3.807e+02, percent-clipped=0.0 2023-10-11 15:17:06,815 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=747856.6666666666, ans=0.0 2023-10-11 15:17:20,843 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=747903.3333333334, ans=0.0 2023-10-11 15:17:20,945 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=747903.3333333334, ans=0.0 2023-10-11 15:17:55,756 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=748043.3333333334, ans=0.125 2023-10-11 15:18:01,642 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=748090.0, ans=0.1 2023-10-11 15:18:14,685 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=748136.6666666666, ans=0.0 2023-10-11 15:18:43,483 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=748230.0, ans=0.0 2023-10-11 15:18:48,368 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=748276.6666666666, ans=0.0 2023-10-11 15:18:50,909 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=748276.6666666666, ans=0.1 2023-10-11 15:18:53,610 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.381e+02 1.741e+02 2.039e+02 2.223e+02 2.984e+02, threshold=4.078e+02, percent-clipped=0.0 2023-10-11 15:19:36,093 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=748463.3333333334, ans=0.125 2023-10-11 15:19:43,125 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.17 vs. limit=15.0 2023-10-11 15:19:44,057 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=10.58 vs. 
limit=22.5 2023-10-11 15:19:46,196 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=748510.0, ans=0.125 2023-10-11 15:20:13,801 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=748603.3333333334, ans=0.125 2023-10-11 15:20:19,272 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=748650.0, ans=0.125 2023-10-11 15:20:21,823 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=748650.0, ans=0.0 2023-10-11 15:20:25,623 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=748650.0, ans=0.0 2023-10-11 15:20:37,003 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=748696.6666666666, ans=0.125 2023-10-11 15:20:44,999 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=748743.3333333334, ans=0.125 2023-10-11 15:20:48,849 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.402e+02 1.704e+02 1.834e+02 2.011e+02 2.765e+02, threshold=3.668e+02, percent-clipped=0.0 2023-10-11 15:21:00,598 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=748790.0, ans=0.1 2023-10-11 15:21:03,578 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=748836.6666666666, ans=0.0 2023-10-11 15:21:05,577 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=748836.6666666666, ans=0.2 2023-10-11 15:21:12,733 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=748836.6666666666, ans=0.2 2023-10-11 15:21:17,130 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=748883.3333333334, ans=0.0 2023-10-11 15:21:23,485 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=748883.3333333334, ans=0.1 2023-10-11 15:22:07,424 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.62 vs. 
limit=15.0 2023-10-11 15:22:19,603 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=749116.6666666666, ans=0.125 2023-10-11 15:22:28,786 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=749163.3333333334, ans=0.0 2023-10-11 15:22:39,333 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=749210.0, ans=0.125 2023-10-11 15:22:41,728 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=749210.0, ans=0.0 2023-10-11 15:22:45,542 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.368e+02 1.656e+02 1.817e+02 2.081e+02 2.546e+02, threshold=3.634e+02, percent-clipped=0.0 2023-10-11 15:22:48,620 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=749210.0, ans=0.04949747468305833 2023-10-11 15:22:54,157 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.17 vs. limit=12.0 2023-10-11 15:22:58,864 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=749256.6666666666, ans=0.025 2023-10-11 15:22:59,825 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=749256.6666666666, ans=0.0 2023-10-11 15:23:02,393 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=749303.3333333334, ans=0.0 2023-10-11 15:23:03,956 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=749303.3333333334, ans=0.2 2023-10-11 15:23:04,086 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=749303.3333333334, ans=0.1 2023-10-11 15:23:05,287 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.93 vs. 
limit=22.5 2023-10-11 15:23:21,313 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 15:23:25,029 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 15:23:25,796 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=749396.6666666666, ans=0.04949747468305833 2023-10-11 15:23:25,895 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=749396.6666666666, ans=0.025 2023-10-11 15:23:28,893 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=749396.6666666666, ans=0.1 2023-10-11 15:23:56,599 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=749490.0, ans=0.125 2023-10-11 15:24:00,931 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=749536.6666666666, ans=0.125 2023-10-11 15:24:17,055 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=749583.3333333334, ans=0.125 2023-10-11 15:24:19,412 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.83 vs. limit=15.0 2023-10-11 15:24:37,788 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=749676.6666666666, ans=0.0 2023-10-11 15:24:45,418 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.343e+02 1.648e+02 1.834e+02 2.097e+02 3.063e+02, threshold=3.668e+02, percent-clipped=0.0 2023-10-11 15:25:06,796 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.91 vs. limit=15.0 2023-10-11 15:25:14,730 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.66 vs. limit=22.5 2023-10-11 15:25:19,571 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.32 vs. limit=12.0 2023-10-11 15:25:29,065 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=749863.3333333334, ans=0.1 2023-10-11 15:25:34,139 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=749910.0, ans=0.125 2023-10-11 15:25:44,989 INFO [train.py:1031] (3/4) Epoch 12, batch 10500, loss[loss=0.194, simple_loss=0.2838, pruned_loss=0.05214, over 16307.00 frames. ], tot_loss[loss=0.2009, simple_loss=0.2902, pruned_loss=0.05578, over 32620786.15 frames. ], batch size: 50, lr: 2.92e-03, grad_scale: 32.0 2023-10-11 15:25:45,699 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.76 vs. 
limit=12.0 2023-10-11 15:25:47,115 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=749956.6666666666, ans=0.1 2023-10-11 15:26:13,526 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=750050.0, ans=0.0 2023-10-11 15:26:17,779 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.41 vs. limit=15.0 2023-10-11 15:26:17,911 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.39 vs. limit=22.5 2023-10-11 15:26:20,434 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=750096.6666666666, ans=0.2 2023-10-11 15:26:30,429 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 15:26:33,626 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.377e+02 1.672e+02 1.918e+02 2.122e+02 3.044e+02, threshold=3.835e+02, percent-clipped=0.0 2023-10-11 15:26:33,901 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=750143.3333333334, ans=0.2 2023-10-11 15:27:38,001 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=750376.6666666666, ans=0.1 2023-10-11 15:27:52,973 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=750470.0, ans=0.0 2023-10-11 15:27:54,606 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=750470.0, ans=0.125 2023-10-11 15:27:56,342 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=750470.0, ans=0.0 2023-10-11 15:28:15,046 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=750563.3333333334, ans=0.0 2023-10-11 15:28:20,078 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=750563.3333333334, ans=0.0 2023-10-11 15:28:35,210 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.496e+02 1.725e+02 1.833e+02 2.012e+02 2.637e+02, threshold=3.667e+02, percent-clipped=0.0 2023-10-11 15:28:35,545 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=750610.0, ans=0.0 2023-10-11 15:28:48,487 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=750703.3333333334, ans=0.125 2023-10-11 15:28:55,621 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=750703.3333333334, ans=0.125 2023-10-11 15:29:22,489 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=750843.3333333334, ans=0.1 2023-10-11 15:29:30,054 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=750843.3333333334, ans=0.125 2023-10-11 15:29:36,475 INFO [scaling.py:199] (3/4) ScheduledFloat: 
2023-10-11 15:29:36,654 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.90 vs. limit=15.0 2023-10-11 15:30:06,613 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=750983.3333333334, ans=0.125 2023-10-11 15:30:10,669 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=751030.0, ans=0.0 2023-10-11 15:30:18,515 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=751030.0, ans=0.125 2023-10-11 15:30:19,505 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=751076.6666666666, ans=0.025 2023-10-11 15:30:27,815 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.377e+02 1.664e+02 1.787e+02 1.988e+02 2.570e+02, threshold=3.575e+02, percent-clipped=0.0 2023-10-11 15:30:37,298 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=751123.3333333334, ans=0.125 2023-10-11 15:30:40,154 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.05 vs. limit=15.0 2023-10-11 15:30:58,139 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=751216.6666666666, ans=0.2 2023-10-11 15:31:22,645 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=751310.0, ans=0.2 2023-10-11 15:31:32,285 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=751356.6666666666, ans=0.125 2023-10-11 15:31:50,822 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=751450.0, ans=0.0 2023-10-11 15:31:56,501 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=751450.0, ans=0.125 2023-10-11 15:32:03,347 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=751496.6666666666, ans=0.0 2023-10-11 15:32:17,481 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.395e+02 1.748e+02 1.895e+02 2.139e+02 3.390e+02, threshold=3.790e+02, percent-clipped=0.0 2023-10-11 15:32:19,882 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.57 vs. limit=6.0 2023-10-11 15:32:36,592 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=751636.6666666666, ans=10.0 2023-10-11 15:32:51,379 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=751683.3333333334, ans=0.125 2023-10-11 15:32:59,067 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.89 vs.
limit=15.0 2023-10-11 15:32:59,073 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.98 vs. limit=6.0 2023-10-11 15:33:04,597 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.01 vs. limit=6.0 2023-10-11 15:33:10,845 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=751776.6666666666, ans=0.125 2023-10-11 15:33:23,604 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=751823.3333333334, ans=0.1 2023-10-11 15:33:27,639 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.94 vs. limit=10.0 2023-10-11 15:33:37,608 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=751870.0, ans=0.125 2023-10-11 15:33:40,007 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=751916.6666666666, ans=0.125 2023-10-11 15:33:40,028 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=751916.6666666666, ans=0.125 2023-10-11 15:34:10,127 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.295e+02 1.580e+02 1.725e+02 1.875e+02 2.764e+02, threshold=3.450e+02, percent-clipped=0.0 2023-10-11 15:34:19,741 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.92 vs. limit=15.0 2023-10-11 15:34:35,219 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.40 vs. limit=15.0 2023-10-11 15:34:55,751 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=752196.6666666666, ans=0.125 2023-10-11 15:35:03,215 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=752243.3333333334, ans=0.125 2023-10-11 15:35:06,132 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=752290.0, ans=0.0 2023-10-11 15:35:06,784 INFO [train.py:1031] (3/4) Epoch 12, batch 11000, loss[loss=0.2132, simple_loss=0.3069, pruned_loss=0.05981, over 16819.00 frames. ], tot_loss[loss=0.201, simple_loss=0.2903, pruned_loss=0.05589, over 32657606.50 frames. 
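], batch size: 188, lr: 2.92e-03, grad_scale: 32.0

In the train.py:1031 entries, loss[...] describes the current batch while tot_loss[...] aggregates every frame seen so far (32657606.50 here, growing batch by batch), i.e. each component of tot_loss is a frame-weighted running average of the per-frame losses. A minimal tracker with that behavior; icefall's train.py keeps this bookkeeping in a MetricsTracker-style helper, and the class below is a simplified stand-in rather than its code:

    from collections import defaultdict

    class FrameWeightedLoss:
        def __init__(self):
            self.sums = defaultdict(float)  # frame-weighted loss sums
            self.frames = 0.0

        def update(self, num_frames: float, **losses: float) -> None:
            self.frames += num_frames
            for name, per_frame_value in losses.items():
                self.sums[name] += per_frame_value * num_frames

        def averages(self) -> dict:
            return {k: v / self.frames for k, v in self.sums.items()}

    tracker = FrameWeightedLoss()
    tracker.update(16819.0, loss=0.2132, simple_loss=0.3069, pruned_loss=0.05981)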
2023-10-11 15:35:11,237 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=752290.0, ans=0.2 2023-10-11 15:35:16,071 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=752290.0, ans=0.2 2023-10-11 15:35:16,128 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=752290.0, ans=0.0 2023-10-11 15:35:16,133 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=752290.0, ans=0.0 2023-10-11 15:35:50,107 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=752476.6666666666, ans=0.0 2023-10-11 15:35:50,406 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.74 vs. limit=22.5 2023-10-11 15:35:53,833 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=752476.6666666666, ans=0.1 2023-10-11 15:35:57,934 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.398e+02 1.799e+02 2.068e+02 2.258e+02 3.113e+02, threshold=4.137e+02, percent-clipped=0.0 2023-10-11 15:35:58,220 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=752476.6666666666, ans=0.0 2023-10-11 15:36:16,312 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=752570.0, ans=10.0 2023-10-11 15:36:20,963 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=752570.0, ans=0.0 2023-10-11 15:36:40,315 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=752663.3333333334, ans=0.125 2023-10-11 15:36:58,102 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=752710.0, ans=0.0 2023-10-11 15:37:10,803 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=752756.6666666666, ans=10.0 2023-10-11 15:37:22,747 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=752803.3333333334, ans=10.0 2023-10-11 15:37:31,217 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=752850.0, ans=0.125 2023-10-11 15:37:44,292 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=752896.6666666666, ans=0.0 2023-10-11 15:37:52,776 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=752943.3333333334, ans=0.125 2023-10-11 15:37:55,198 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=752943.3333333334, ans=0.0 2023-10-11 15:38:01,714 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.341e+02 1.667e+02 1.871e+02 2.163e+02 3.458e+02, threshold=3.742e+02, percent-clipped=0.0 2023-10-11 15:38:05,801 INFO [scaling.py:979] (3/4) Whitening:
name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.32 vs. limit=6.0 2023-10-11 15:38:13,007 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=752990.0, ans=0.125 2023-10-11 15:38:16,004 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=753036.6666666666, ans=0.125 2023-10-11 15:38:21,471 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=753036.6666666666, ans=0.0 2023-10-11 15:38:25,590 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=753083.3333333334, ans=0.125 2023-10-11 15:38:35,214 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=753083.3333333334, ans=0.1 2023-10-11 15:38:41,785 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=753130.0, ans=0.125 2023-10-11 15:38:42,796 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=753130.0, ans=0.125 2023-10-11 15:39:03,336 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=753223.3333333334, ans=0.125 2023-10-11 15:39:07,380 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.13 vs. limit=15.0 2023-10-11 15:39:26,708 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=753316.6666666666, ans=0.0 2023-10-11 15:39:50,946 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.347e+02 1.640e+02 1.780e+02 1.959e+02 3.212e+02, threshold=3.560e+02, percent-clipped=0.0 2023-10-11 15:40:04,753 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=753503.3333333334, ans=0.0 2023-10-11 15:40:07,103 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=753503.3333333334, ans=0.0 2023-10-11 15:40:08,513 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=753503.3333333334, ans=0.0 2023-10-11 15:40:16,244 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=753503.3333333334, ans=0.0 2023-10-11 15:40:17,480 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=753503.3333333334, ans=0.2 2023-10-11 15:40:29,801 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=753550.0, ans=0.2 2023-10-11 15:40:34,942 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 15:40:37,852 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=753596.6666666666, ans=0.125 2023-10-11 15:40:52,431 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=6.68 vs. 
limit=15.0 2023-10-11 15:41:01,893 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=753690.0, ans=0.125 2023-10-11 15:41:07,918 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.32 vs. limit=15.0 2023-10-11 15:41:14,119 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=753736.6666666666, ans=0.95 2023-10-11 15:41:36,993 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.64 vs. limit=12.0 2023-10-11 15:41:40,301 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.56 vs. limit=15.0 2023-10-11 15:41:45,136 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=753876.6666666666, ans=0.0 2023-10-11 15:41:50,305 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.358e+02 1.705e+02 1.886e+02 2.119e+02 3.337e+02, threshold=3.772e+02, percent-clipped=0.0 2023-10-11 15:42:06,835 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=753970.0, ans=0.5 2023-10-11 15:42:09,847 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=753970.0, ans=0.2 2023-10-11 15:42:16,512 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=753970.0, ans=0.125 2023-10-11 15:42:44,371 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=754110.0, ans=0.125 2023-10-11 15:42:58,327 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=754156.6666666666, ans=0.125 2023-10-11 15:43:06,895 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=754203.3333333334, ans=0.0 2023-10-11 15:43:20,535 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=754250.0, ans=0.1 2023-10-11 15:43:43,188 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=754343.3333333334, ans=0.0 2023-10-11 15:43:47,316 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.509e+02 1.808e+02 2.019e+02 2.429e+02 3.277e+02, threshold=4.039e+02, percent-clipped=0.0 2023-10-11 15:43:52,160 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=754390.0, ans=0.0 2023-10-11 15:43:53,615 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.27 vs. limit=6.0 2023-10-11 15:43:58,476 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.35 vs. 
limit=15.0 2023-10-11 15:43:58,875 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=754390.0, ans=0.0 2023-10-11 15:44:00,908 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=754436.6666666666, ans=0.0 2023-10-11 15:44:09,215 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=754436.6666666666, ans=0.2 2023-10-11 15:44:28,390 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=754530.0, ans=0.125 2023-10-11 15:44:29,269 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=754530.0, ans=0.125 2023-10-11 15:44:44,855 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.12 vs. limit=15.0 2023-10-11 15:44:46,864 INFO [train.py:1031] (3/4) Epoch 12, batch 11500, loss[loss=0.212, simple_loss=0.3004, pruned_loss=0.06183, over 16102.00 frames. ], tot_loss[loss=0.2008, simple_loss=0.29, pruned_loss=0.0558, over 32689517.18 frames. ], batch size: 43, lr: 2.91e-03, grad_scale: 32.0 2023-10-11 15:45:12,877 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.09 vs. limit=10.0 2023-10-11 15:45:16,675 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.99 vs. limit=15.0 2023-10-11 15:45:32,514 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=754810.0, ans=0.0 2023-10-11 15:45:40,363 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.414e+02 1.738e+02 1.978e+02 2.245e+02 3.164e+02, threshold=3.956e+02, percent-clipped=0.0 2023-10-11 15:45:40,933 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=754810.0, ans=0.125 2023-10-11 15:45:44,878 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=754856.6666666666, ans=0.1 2023-10-11 15:45:53,745 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=754856.6666666666, ans=0.2 2023-10-11 15:45:55,468 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=754856.6666666666, ans=0.125 2023-10-11 15:46:12,969 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=754950.0, ans=0.125 2023-10-11 15:46:42,444 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.65 vs. 
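limit=22.5

The optim.py:471 entries summarize recently observed gradient norms as five quantiles (min, 25%, median, 75%, max) and derive the clipping threshold from them. The logged numbers are consistent with threshold = Clipping_scale x median: the first such entry in this excerpt has median 1.834e+02 and threshold 3.668e+02 = 2.0 x 1.834e+02. A sketch of that bookkeeping, not the optimizer's actual code path:

    import torch

    def clipping_stats(recent_grad_norms, clipping_scale: float = 2.0):
        norms = torch.tensor(recent_grad_norms, dtype=torch.float32)
        q = torch.quantile(norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = clipping_scale * q[2]  # scale times the median
        percent_clipped = 100.0 * (norms > threshold).float().mean()
        return q, threshold, percent_clipped

    # A gradient with norm g would then be rescaled by min(1, threshold / g);
    # percent-clipped=0.0 throughout this stretch means no batch exceeded it.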
2023-10-11 15:46:45,529 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=755090.0, ans=0.09899494936611666 2023-10-11 15:46:49,922 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=755090.0, ans=0.2 2023-10-11 15:46:49,982 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=755090.0, ans=0.2 2023-10-11 15:46:57,339 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.38 vs. limit=15.0 2023-10-11 15:47:18,866 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=755183.3333333334, ans=0.5 2023-10-11 15:47:24,701 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=755230.0, ans=0.0 2023-10-11 15:47:40,021 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.379e+02 1.614e+02 1.757e+02 2.008e+02 2.651e+02, threshold=3.514e+02, percent-clipped=0.0 2023-10-11 15:47:44,182 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.99 vs. limit=15.0 2023-10-11 15:47:44,930 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=755323.3333333334, ans=0.125 2023-10-11 15:47:54,124 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.31 vs. limit=6.0 2023-10-11 15:48:32,999 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=755510.0, ans=0.125 2023-10-11 15:48:33,005 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=755510.0, ans=0.125 2023-10-11 15:48:49,490 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=755603.3333333334, ans=0.125 2023-10-11 15:48:55,724 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=755603.3333333334, ans=0.125 2023-10-11 15:49:13,623 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.65 vs. limit=15.0 2023-10-11 15:49:16,288 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=755696.6666666666, ans=0.0 2023-10-11 15:49:16,442 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.52 vs.
limit=12.0 2023-10-11 15:49:33,118 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.395e+02 1.686e+02 1.852e+02 2.113e+02 2.806e+02, threshold=3.704e+02, percent-clipped=0.0 2023-10-11 15:49:35,952 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=755790.0, ans=0.1 2023-10-11 15:49:36,018 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=755790.0, ans=0.2 2023-10-11 15:49:44,028 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.57 vs. limit=15.0 2023-10-11 15:49:58,952 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.93 vs. limit=12.0 2023-10-11 15:50:22,466 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=755930.0, ans=0.125 2023-10-11 15:50:45,844 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=756023.3333333334, ans=0.125 2023-10-11 15:51:08,523 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=756116.6666666666, ans=0.125 2023-10-11 15:51:40,983 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.348e+02 1.661e+02 1.854e+02 2.117e+02 3.376e+02, threshold=3.708e+02, percent-clipped=0.0 2023-10-11 15:51:51,763 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=756256.6666666666, ans=10.0 2023-10-11 15:52:01,087 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=756303.3333333334, ans=0.125 2023-10-11 15:52:49,359 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.04 vs. limit=15.0 2023-10-11 15:53:06,964 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=756583.3333333334, ans=0.0 2023-10-11 15:53:18,676 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=756630.0, ans=0.1 2023-10-11 15:53:23,981 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=756630.0, ans=0.125 2023-10-11 15:53:38,798 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.336e+02 1.714e+02 1.865e+02 2.159e+02 3.232e+02, threshold=3.729e+02, percent-clipped=0.0 2023-10-11 15:53:48,383 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=756723.3333333334, ans=0.125 2023-10-11 15:54:03,599 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=756816.6666666666, ans=0.125 2023-10-11 15:54:36,250 INFO [train.py:1031] (3/4) Epoch 12, batch 12000, loss[loss=0.1927, simple_loss=0.2853, pruned_loss=0.0501, over 16532.00 frames. ], tot_loss[loss=0.2006, simple_loss=0.2901, pruned_loss=0.05553, over 32724661.30 frames. 
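], batch size: 241, lr: 2.91e-03, grad_scale: 16.0

grad_scale in these entries is the dynamic fp16 loss scale: it halved from 32.0 at batch 11500 to 16.0 here, which is consistent with a gradient scaler backing off after a step with inf/nan gradients. A generic PyTorch AMP step showing where such a value comes from (standard torch.cuda.amp usage, not the exact icefall training loop; the batch fields and criterion are placeholders):

    import torch

    def amp_step(model, optimizer, scaler, batch, criterion):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = criterion(model(batch["inputs"]), batch["targets"])
        scaler.scale(loss).backward()
        scaler.step(optimizer)  # internally skipped if grads contain inf/nan
        scaler.update()         # backs off the scale on overflow, regrows it later
        return loss.detach(), scaler.get_scale()  # get_scale() ~ "grad_scale"

    scaler = torch.cuda.amp.GradScaler(init_scale=32.0)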
2023-10-11 15:54:53,842 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=757003.3333333334, ans=0.125 2023-10-11 15:54:57,531 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=757003.3333333334, ans=0.2 2023-10-11 15:55:08,915 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 15:55:29,015 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=757143.3333333334, ans=0.125 2023-10-11 15:55:31,535 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=757143.3333333334, ans=0.125 2023-10-11 15:55:32,149 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.358e+02 1.656e+02 1.798e+02 2.124e+02 3.725e+02, threshold=3.595e+02, percent-clipped=0.0 2023-10-11 15:55:35,828 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=757190.0, ans=0.125 2023-10-11 15:55:54,614 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=757236.6666666666, ans=0.125 2023-10-11 15:56:12,963 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=757330.0, ans=0.125 2023-10-11 15:56:28,249 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=757376.6666666666, ans=0.125 2023-10-11 15:56:46,658 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.86 vs. limit=15.0 2023-10-11 15:57:03,446 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=757563.3333333334, ans=0.0 2023-10-11 15:57:03,455 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=757563.3333333334, ans=0.125 2023-10-11 15:57:04,687 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.20 vs. limit=6.0 2023-10-11 15:57:09,504 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=757563.3333333334, ans=0.0 2023-10-11 15:57:11,357 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=757563.3333333334, ans=0.1 2023-10-11 15:57:13,575 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=757563.3333333334, ans=0.125 2023-10-11 15:57:15,519 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 15:57:15,708 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=11.87 vs. limit=22.5 2023-10-11 15:57:21,272 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.36 vs.
limit=15.0 2023-10-11 15:57:23,671 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=757610.0, ans=0.0 2023-10-11 15:57:24,464 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.429e+02 1.637e+02 1.855e+02 2.003e+02 3.373e+02, threshold=3.711e+02, percent-clipped=0.0 2023-10-11 15:57:37,120 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=757703.3333333334, ans=0.1 2023-10-11 15:57:41,339 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.27 vs. limit=15.0 2023-10-11 15:57:46,173 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=757750.0, ans=0.125 2023-10-11 15:58:47,671 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=757983.3333333334, ans=0.125 2023-10-11 15:59:01,022 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.66 vs. limit=15.0 2023-10-11 15:59:07,866 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.06 vs. limit=15.0 2023-10-11 15:59:14,165 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.440e+02 1.707e+02 1.867e+02 2.036e+02 2.749e+02, threshold=3.734e+02, percent-clipped=0.0 2023-10-11 15:59:17,292 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=758123.3333333334, ans=0.125 2023-10-11 15:59:36,998 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=758216.6666666666, ans=0.125 2023-10-11 15:59:38,837 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=758216.6666666666, ans=0.05 2023-10-11 15:59:47,053 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=758263.3333333334, ans=0.0 2023-10-11 16:00:04,121 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=758310.0, ans=0.125 2023-10-11 16:00:13,961 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=6.65 vs. limit=15.0 2023-10-11 16:00:14,956 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.23 vs. 
limit=6.0 2023-10-11 16:00:35,075 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=758450.0, ans=0.125 2023-10-11 16:00:37,990 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=758450.0, ans=0.07 2023-10-11 16:00:55,591 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=758543.3333333334, ans=0.125 2023-10-11 16:01:02,072 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=758543.3333333334, ans=0.0 2023-10-11 16:01:06,463 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.441e+02 1.696e+02 1.878e+02 2.108e+02 3.119e+02, threshold=3.756e+02, percent-clipped=0.0 2023-10-11 16:01:07,857 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.54 vs. limit=15.0 2023-10-11 16:01:25,431 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=758636.6666666666, ans=0.125 2023-10-11 16:01:40,471 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=758683.3333333334, ans=0.5 2023-10-11 16:02:13,808 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=758823.3333333334, ans=0.0 2023-10-11 16:02:17,713 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=758870.0, ans=0.035 2023-10-11 16:02:46,076 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=758963.3333333334, ans=0.125 2023-10-11 16:03:04,408 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.498e+02 1.735e+02 1.877e+02 2.156e+02 3.156e+02, threshold=3.754e+02, percent-clipped=0.0 2023-10-11 16:03:10,444 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=759056.6666666666, ans=0.09899494936611666 2023-10-11 16:03:15,227 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.88 vs. limit=15.0 2023-10-11 16:03:17,219 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=759103.3333333334, ans=0.0 2023-10-11 16:03:18,259 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=759103.3333333334, ans=0.125 2023-10-11 16:03:35,512 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.53 vs. limit=15.0 2023-10-11 16:03:55,346 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=759243.3333333334, ans=0.07 2023-10-11 16:03:55,394 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=759243.3333333334, ans=0.125 2023-10-11 16:04:00,917 INFO [train.py:1031] (3/4) Epoch 12, batch 12500, loss[loss=0.2018, simple_loss=0.2933, pruned_loss=0.05513, over 16946.00 frames. 
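], tot_loss[loss=0.2006, simple_loss=0.29, pruned_loss=0.05562, over 32774201.73 frames. ], batch size: 72, lr: 2.91e-03, grad_scale: 16.0

The three logged components are not independent: throughout this excerpt the totals satisfy loss = 0.5 x simple_loss + pruned_loss to within rounding (0.5 x 0.2900 + 0.05562 = 0.2006 here, and 0.5 x 0.2902 + 0.05578 = 0.2009 at batch 10500), i.e. the pruned-transducer term carries full weight while the simple full-sum term is kept at half weight. The 0.5 and 1.0 scales below are inferred from the logged numbers, not read from the training configuration:

    def combined_loss(simple_loss: float, pruned_loss: float,
                      simple_loss_scale: float = 0.5,
                      pruned_loss_scale: float = 1.0) -> float:
        # Reproduces the logged totals, e.g. 0.5*0.2900 + 1.0*0.05562 = 0.2006.
        return simple_loss_scale * simple_loss + pruned_loss_scale * pruned_loss

    assert abs(combined_loss(0.2900, 0.05562) - 0.2006) < 5e-4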
2023-10-11 16:04:11,166 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.83 vs. limit=10.0 2023-10-11 16:04:18,843 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=759336.6666666666, ans=0.125 2023-10-11 16:04:37,804 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=759430.0, ans=0.0 2023-10-11 16:04:38,812 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff2.min_abs, batch_count=759430.0, ans=0.1 2023-10-11 16:04:48,684 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=759476.6666666666, ans=0.5 2023-10-11 16:04:55,129 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.356e+02 1.670e+02 1.844e+02 2.059e+02 2.894e+02, threshold=3.688e+02, percent-clipped=0.0 2023-10-11 16:04:57,281 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=759523.3333333334, ans=0.125 2023-10-11 16:04:59,735 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=759523.3333333334, ans=0.0 2023-10-11 16:05:08,053 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.16 vs. limit=6.0 2023-10-11 16:05:28,223 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.54 vs. limit=10.0 2023-10-11 16:05:49,522 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=759756.6666666666, ans=0.0 2023-10-11 16:06:13,968 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=759850.0, ans=0.0 2023-10-11 16:06:23,523 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=10.80 vs. limit=22.5 2023-10-11 16:06:36,343 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=759943.3333333334, ans=0.5 2023-10-11 16:06:40,664 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=759943.3333333334, ans=0.035 2023-10-11 16:06:43,786 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.18 vs. limit=6.0 2023-10-11 16:06:46,501 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.458e+02 1.668e+02 1.891e+02 2.120e+02 2.902e+02, threshold=3.782e+02, percent-clipped=0.0 2023-10-11 16:07:05,184 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=760036.6666666666, ans=0.125 2023-10-11 16:07:16,505 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.90 vs.
limit=22.5 2023-10-11 16:07:20,240 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 16:07:30,760 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=760130.0, ans=0.2 2023-10-11 16:07:37,444 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=760176.6666666666, ans=0.09899494936611666 2023-10-11 16:07:38,428 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=760176.6666666666, ans=0.0 2023-10-11 16:07:56,721 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=760223.3333333334, ans=0.0 2023-10-11 16:08:10,523 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.79 vs. limit=15.0 2023-10-11 16:08:16,235 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=760316.6666666666, ans=0.0 2023-10-11 16:08:26,218 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.65 vs. limit=10.0 2023-10-11 16:08:40,916 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.408e+02 1.712e+02 1.879e+02 2.129e+02 3.461e+02, threshold=3.759e+02, percent-clipped=0.0 2023-10-11 16:08:59,485 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten.whitening_limit, batch_count=760503.3333333334, ans=22.5 2023-10-11 16:09:08,319 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=760550.0, ans=0.125 2023-10-11 16:09:08,448 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=760550.0, ans=0.125 2023-10-11 16:09:09,238 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=760550.0, ans=0.1 2023-10-11 16:09:38,696 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=760690.0, ans=0.125 2023-10-11 16:09:40,002 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.88 vs. limit=22.5 2023-10-11 16:09:40,461 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 16:09:49,156 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=760736.6666666666, ans=0.125 2023-10-11 16:10:01,901 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=760783.3333333334, ans=0.1 2023-10-11 16:10:07,565 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.31 vs. 
limit=10.0 2023-10-11 16:10:09,282 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=760830.0, ans=0.1 2023-10-11 16:10:15,327 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=760830.0, ans=0.0 2023-10-11 16:10:21,894 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=760876.6666666666, ans=0.2 2023-10-11 16:10:32,365 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.458e+02 1.727e+02 1.956e+02 2.282e+02 3.159e+02, threshold=3.912e+02, percent-clipped=0.0 2023-10-11 16:10:37,343 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=760923.3333333334, ans=0.125 2023-10-11 16:10:40,080 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=760923.3333333334, ans=0.125 2023-10-11 16:10:42,643 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=760923.3333333334, ans=0.125 2023-10-11 16:10:43,888 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.14 vs. limit=15.0 2023-10-11 16:10:47,893 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=760970.0, ans=0.0 2023-10-11 16:11:08,509 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=761063.3333333334, ans=0.0 2023-10-11 16:11:21,003 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=761110.0, ans=0.2 2023-10-11 16:12:27,228 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.345e+02 1.632e+02 1.753e+02 1.957e+02 2.587e+02, threshold=3.505e+02, percent-clipped=0.0 2023-10-11 16:12:56,685 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.19 vs. limit=15.0 2023-10-11 16:13:08,383 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.82 vs. limit=15.0 2023-10-11 16:13:10,105 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=761576.6666666666, ans=0.125 2023-10-11 16:13:19,687 INFO [train.py:1031] (3/4) Epoch 12, batch 13000, loss[loss=0.2054, simple_loss=0.2954, pruned_loss=0.0577, over 16917.00 frames. ], tot_loss[loss=0.201, simple_loss=0.2906, pruned_loss=0.0557, over 32781000.15 frames. ], batch size: 156, lr: 2.90e-03, grad_scale: 16.0 2023-10-11 16:13:21,746 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.02 vs. limit=22.5 2023-10-11 16:13:31,492 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.96 vs. 
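limit=6.0

The balancer names that dominate these entries (min_positive, max_positive, min_abs, max_abs, prob) describe constraints on per-channel activation statistics: the fraction of positive values is kept inside [min_positive, max_positive] and the mean magnitude inside [min_abs, max_abs], with prob the scheduled probability of enforcing the constraint on a given step. Rendered as an explicit penalty for illustration only; the real balancer adjusts gradients directly rather than adding a loss term, and the defaults below merely echo values logged nearby:

    import torch

    def balancer_penalty(x: torch.Tensor, min_positive=0.05, max_positive=0.95,
                         min_abs=0.5, max_abs=10.0) -> torch.Tensor:
        # x: (num_frames, num_channels); sigmoid(20x) is a smooth,
        # differentiable proxy for counting positive entries per channel.
        pos_frac = torch.sigmoid(20.0 * x).mean(dim=0)
        mean_abs = x.abs().mean(dim=0)
        penalty = ((min_positive - pos_frac).clamp(min=0)
                   + (pos_frac - max_positive).clamp(min=0)
                   + (min_abs - mean_abs).clamp(min=0)
                   + (mean_abs - max_abs).clamp(min=0))
        return penalty.sum()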
2023-10-11 16:13:33,990 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=761670.0, ans=0.125 2023-10-11 16:13:39,410 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.30 vs. limit=15.0 2023-10-11 16:13:48,781 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 16:14:24,648 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.84 vs. limit=15.0 2023-10-11 16:14:27,563 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.404e+02 1.731e+02 1.935e+02 2.290e+02 3.397e+02, threshold=3.871e+02, percent-clipped=0.0 2023-10-11 16:15:02,109 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=761996.6666666666, ans=0.1 2023-10-11 16:15:09,454 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=761996.6666666666, ans=0.125 2023-10-11 16:15:27,220 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.83 vs. limit=15.0 2023-10-11 16:15:34,806 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=762136.6666666666, ans=0.2 2023-10-11 16:15:40,645 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=762136.6666666666, ans=0.125 2023-10-11 16:15:41,600 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=762136.6666666666, ans=0.125 2023-10-11 16:16:20,360 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.368e+02 1.652e+02 1.832e+02 2.054e+02 3.339e+02, threshold=3.663e+02, percent-clipped=0.0 2023-10-11 16:16:24,382 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=11.02 vs.
limit=22.5 2023-10-11 16:16:32,248 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=762370.0, ans=0.0 2023-10-11 16:16:43,974 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=762416.6666666666, ans=0.125 2023-10-11 16:16:47,377 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=762416.6666666666, ans=0.05 2023-10-11 16:16:47,490 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-11 16:17:35,178 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=762603.3333333334, ans=0.0 2023-10-11 16:17:46,703 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=762650.0, ans=0.1 2023-10-11 16:18:06,301 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=762743.3333333334, ans=0.2 2023-10-11 16:18:06,428 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=762743.3333333334, ans=0.0 2023-10-11 16:18:17,052 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.428e+02 1.696e+02 1.832e+02 2.062e+02 3.018e+02, threshold=3.665e+02, percent-clipped=0.0 2023-10-11 16:18:31,650 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=762836.6666666666, ans=0.125 2023-10-11 16:18:39,977 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=762883.3333333334, ans=0.125 2023-10-11 16:18:51,428 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=762930.0, ans=0.125 2023-10-11 16:18:52,644 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.14 vs. 
limit=10.0 2023-10-11 16:18:53,338 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=762930.0, ans=0.1 2023-10-11 16:18:57,550 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=762930.0, ans=0.0 2023-10-11 16:19:09,855 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=762976.6666666666, ans=0.125 2023-10-11 16:19:16,810 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=763023.3333333334, ans=0.2 2023-10-11 16:19:18,721 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=763023.3333333334, ans=0.125 2023-10-11 16:19:32,087 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=763070.0, ans=0.125 2023-10-11 16:19:47,644 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=763163.3333333334, ans=0.2 2023-10-11 16:20:03,174 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=763210.0, ans=0.05 2023-10-11 16:20:07,752 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.480e+02 1.771e+02 1.947e+02 2.146e+02 2.873e+02, threshold=3.893e+02, percent-clipped=0.0 2023-10-11 16:20:15,334 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=763256.6666666666, ans=0.0 2023-10-11 16:20:32,786 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.52 vs. limit=15.0 2023-10-11 16:20:33,810 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=763350.0, ans=0.125 2023-10-11 16:20:41,600 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=763396.6666666666, ans=0.0 2023-10-11 16:20:46,399 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=763396.6666666666, ans=0.05 2023-10-11 16:21:13,700 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=763490.0, ans=0.125 2023-10-11 16:21:18,663 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=763536.6666666666, ans=0.125 2023-10-11 16:21:25,935 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.40 vs. 
limit=15.0 2023-10-11 16:21:26,775 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 16:21:31,235 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=763583.3333333334, ans=0.125 2023-10-11 16:21:37,293 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=763583.3333333334, ans=0.125 2023-10-11 16:21:55,937 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=763676.6666666666, ans=0.1 2023-10-11 16:22:02,709 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.450e+02 1.708e+02 1.909e+02 2.135e+02 2.881e+02, threshold=3.818e+02, percent-clipped=0.0 2023-10-11 16:22:04,675 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=763723.3333333334, ans=0.0 2023-10-11 16:22:07,396 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=763723.3333333334, ans=0.1 2023-10-11 16:22:10,516 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.14 vs. limit=22.5 2023-10-11 16:22:12,303 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=763770.0, ans=0.125 2023-10-11 16:22:22,652 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=763816.6666666666, ans=0.1 2023-10-11 16:22:34,149 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=763863.3333333334, ans=0.125 2023-10-11 16:22:56,417 INFO [train.py:1031] (3/4) Epoch 12, batch 13500, loss[loss=0.2114, simple_loss=0.2891, pruned_loss=0.06684, over 15494.00 frames. ], tot_loss[loss=0.2004, simple_loss=0.2898, pruned_loss=0.05552, over 32802902.62 frames. ], batch size: 35, lr: 2.90e-03, grad_scale: 32.0 2023-10-11 16:23:04,707 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.36 vs. 
limit=15.0 2023-10-11 16:23:11,523 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=764003.3333333334, ans=0.0 2023-10-11 16:23:27,259 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=764050.0, ans=0.1 2023-10-11 16:23:39,566 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=764096.6666666666, ans=0.125 2023-10-11 16:23:42,333 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=764143.3333333334, ans=0.125 2023-10-11 16:23:50,819 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=764190.0, ans=0.125 2023-10-11 16:23:52,392 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.469e+02 1.734e+02 1.927e+02 2.347e+02 3.260e+02, threshold=3.853e+02, percent-clipped=0.0 2023-10-11 16:24:07,672 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=764236.6666666666, ans=0.09899494936611666 2023-10-11 16:24:16,200 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=764283.3333333334, ans=0.2 2023-10-11 16:24:24,401 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.38 vs. limit=15.0 2023-10-11 16:24:27,017 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=764330.0, ans=0.0 2023-10-11 16:24:33,906 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=764330.0, ans=0.125 2023-10-11 16:25:00,355 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 16:25:02,029 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=764470.0, ans=0.1 2023-10-11 16:25:20,324 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=764563.3333333334, ans=0.125 2023-10-11 16:25:37,550 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.413e+02 1.732e+02 1.901e+02 2.142e+02 3.181e+02, threshold=3.802e+02, percent-clipped=0.0 2023-10-11 16:26:16,797 INFO [train.py:1031] (3/4) Epoch 13, batch 0, loss[loss=0.1713, simple_loss=0.2692, pruned_loss=0.03671, over 16778.00 frames. ], tot_loss[loss=0.1713, simple_loss=0.2692, pruned_loss=0.03671, over 16778.00 frames. ], batch size: 98, lr: 2.77e-03, grad_scale: 32.0 2023-10-11 16:26:16,799 INFO [train.py:1054] (3/4) Computing validation loss 2023-10-11 16:26:24,455 INFO [train.py:1063] (3/4) Epoch 13, validation: loss=0.2183, simple_loss=0.306, pruned_loss=0.06527, over 1020973.00 frames. 
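The Epoch 13, batch 0 entries show the validation pass that runs before the epoch's first optimizer step: the model is evaluated over the dev set with gradients disabled, and the reported loss is frame-weighted in the same way as tot_loss (hence "over 1020973.00 frames"). A sketch of that loop, assuming compute_loss returns a summed loss and a frame count for one batch:

    import torch

    def validation_loss(model, dev_loader, compute_loss) -> float:
        model.eval()
        loss_sum, num_frames = 0.0, 0.0
        with torch.no_grad():
            for batch in dev_loader:
                batch_loss_sum, batch_frames = compute_loss(model, batch)
                loss_sum += float(batch_loss_sum)
                num_frames += float(batch_frames)
        model.train()
        return loss_sum / num_frames  # per-frame loss, e.g. 0.2183 here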
2023-10-11 16:26:24,457 INFO [train.py:1064] (3/4) Maximum memory allocated so far is 16891MB 2023-10-11 16:26:26,520 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=764684.6666666666, ans=0.125 2023-10-11 16:26:26,531 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=764684.6666666666, ans=0.125 2023-10-11 16:26:30,239 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=764684.6666666666, ans=0.125 2023-10-11 16:26:38,579 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=764731.3333333334, ans=0.0 2023-10-11 16:26:45,673 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=764731.3333333334, ans=0.1 2023-10-11 16:26:48,495 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 16:26:54,827 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=764778.0, ans=0.1 2023-10-11 16:27:07,473 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=764824.6666666666, ans=0.0 2023-10-11 16:27:14,553 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=764871.3333333334, ans=0.0 2023-10-11 16:27:53,391 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=765011.3333333334, ans=0.125 2023-10-11 16:28:12,812 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.390e+02 1.708e+02 1.952e+02 2.399e+02 5.505e+02, threshold=3.905e+02, percent-clipped=4.0 2023-10-11 16:28:16,767 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=765104.6666666666, ans=0.1 2023-10-11 16:28:24,249 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=765151.3333333334, ans=0.125 2023-10-11 16:28:26,227 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=765151.3333333334, ans=0.125 2023-10-11 16:28:34,862 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=765198.0, ans=0.0 2023-10-11 16:28:37,855 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.45 vs. 
limit=15.0 2023-10-11 16:28:38,785 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=765198.0, ans=0.125 2023-10-11 16:28:43,089 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=765244.6666666666, ans=0.2 2023-10-11 16:29:03,676 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=765338.0, ans=0.125 2023-10-11 16:29:07,445 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=765338.0, ans=0.0 2023-10-11 16:29:10,147 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=765338.0, ans=0.1 2023-10-11 16:29:12,542 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=765338.0, ans=0.125 2023-10-11 16:29:14,760 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.76 vs. limit=15.0 2023-10-11 16:29:16,753 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=765384.6666666666, ans=0.1 2023-10-11 16:29:16,891 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=765384.6666666666, ans=0.125 2023-10-11 16:29:18,882 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=765384.6666666666, ans=0.5 2023-10-11 16:29:21,720 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=765384.6666666666, ans=0.2 2023-10-11 16:29:25,407 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=765431.3333333334, ans=0.125 2023-10-11 16:30:02,070 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.258e+02 1.643e+02 1.767e+02 1.891e+02 2.393e+02, threshold=3.534e+02, percent-clipped=0.0 2023-10-11 16:30:10,979 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.87 vs. limit=15.0 2023-10-11 16:30:14,793 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=765618.0, ans=0.125 2023-10-11 16:30:22,607 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=765664.6666666666, ans=0.125 2023-10-11 16:30:28,321 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=765711.3333333334, ans=0.2 2023-10-11 16:30:33,965 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=765711.3333333334, ans=0.0 2023-10-11 16:30:36,807 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=765758.0, ans=0.125 2023-10-11 16:31:09,339 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.37 vs. 
limit=22.5 2023-10-11 16:31:20,271 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=765898.0, ans=0.2 2023-10-11 16:31:25,354 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=765898.0, ans=0.2 2023-10-11 16:31:27,231 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=765898.0, ans=10.0 2023-10-11 16:31:33,857 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=765944.6666666666, ans=0.0 2023-10-11 16:31:41,324 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=765991.3333333334, ans=0.125 2023-10-11 16:31:49,729 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=766038.0, ans=0.0 2023-10-11 16:31:53,157 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=766038.0, ans=0.125 2023-10-11 16:31:55,947 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.372e+02 1.676e+02 1.842e+02 2.223e+02 3.061e+02, threshold=3.685e+02, percent-clipped=0.0 2023-10-11 16:32:00,788 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=766084.6666666666, ans=0.125 2023-10-11 16:32:10,637 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=766084.6666666666, ans=0.1 2023-10-11 16:33:19,540 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=766411.3333333334, ans=15.0 2023-10-11 16:33:29,654 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.30 vs. limit=22.5 2023-10-11 16:33:42,833 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=766504.6666666666, ans=0.2 2023-10-11 16:33:43,413 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.313e+02 1.732e+02 1.917e+02 2.155e+02 3.157e+02, threshold=3.834e+02, percent-clipped=0.0 2023-10-11 16:33:47,150 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.23 vs. 
limit=22.5 2023-10-11 16:33:49,783 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=766551.3333333334, ans=0.0 2023-10-11 16:33:56,137 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=766551.3333333334, ans=0.125 2023-10-11 16:34:09,510 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=766598.0, ans=0.125 2023-10-11 16:34:14,874 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=766644.6666666666, ans=0.0 2023-10-11 16:34:15,606 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=766644.6666666666, ans=0.125 2023-10-11 16:34:21,191 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=766691.3333333334, ans=0.1 2023-10-11 16:34:48,807 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=766784.6666666666, ans=0.125 2023-10-11 16:34:50,018 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.74 vs. limit=10.0 2023-10-11 16:35:00,524 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=766831.3333333334, ans=0.125 2023-10-11 16:35:24,524 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=766924.6666666666, ans=0.125 2023-10-11 16:35:25,614 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=766924.6666666666, ans=0.1 2023-10-11 16:35:36,036 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=766971.3333333334, ans=0.125 2023-10-11 16:35:39,169 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.339e+02 1.777e+02 1.942e+02 2.286e+02 3.304e+02, threshold=3.884e+02, percent-clipped=0.0 2023-10-11 16:35:43,040 INFO [train.py:1031] (3/4) Epoch 13, batch 500, loss[loss=0.1802, simple_loss=0.273, pruned_loss=0.04369, over 16553.00 frames. ], tot_loss[loss=0.2014, simple_loss=0.2907, pruned_loss=0.05599, over 7282550.75 frames. ], batch size: 219, lr: 2.77e-03, grad_scale: 32.0 2023-10-11 16:35:58,781 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.86 vs. limit=10.0 2023-10-11 16:36:05,946 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.42 vs. 
limit=6.0 2023-10-11 16:36:11,619 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=767111.3333333334, ans=0.0 2023-10-11 16:36:31,057 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=767204.6666666666, ans=0.1 2023-10-11 16:36:31,069 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=767204.6666666666, ans=0.125 2023-10-11 16:36:31,150 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=767204.6666666666, ans=0.0 2023-10-11 16:36:33,738 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.76 vs. limit=22.5 2023-10-11 16:36:37,846 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=767251.3333333334, ans=0.035 2023-10-11 16:36:41,239 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.88 vs. limit=15.0 2023-10-11 16:36:50,273 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=767298.0, ans=0.0 2023-10-11 16:37:05,150 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=767344.6666666666, ans=0.125 2023-10-11 16:37:23,720 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=767438.0, ans=0.125 2023-10-11 16:37:31,132 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.460e+02 1.779e+02 2.034e+02 2.282e+02 3.435e+02, threshold=4.068e+02, percent-clipped=0.0 2023-10-11 16:37:42,423 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff2.min_abs, batch_count=767484.6666666666, ans=0.1 2023-10-11 16:37:58,752 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 16:38:11,900 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=767624.6666666666, ans=0.2 2023-10-11 16:38:20,739 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.68 vs. 
limit=15.0 2023-10-11 16:38:30,129 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=767671.3333333334, ans=0.2 2023-10-11 16:38:38,885 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=767718.0, ans=0.125 2023-10-11 16:38:43,345 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=767764.6666666666, ans=0.2 2023-10-11 16:38:55,688 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer_na.min_abs, batch_count=767811.3333333334, ans=0.02 2023-10-11 16:38:58,573 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 16:39:16,577 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.21 vs. limit=10.0 2023-10-11 16:39:21,068 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.349e+02 1.784e+02 1.927e+02 2.115e+02 2.737e+02, threshold=3.854e+02, percent-clipped=0.0 2023-10-11 16:39:26,063 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=767951.3333333334, ans=0.5 2023-10-11 16:39:47,174 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=768044.6666666666, ans=0.0 2023-10-11 16:39:55,513 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=768044.6666666666, ans=0.125 2023-10-11 16:40:12,392 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=768138.0, ans=0.0 2023-10-11 16:40:16,217 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=768138.0, ans=0.0 2023-10-11 16:40:23,481 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=768138.0, ans=0.125 2023-10-11 16:40:26,059 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=768184.6666666666, ans=0.0 2023-10-11 16:40:29,279 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=768184.6666666666, ans=0.0 2023-10-11 16:40:30,650 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.73 vs. limit=22.5 2023-10-11 16:40:35,493 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.47 vs. 
limit=6.0 2023-10-11 16:40:42,777 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=768231.3333333334, ans=0.125 2023-10-11 16:40:43,541 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=768231.3333333334, ans=0.125 2023-10-11 16:40:57,604 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=768278.0, ans=0.0 2023-10-11 16:40:57,821 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.43 vs. limit=15.0 2023-10-11 16:41:17,246 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.493e+02 1.733e+02 1.811e+02 2.071e+02 2.964e+02, threshold=3.623e+02, percent-clipped=0.0 2023-10-11 16:41:26,876 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=768418.0, ans=0.2 2023-10-11 16:41:28,465 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.73 vs. limit=15.0 2023-10-11 16:41:35,259 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.74 vs. limit=15.0 2023-10-11 16:41:40,829 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=768464.6666666666, ans=0.0 2023-10-11 16:41:42,941 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.62 vs. limit=15.0 2023-10-11 16:41:57,299 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=768511.3333333334, ans=0.0 2023-10-11 16:42:05,076 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=768558.0, ans=0.0 2023-10-11 16:42:08,652 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=768558.0, ans=0.0 2023-10-11 16:42:12,934 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.48 vs. limit=15.0 2023-10-11 16:42:14,884 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.35 vs. limit=15.0 2023-10-11 16:42:46,540 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.76 vs. limit=15.0 2023-10-11 16:42:48,377 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=768698.0, ans=0.1 2023-10-11 16:42:54,902 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=768744.6666666666, ans=0.0 2023-10-11 16:43:01,134 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.71 vs. limit=6.0 2023-10-11 16:43:02,102 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.29 vs. 
limit=12.0 2023-10-11 16:43:11,348 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=768791.3333333334, ans=0.125 2023-10-11 16:43:23,501 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.411e+02 1.719e+02 1.883e+02 2.125e+02 2.964e+02, threshold=3.767e+02, percent-clipped=0.0 2023-10-11 16:43:24,775 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=768838.0, ans=0.1 2023-10-11 16:43:33,232 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=768884.6666666666, ans=0.125 2023-10-11 16:43:51,248 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.20 vs. limit=10.0 2023-10-11 16:43:57,159 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=768978.0, ans=0.2 2023-10-11 16:44:03,401 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.58 vs. limit=15.0 2023-10-11 16:44:12,661 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=769071.3333333334, ans=0.1 2023-10-11 16:44:13,000 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.89 vs. limit=6.0 2023-10-11 16:44:21,687 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 16:44:23,469 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.04 vs. limit=15.0 2023-10-11 16:44:32,554 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.42 vs. limit=15.0 2023-10-11 16:44:37,480 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=769164.6666666666, ans=0.125 2023-10-11 16:44:40,457 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.31 vs. limit=10.0 2023-10-11 16:45:07,711 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=769258.0, ans=0.125 2023-10-11 16:45:10,226 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=769304.6666666666, ans=0.125 2023-10-11 16:45:18,601 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.319e+02 1.666e+02 1.805e+02 1.947e+02 2.979e+02, threshold=3.611e+02, percent-clipped=0.0 2023-10-11 16:45:21,414 INFO [train.py:1031] (3/4) Epoch 13, batch 1000, loss[loss=0.1993, simple_loss=0.292, pruned_loss=0.05334, over 16872.00 frames. ], tot_loss[loss=0.2015, simple_loss=0.2908, pruned_loss=0.05614, over 12944804.02 frames. 
], batch size: 110, lr: 2.76e-03, grad_scale: 32.0 2023-10-11 16:45:24,952 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=769351.3333333334, ans=0.125 2023-10-11 16:45:36,445 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.77 vs. limit=6.0 2023-10-11 16:45:58,923 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=769491.3333333334, ans=0.1 2023-10-11 16:46:12,035 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=769538.0, ans=0.125 2023-10-11 16:46:13,973 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=769584.6666666666, ans=0.125 2023-10-11 16:46:24,470 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 16:46:27,713 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.69 vs. limit=15.0 2023-10-11 16:46:28,203 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=769631.3333333334, ans=0.04949747468305833 2023-10-11 16:46:51,542 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=769724.6666666666, ans=0.0 2023-10-11 16:46:54,798 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.99 vs. 
limit=15.0 2023-10-11 16:47:06,550 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.335e+02 1.690e+02 1.909e+02 2.091e+02 2.792e+02, threshold=3.818e+02, percent-clipped=0.0 2023-10-11 16:47:15,069 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=769818.0, ans=0.1 2023-10-11 16:47:30,019 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=769864.6666666666, ans=0.1 2023-10-11 16:47:53,081 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=769958.0, ans=0.125 2023-10-11 16:48:13,023 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=770051.3333333334, ans=0.1 2023-10-11 16:48:49,948 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=770144.6666666666, ans=0.125 2023-10-11 16:48:54,190 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=770144.6666666666, ans=0.5 2023-10-11 16:49:14,812 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.364e+02 1.646e+02 1.802e+02 2.045e+02 3.030e+02, threshold=3.603e+02, percent-clipped=0.0 2023-10-11 16:49:19,348 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=770284.6666666666, ans=0.125 2023-10-11 16:49:23,087 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.43 vs. limit=22.5 2023-10-11 16:49:35,740 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=770331.3333333334, ans=0.2 2023-10-11 16:49:47,716 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=770378.0, ans=0.125 2023-10-11 16:49:49,818 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=770378.0, ans=0.125 2023-10-11 16:49:58,686 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=770424.6666666666, ans=0.125 2023-10-11 16:50:05,704 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.46 vs. limit=22.5 2023-10-11 16:50:06,716 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=17.40 vs. 
limit=15.0 2023-10-11 16:50:37,435 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=770611.3333333334, ans=0.125 2023-10-11 16:50:47,048 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=770658.0, ans=0.125 2023-10-11 16:50:53,707 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=770658.0, ans=0.0 2023-10-11 16:50:55,624 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=770704.6666666666, ans=0.125 2023-10-11 16:51:04,431 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.392e+02 1.660e+02 1.834e+02 2.035e+02 3.154e+02, threshold=3.667e+02, percent-clipped=0.0 2023-10-11 16:51:26,991 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.53 vs. limit=12.0 2023-10-11 16:51:40,342 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.13 vs. limit=15.0 2023-10-11 16:51:43,634 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=770891.3333333334, ans=0.125 2023-10-11 16:51:47,406 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=770891.3333333334, ans=0.0 2023-10-11 16:52:09,573 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=770984.6666666666, ans=0.0 2023-10-11 16:52:11,398 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=770984.6666666666, ans=0.125 2023-10-11 16:52:30,835 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=3.99 vs. 
limit=15.0 2023-10-11 16:52:57,745 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.296e+02 1.752e+02 1.889e+02 2.193e+02 3.124e+02, threshold=3.779e+02, percent-clipped=0.0 2023-10-11 16:53:06,167 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer_ff3.min_abs, batch_count=771218.0, ans=0.2 2023-10-11 16:53:24,942 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 16:53:39,696 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=771358.0, ans=0.125 2023-10-11 16:53:51,322 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=771404.6666666666, ans=0.0 2023-10-11 16:54:29,935 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=771591.3333333334, ans=0.125 2023-10-11 16:54:43,175 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=771638.0, ans=0.2 2023-10-11 16:54:44,848 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=771638.0, ans=0.2 2023-10-11 16:54:51,490 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.260e+02 1.701e+02 1.885e+02 2.028e+02 2.925e+02, threshold=3.771e+02, percent-clipped=0.0 2023-10-11 16:54:54,082 INFO [train.py:1031] (3/4) Epoch 13, batch 1500, loss[loss=0.2302, simple_loss=0.3004, pruned_loss=0.08, over 15686.00 frames. ], tot_loss[loss=0.1998, simple_loss=0.2891, pruned_loss=0.05526, over 17343277.40 frames. ], batch size: 350, lr: 2.76e-03, grad_scale: 32.0 2023-10-11 16:55:00,611 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=771684.6666666666, ans=0.0 2023-10-11 16:55:06,629 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.71 vs. limit=22.5 2023-10-11 16:55:14,812 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=771778.0, ans=0.125 2023-10-11 16:55:26,522 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=771824.6666666666, ans=0.0 2023-10-11 16:55:32,031 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=771824.6666666666, ans=0.125 2023-10-11 16:55:40,255 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=771871.3333333334, ans=0.125 2023-10-11 16:55:53,043 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=771918.0, ans=0.05 2023-10-11 16:56:21,741 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.13 vs. 
limit=10.0 2023-10-11 16:56:22,744 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=772058.0, ans=0.07 2023-10-11 16:56:32,535 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=772058.0, ans=0.125 2023-10-11 16:56:43,130 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.405e+02 1.716e+02 1.889e+02 2.098e+02 3.219e+02, threshold=3.779e+02, percent-clipped=0.0 2023-10-11 16:56:54,733 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=772151.3333333334, ans=0.1 2023-10-11 16:57:02,311 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=772198.0, ans=0.125 2023-10-11 16:57:06,558 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.00 vs. limit=15.0 2023-10-11 16:57:14,505 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.38 vs. limit=15.0 2023-10-11 16:57:15,758 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=772244.6666666666, ans=0.1 2023-10-11 16:57:43,950 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=772338.0, ans=0.0 2023-10-11 16:57:45,622 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=772338.0, ans=0.0 2023-10-11 16:58:05,367 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.72 vs. limit=15.0 2023-10-11 16:58:19,804 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=772478.0, ans=0.125 2023-10-11 16:58:19,819 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=772478.0, ans=0.0 2023-10-11 16:58:28,632 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=772524.6666666666, ans=0.0 2023-10-11 16:58:35,574 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=772571.3333333334, ans=0.2 2023-10-11 16:58:42,315 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.413e+02 1.617e+02 1.832e+02 2.084e+02 3.024e+02, threshold=3.663e+02, percent-clipped=0.0 2023-10-11 16:58:45,321 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=772618.0, ans=0.0 2023-10-11 16:59:08,147 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.11 vs. limit=15.0 2023-10-11 16:59:16,035 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.47 vs. 
limit=15.0 2023-10-11 17:00:01,721 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=772944.6666666666, ans=0.2 2023-10-11 17:00:01,974 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.11 vs. limit=15.0 2023-10-11 17:00:21,224 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=772991.3333333334, ans=0.0 2023-10-11 17:00:29,450 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=773038.0, ans=0.125 2023-10-11 17:00:32,800 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.417e+02 1.683e+02 1.885e+02 2.125e+02 3.193e+02, threshold=3.770e+02, percent-clipped=0.0 2023-10-11 17:00:37,482 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=773084.6666666666, ans=0.0 2023-10-11 17:00:52,625 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=773131.3333333334, ans=0.125 2023-10-11 17:00:56,466 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 17:01:08,042 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.40 vs. limit=6.0 2023-10-11 17:01:15,055 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.27 vs. limit=15.0 2023-10-11 17:01:18,177 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=773271.3333333334, ans=0.1 2023-10-11 17:01:21,148 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=773271.3333333334, ans=0.125 2023-10-11 17:01:22,307 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.61 vs. limit=22.5 2023-10-11 17:01:43,245 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.31 vs. 
limit=15.0 2023-10-11 17:01:55,447 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=773411.3333333334, ans=0.2 2023-10-11 17:02:08,158 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=773458.0, ans=0.125 2023-10-11 17:02:18,008 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=773504.6666666666, ans=0.125 2023-10-11 17:02:22,073 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.433e+02 1.681e+02 1.884e+02 2.076e+02 2.770e+02, threshold=3.767e+02, percent-clipped=0.0 2023-10-11 17:02:37,214 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=773598.0, ans=0.125 2023-10-11 17:02:38,852 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten.whitening_limit, batch_count=773598.0, ans=15.0 2023-10-11 17:02:40,914 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=773598.0, ans=0.125 2023-10-11 17:02:44,710 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=773644.6666666666, ans=0.0 2023-10-11 17:02:50,227 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=773644.6666666666, ans=0.0 2023-10-11 17:02:52,294 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.98 vs. limit=15.0 2023-10-11 17:03:07,742 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=773691.3333333334, ans=0.125 2023-10-11 17:03:11,808 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=773738.0, ans=0.0 2023-10-11 17:03:29,685 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.39 vs. limit=15.0 2023-10-11 17:03:31,750 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=773784.6666666666, ans=0.1 2023-10-11 17:03:52,606 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.83 vs. limit=12.0 2023-10-11 17:03:54,679 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.14 vs. limit=15.0 2023-10-11 17:04:24,618 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=773971.3333333334, ans=0.1 2023-10-11 17:04:28,697 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.369e+02 1.638e+02 1.768e+02 2.069e+02 2.955e+02, threshold=3.535e+02, percent-clipped=0.0 2023-10-11 17:04:30,858 INFO [train.py:1031] (3/4) Epoch 13, batch 2000, loss[loss=0.2041, simple_loss=0.2984, pruned_loss=0.05493, over 16709.00 frames. ], tot_loss[loss=0.2002, simple_loss=0.2897, pruned_loss=0.05539, over 20743713.71 frames. 
], batch size: 81, lr: 2.76e-03, grad_scale: 32.0 2023-10-11 17:04:36,902 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.86 vs. limit=15.0 2023-10-11 17:04:40,810 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=774064.6666666666, ans=0.0 2023-10-11 17:04:41,781 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=774064.6666666666, ans=0.1 2023-10-11 17:04:47,100 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=774064.6666666666, ans=0.0 2023-10-11 17:04:53,969 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=774111.3333333334, ans=0.2 2023-10-11 17:05:36,218 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=774251.3333333334, ans=0.0 2023-10-11 17:05:44,226 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.32 vs. limit=15.0 2023-10-11 17:05:51,048 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=774298.0, ans=0.125 2023-10-11 17:05:53,197 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=774298.0, ans=0.0 2023-10-11 17:06:26,826 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=774391.3333333334, ans=0.125 2023-10-11 17:06:44,507 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.299e+02 1.680e+02 1.868e+02 2.098e+02 3.054e+02, threshold=3.736e+02, percent-clipped=0.0 2023-10-11 17:07:01,800 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.77 vs. limit=12.0 2023-10-11 17:07:06,859 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=774531.3333333334, ans=0.125 2023-10-11 17:07:15,357 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=774531.3333333334, ans=0.125 2023-10-11 17:07:17,531 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.28 vs. limit=15.0 2023-10-11 17:07:34,852 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.75 vs. limit=15.0 2023-10-11 17:08:13,905 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=774718.0, ans=0.0 2023-10-11 17:08:14,875 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=774718.0, ans=0.125 2023-10-11 17:08:56,484 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=774858.0, ans=0.0 2023-10-11 17:09:08,074 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.17 vs. 
limit=22.5 2023-10-11 17:09:09,105 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.72 vs. limit=6.0 2023-10-11 17:09:10,266 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.351e+02 1.747e+02 1.980e+02 2.145e+02 2.940e+02, threshold=3.961e+02, percent-clipped=0.0 2023-10-11 17:09:14,336 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=774951.3333333334, ans=0.2 2023-10-11 17:09:29,996 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.26 vs. limit=22.5 2023-10-11 17:09:31,368 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=774998.0, ans=0.125 2023-10-11 17:09:46,265 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.87 vs. limit=15.0 2023-10-11 17:09:54,358 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.26 vs. limit=15.0 2023-10-11 17:09:58,807 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=775138.0, ans=0.2 2023-10-11 17:10:05,022 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=775184.6666666666, ans=0.1 2023-10-11 17:10:09,449 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.12 vs. limit=6.0 2023-10-11 17:10:13,969 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.10 vs. limit=15.0 2023-10-11 17:10:20,956 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=775231.3333333334, ans=0.1 2023-10-11 17:10:22,628 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=775231.3333333334, ans=0.1 2023-10-11 17:10:33,154 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.42 vs. limit=22.5 2023-10-11 17:10:57,634 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.496e+02 1.726e+02 1.937e+02 2.179e+02 3.134e+02, threshold=3.874e+02, percent-clipped=0.0 2023-10-11 17:11:02,682 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.80 vs. limit=15.0 2023-10-11 17:11:05,128 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=775418.0, ans=0.125 2023-10-11 17:11:08,437 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=775464.6666666666, ans=0.0 2023-10-11 17:11:37,902 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.30 vs. 
limit=22.5 2023-10-11 17:11:47,716 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=775604.6666666666, ans=0.2 2023-10-11 17:11:55,035 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=775651.3333333334, ans=0.125 2023-10-11 17:12:19,162 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.03 vs. limit=10.0 2023-10-11 17:12:31,545 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=775791.3333333334, ans=0.125 2023-10-11 17:12:38,812 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=775791.3333333334, ans=0.125 2023-10-11 17:12:50,696 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.466e+02 1.735e+02 1.913e+02 2.191e+02 3.165e+02, threshold=3.825e+02, percent-clipped=0.0 2023-10-11 17:13:01,149 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=775884.6666666666, ans=0.125 2023-10-11 17:13:04,689 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=775931.3333333334, ans=0.125 2023-10-11 17:13:10,743 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.18 vs. limit=15.0 2023-10-11 17:13:25,500 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=776024.6666666666, ans=0.1 2023-10-11 17:13:53,124 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=776118.0, ans=0.125 2023-10-11 17:14:11,216 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.78 vs. limit=15.0 2023-10-11 17:14:20,576 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=776258.0, ans=0.0 2023-10-11 17:14:41,819 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.518e+02 1.730e+02 1.888e+02 2.296e+02 3.046e+02, threshold=3.776e+02, percent-clipped=0.0 2023-10-11 17:14:42,206 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=776351.3333333334, ans=0.05 2023-10-11 17:14:42,829 INFO [train.py:1031] (3/4) Epoch 13, batch 2500, loss[loss=0.1921, simple_loss=0.2782, pruned_loss=0.05299, over 16897.00 frames. ], tot_loss[loss=0.2004, simple_loss=0.2897, pruned_loss=0.05556, over 23404699.01 frames. 
], batch size: 72, lr: 2.75e-03, grad_scale: 32.0 2023-10-11 17:14:46,800 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=776351.3333333334, ans=0.125 2023-10-11 17:14:53,665 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=776398.0, ans=0.0 2023-10-11 17:15:32,194 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=776538.0, ans=0.07 2023-10-11 17:16:07,992 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.08 vs. limit=15.0 2023-10-11 17:16:13,534 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.52 vs. limit=6.0 2023-10-11 17:16:22,931 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=776771.3333333334, ans=0.125 2023-10-11 17:16:31,684 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=776771.3333333334, ans=0.125 2023-10-11 17:16:32,239 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.413e+02 1.751e+02 1.965e+02 2.216e+02 2.740e+02, threshold=3.931e+02, percent-clipped=0.0 2023-10-11 17:16:38,120 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=10.56 vs. limit=22.5 2023-10-11 17:17:02,442 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=776911.3333333334, ans=0.1 2023-10-11 17:17:02,443 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=776911.3333333334, ans=0.125 2023-10-11 17:17:02,787 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.26 vs. limit=15.0 2023-10-11 17:17:09,240 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=776958.0, ans=0.125 2023-10-11 17:17:24,234 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.35 vs. limit=15.0 2023-10-11 17:17:28,887 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=777004.6666666666, ans=0.09899494936611666 2023-10-11 17:17:30,072 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.56 vs. limit=15.0 2023-10-11 17:17:31,085 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.82 vs. 
limit=15.0 2023-10-11 17:17:31,698 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=777051.3333333334, ans=0.125 2023-10-11 17:17:36,353 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=777051.3333333334, ans=0.125 2023-10-11 17:17:39,342 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=777051.3333333334, ans=0.1 2023-10-11 17:18:03,386 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys.whitening_limit, batch_count=777144.6666666666, ans=6.0 2023-10-11 17:18:04,102 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=777144.6666666666, ans=0.125 2023-10-11 17:18:35,518 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.355e+02 1.672e+02 1.796e+02 2.047e+02 3.044e+02, threshold=3.593e+02, percent-clipped=0.0 2023-10-11 17:18:36,322 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=777284.6666666666, ans=0.1 2023-10-11 17:19:18,501 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=777424.6666666666, ans=0.2 2023-10-11 17:19:21,695 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=777424.6666666666, ans=0.1 2023-10-11 17:19:40,514 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=777518.0, ans=0.07 2023-10-11 17:19:52,225 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=777564.6666666666, ans=0.1 2023-10-11 17:19:55,863 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=777564.6666666666, ans=0.0 2023-10-11 17:19:59,569 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.39 vs. limit=15.0 2023-10-11 17:20:19,444 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.13 vs. 
limit=15.0 2023-10-11 17:20:23,391 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=777658.0, ans=0.125 2023-10-11 17:20:28,750 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=777704.6666666666, ans=0.125 2023-10-11 17:20:28,846 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=777704.6666666666, ans=0.1 2023-10-11 17:20:33,413 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=777704.6666666666, ans=0.1 2023-10-11 17:20:34,510 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=777704.6666666666, ans=0.125 2023-10-11 17:20:39,516 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.344e+02 1.609e+02 1.790e+02 2.044e+02 2.889e+02, threshold=3.579e+02, percent-clipped=0.0 2023-10-11 17:20:39,980 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=777751.3333333334, ans=0.125 2023-10-11 17:20:56,644 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=777798.0, ans=0.0 2023-10-11 17:21:04,357 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.64 vs. limit=12.0 2023-10-11 17:21:05,292 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.74 vs. limit=6.0 2023-10-11 17:21:06,140 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=777844.6666666666, ans=0.0 2023-10-11 17:21:13,482 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=777891.3333333334, ans=0.125 2023-10-11 17:21:39,603 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.91 vs. 
limit=15.0 2023-10-11 17:21:41,837 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=777984.6666666666, ans=0.0 2023-10-11 17:21:41,872 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=777984.6666666666, ans=0.0 2023-10-11 17:21:46,069 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=777984.6666666666, ans=0.1 2023-10-11 17:21:59,114 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=778031.3333333334, ans=0.0 2023-10-11 17:22:28,476 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=778124.6666666666, ans=0.125 2023-10-11 17:22:32,883 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=778171.3333333334, ans=0.125 2023-10-11 17:22:34,692 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=778171.3333333334, ans=0.0 2023-10-11 17:22:44,540 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.426e+02 1.725e+02 1.875e+02 2.079e+02 2.847e+02, threshold=3.750e+02, percent-clipped=0.0 2023-10-11 17:22:56,360 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=778218.0, ans=0.125 2023-10-11 17:23:13,741 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=778311.3333333334, ans=0.1 2023-10-11 17:23:41,578 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=778404.6666666666, ans=0.1 2023-10-11 17:23:53,685 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.52 vs. limit=15.0 2023-10-11 17:23:54,302 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=778451.3333333334, ans=0.0 2023-10-11 17:24:03,130 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=778498.0, ans=0.125 2023-10-11 17:24:04,008 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=778498.0, ans=0.125 2023-10-11 17:24:17,877 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=778544.6666666666, ans=0.125 2023-10-11 17:24:17,880 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=778544.6666666666, ans=0.125 2023-10-11 17:24:42,882 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.346e+02 1.677e+02 1.775e+02 1.937e+02 2.601e+02, threshold=3.551e+02, percent-clipped=0.0 2023-10-11 17:24:42,914 INFO [train.py:1031] (3/4) Epoch 13, batch 3000, loss[loss=0.1896, simple_loss=0.2819, pruned_loss=0.04865, over 16838.00 frames. ], tot_loss[loss=0.1996, simple_loss=0.2889, pruned_loss=0.05515, over 25515205.28 frames. 
], batch size: 188, lr: 2.75e-03, grad_scale: 16.0 2023-10-11 17:25:04,581 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=778778.0, ans=0.04949747468305833 2023-10-11 17:25:09,542 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=778778.0, ans=0.2 2023-10-11 17:25:19,236 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.37 vs. limit=15.0 2023-10-11 17:25:29,110 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=778871.3333333334, ans=0.125 2023-10-11 17:25:48,779 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.21 vs. limit=15.0 2023-10-11 17:26:00,826 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.81 vs. limit=15.0 2023-10-11 17:26:09,442 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=779011.3333333334, ans=0.125 2023-10-11 17:26:17,335 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.24 vs. limit=15.0 2023-10-11 17:26:32,659 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=779104.6666666666, ans=0.125 2023-10-11 17:26:42,055 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.413e+02 1.738e+02 1.954e+02 2.228e+02 3.012e+02, threshold=3.909e+02, percent-clipped=0.0 2023-10-11 17:27:20,313 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=779291.3333333334, ans=0.1 2023-10-11 17:27:24,039 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=779291.3333333334, ans=0.125 2023-10-11 17:27:32,977 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=779338.0, ans=0.1 2023-10-11 17:27:38,165 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=779338.0, ans=0.0 2023-10-11 17:27:52,065 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=779431.3333333334, ans=0.125 2023-10-11 17:27:53,083 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=779431.3333333334, ans=0.95 2023-10-11 17:28:00,654 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=779431.3333333334, ans=0.125 2023-10-11 17:28:16,294 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=779478.0, ans=0.0 2023-10-11 17:28:40,619 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.339e+02 1.642e+02 1.814e+02 2.078e+02 2.832e+02, threshold=3.627e+02, percent-clipped=0.0 2023-10-11 17:28:41,901 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=779618.0, ans=0.1 2023-10-11 17:28:48,502 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=779618.0, ans=0.1 2023-10-11 17:28:51,646 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=779664.6666666666, ans=0.1 2023-10-11 17:28:53,276 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.47 vs. limit=15.0 2023-10-11 17:28:55,933 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=779664.6666666666, ans=0.125 2023-10-11 17:29:00,377 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=779664.6666666666, ans=0.125 2023-10-11 17:29:16,998 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 17:29:45,157 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=779851.3333333334, ans=0.0 2023-10-11 17:29:59,345 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=779898.0, ans=0.2 2023-10-11 17:30:22,636 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=779944.6666666666, ans=0.125 2023-10-11 17:30:34,308 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=779991.3333333334, ans=0.0 2023-10-11 17:30:49,208 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.301e+02 1.683e+02 1.870e+02 2.036e+02 2.917e+02, threshold=3.741e+02, percent-clipped=0.0 2023-10-11 17:31:01,783 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=780131.3333333334, ans=0.0 2023-10-11 17:31:02,665 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_na.min_abs, batch_count=780131.3333333334, ans=0.02 2023-10-11 17:31:16,889 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=780178.0, ans=0.0 2023-10-11 17:31:18,775 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=780178.0, ans=0.2 2023-10-11 17:31:31,364 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 17:31:34,511 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=780271.3333333334, ans=0.2 2023-10-11 17:31:48,417 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=780318.0, ans=0.125 2023-10-11 17:31:54,512 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=780364.6666666666, ans=0.07 2023-10-11 17:31:54,572 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=780364.6666666666, ans=0.07 2023-10-11 17:31:54,596 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=780364.6666666666, ans=0.07 2023-10-11 17:32:14,620 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=780411.3333333334, ans=0.09899494936611666 2023-10-11 17:32:30,782 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.75 vs. limit=15.0 2023-10-11 17:32:44,812 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=780551.3333333334, ans=0.035 2023-10-11 17:32:45,683 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.399e+02 1.734e+02 1.884e+02 2.059e+02 3.236e+02, threshold=3.768e+02, percent-clipped=0.0 2023-10-11 17:32:45,972 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=780551.3333333334, ans=0.1 2023-10-11 17:32:49,101 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=780551.3333333334, ans=0.1 2023-10-11 17:32:56,192 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=780598.0, ans=0.05 2023-10-11 17:33:29,141 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 17:33:44,135 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=780784.6666666666, ans=0.2 2023-10-11 17:33:49,516 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.25 vs. limit=15.0 2023-10-11 17:34:04,587 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=780878.0, ans=0.2 2023-10-11 17:34:21,162 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=780924.6666666666, ans=0.2 2023-10-11 17:34:31,878 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=780971.3333333334, ans=0.125 2023-10-11 17:34:36,141 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=780971.3333333334, ans=0.125 2023-10-11 17:34:38,126 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=780971.3333333334, ans=0.0 2023-10-11 17:34:39,388 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.66 vs. limit=15.0 2023-10-11 17:34:41,073 INFO [train.py:1031] (3/4) Epoch 13, batch 3500, loss[loss=0.2051, simple_loss=0.2996, pruned_loss=0.05531, over 16808.00 frames. ], tot_loss[loss=0.1999, simple_loss=0.2889, pruned_loss=0.05548, over 27096151.80 frames. 
], batch size: 175, lr: 2.74e-03, grad_scale: 16.0 2023-10-11 17:34:42,036 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.389e+02 1.705e+02 1.908e+02 2.146e+02 2.814e+02, threshold=3.816e+02, percent-clipped=0.0 2023-10-11 17:34:45,985 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 17:34:50,320 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.67 vs. limit=15.0 2023-10-11 17:35:00,641 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=781064.6666666666, ans=0.05 2023-10-11 17:35:00,678 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=781064.6666666666, ans=0.0 2023-10-11 17:35:30,134 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.40 vs. limit=12.0 2023-10-11 17:36:11,384 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=781344.6666666666, ans=0.125 2023-10-11 17:36:27,749 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.29 vs. limit=15.0 2023-10-11 17:36:36,048 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.46 vs. limit=6.0 2023-10-11 17:36:46,047 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.417e+02 1.731e+02 1.931e+02 2.128e+02 3.547e+02, threshold=3.862e+02, percent-clipped=0.0 2023-10-11 17:36:51,548 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=781484.6666666666, ans=0.0 2023-10-11 17:36:53,533 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=781484.6666666666, ans=0.1 2023-10-11 17:36:54,693 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=781484.6666666666, ans=0.0 2023-10-11 17:37:24,154 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=781624.6666666666, ans=0.0 2023-10-11 17:37:27,913 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=781624.6666666666, ans=0.125 2023-10-11 17:38:08,189 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=781811.3333333334, ans=0.1 2023-10-11 17:38:11,036 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=781811.3333333334, ans=0.0 2023-10-11 17:38:11,349 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.14 vs. 
limit=15.0 2023-10-11 17:38:40,524 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.364e+02 1.666e+02 1.787e+02 1.979e+02 3.025e+02, threshold=3.575e+02, percent-clipped=0.0 2023-10-11 17:39:10,569 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=782044.6666666666, ans=0.125 2023-10-11 17:40:11,337 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=782278.0, ans=0.125 2023-10-11 17:40:21,009 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=782324.6666666666, ans=0.0 2023-10-11 17:40:25,239 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=782324.6666666666, ans=0.0 2023-10-11 17:40:33,783 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=782371.3333333334, ans=0.125 2023-10-11 17:40:39,867 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=782371.3333333334, ans=0.1 2023-10-11 17:40:43,449 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.416e+02 1.700e+02 1.874e+02 2.110e+02 2.990e+02, threshold=3.747e+02, percent-clipped=0.0 2023-10-11 17:41:01,756 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=782464.6666666666, ans=0.0 2023-10-11 17:41:10,384 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.64 vs. limit=15.0 2023-10-11 17:41:25,799 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=782558.0, ans=0.1 2023-10-11 17:41:51,527 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=782698.0, ans=0.2 2023-10-11 17:41:51,550 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=782698.0, ans=0.125 2023-10-11 17:41:53,453 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-11 17:41:56,282 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=782698.0, ans=0.125 2023-10-11 17:42:11,424 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=782791.3333333334, ans=0.125 2023-10-11 17:42:15,314 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=782791.3333333334, ans=0.0 2023-10-11 17:42:25,290 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=782838.0, ans=0.0 2023-10-11 17:42:32,222 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.439e+02 1.656e+02 1.815e+02 2.082e+02 3.300e+02, threshold=3.629e+02, percent-clipped=0.0 2023-10-11 17:42:34,412 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=782884.6666666666, ans=0.125 2023-10-11 17:42:42,713 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, 
num_groups=4, num_channels=128, metric=1.68 vs. limit=6.0 2023-10-11 17:42:59,669 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=782978.0, ans=0.125 2023-10-11 17:43:03,181 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=783024.6666666666, ans=0.125 2023-10-11 17:43:25,559 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.13 vs. limit=15.0 2023-10-11 17:43:34,704 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=783164.6666666666, ans=0.0 2023-10-11 17:43:34,784 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=783164.6666666666, ans=0.0 2023-10-11 17:43:34,832 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=783164.6666666666, ans=0.0 2023-10-11 17:43:41,971 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=783164.6666666666, ans=0.1 2023-10-11 17:43:44,477 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=783211.3333333334, ans=0.125 2023-10-11 17:43:53,864 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=783258.0, ans=0.1 2023-10-11 17:43:54,761 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=783258.0, ans=0.0 2023-10-11 17:43:57,674 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=783258.0, ans=0.1 2023-10-11 17:44:04,655 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=783304.6666666666, ans=0.125 2023-10-11 17:44:09,412 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.79 vs. limit=15.0 2023-10-11 17:44:16,774 INFO [train.py:1031] (3/4) Epoch 13, batch 4000, loss[loss=0.213, simple_loss=0.3119, pruned_loss=0.05706, over 16694.00 frames. ], tot_loss[loss=0.2003, simple_loss=0.2889, pruned_loss=0.05591, over 28332379.56 frames. ], batch size: 202, lr: 2.74e-03, grad_scale: 32.0 2023-10-11 17:44:18,923 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.384e+02 1.745e+02 1.949e+02 2.315e+02 3.708e+02, threshold=3.898e+02, percent-clipped=2.0 2023-10-11 17:45:04,389 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=783538.0, ans=0.0 2023-10-11 17:45:06,641 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.55 vs. limit=22.5 2023-10-11 17:45:27,948 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.20 vs. 
limit=15.0 2023-10-11 17:45:32,527 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=783631.3333333334, ans=0.0 2023-10-11 17:45:41,191 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=783678.0, ans=0.125 2023-10-11 17:45:56,769 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=783771.3333333334, ans=0.015 2023-10-11 17:45:57,820 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=783771.3333333334, ans=0.0 2023-10-11 17:46:07,068 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=783818.0, ans=0.125 2023-10-11 17:46:09,536 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.483e+02 1.734e+02 1.882e+02 2.088e+02 2.775e+02, threshold=3.763e+02, percent-clipped=0.0 2023-10-11 17:46:15,427 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=783818.0, ans=0.95 2023-10-11 17:46:17,611 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.67 vs. limit=15.0 2023-10-11 17:46:18,263 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=783864.6666666666, ans=0.0 2023-10-11 17:46:18,275 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=783864.6666666666, ans=0.125 2023-10-11 17:46:53,244 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=783958.0, ans=0.1 2023-10-11 17:47:05,472 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=784004.6666666666, ans=0.035 2023-10-11 17:47:12,475 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.21 vs. limit=22.5 2023-10-11 17:47:15,977 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=784051.3333333334, ans=0.0 2023-10-11 17:47:36,252 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=784144.6666666666, ans=0.125 2023-10-11 17:47:40,550 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=784144.6666666666, ans=0.125 2023-10-11 17:48:08,573 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=784238.0, ans=0.0 2023-10-11 17:48:18,933 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.67 vs. limit=22.5 2023-10-11 17:48:19,004 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.69 vs. 
limit=15.0 2023-10-11 17:48:20,332 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.349e+02 1.649e+02 1.822e+02 2.079e+02 2.698e+02, threshold=3.643e+02, percent-clipped=0.0 2023-10-11 17:48:24,476 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.98 vs. limit=15.0 2023-10-11 17:49:16,826 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.33 vs. limit=15.0 2023-10-11 17:49:28,310 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=784564.6666666666, ans=0.125 2023-10-11 17:49:31,857 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=784564.6666666666, ans=0.125 2023-10-11 17:49:40,079 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=784611.3333333334, ans=0.1 2023-10-11 17:49:43,881 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=784658.0, ans=0.125 2023-10-11 17:50:07,155 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=784751.3333333334, ans=0.0 2023-10-11 17:50:08,732 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.512e+02 1.789e+02 2.022e+02 2.406e+02 3.663e+02, threshold=4.044e+02, percent-clipped=1.0 2023-10-11 17:50:24,377 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=784798.0, ans=0.125 2023-10-11 17:50:30,286 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.95 vs. 
limit=22.5 2023-10-11 17:50:55,578 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=784938.0, ans=0.1 2023-10-11 17:50:58,764 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=784938.0, ans=0.1 2023-10-11 17:51:32,559 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=785078.0, ans=0.125 2023-10-11 17:52:02,537 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.459e+02 1.715e+02 1.902e+02 2.068e+02 2.692e+02, threshold=3.805e+02, percent-clipped=0.0 2023-10-11 17:52:06,796 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=785218.0, ans=0.0 2023-10-11 17:52:37,341 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=785358.0, ans=0.125 2023-10-11 17:53:03,589 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=785404.6666666666, ans=0.1 2023-10-11 17:53:18,466 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=785498.0, ans=0.1 2023-10-11 17:53:26,016 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=785498.0, ans=0.1 2023-10-11 17:53:47,434 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=785591.3333333334, ans=0.125 2023-10-11 17:54:01,854 INFO [train.py:1031] (3/4) Epoch 13, batch 4500, loss[loss=0.1584, simple_loss=0.2551, pruned_loss=0.03082, over 16883.00 frames. ], tot_loss[loss=0.2001, simple_loss=0.2891, pruned_loss=0.05552, over 29350082.85 frames. ], batch size: 93, lr: 2.74e-03, grad_scale: 32.0 2023-10-11 17:54:05,381 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.334e+02 1.714e+02 1.895e+02 2.126e+02 2.850e+02, threshold=3.789e+02, percent-clipped=0.0 2023-10-11 17:54:37,488 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=785824.6666666666, ans=0.0 2023-10-11 17:54:42,674 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=785871.3333333334, ans=0.5 2023-10-11 17:54:48,343 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer_ff3.min_abs, batch_count=785871.3333333334, ans=0.2 2023-10-11 17:55:14,498 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.73 vs. limit=15.0 2023-10-11 17:55:28,485 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.38 vs. 
limit=15.0 2023-10-11 17:55:47,360 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.365e+02 1.662e+02 1.817e+02 2.086e+02 2.666e+02, threshold=3.633e+02, percent-clipped=0.0 2023-10-11 17:55:59,066 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=786198.0, ans=0.05 2023-10-11 17:56:03,094 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=786198.0, ans=0.125 2023-10-11 17:56:08,190 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.whiten.whitening_limit, batch_count=786244.6666666666, ans=15.0 2023-10-11 17:56:11,429 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=786244.6666666666, ans=0.05 2023-10-11 17:56:11,741 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.77 vs. limit=15.0 2023-10-11 17:56:12,423 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=786244.6666666666, ans=0.0 2023-10-11 17:56:18,576 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.69 vs. limit=10.0 2023-10-11 17:56:30,295 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.32 vs. limit=15.0 2023-10-11 17:56:33,470 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=786338.0, ans=0.2 2023-10-11 17:56:34,474 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=786338.0, ans=0.0 2023-10-11 17:56:35,653 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=3.99 vs. limit=12.0 2023-10-11 17:57:19,813 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.39 vs. 
limit=15.0 2023-10-11 17:57:35,004 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.419e+02 1.724e+02 1.862e+02 2.083e+02 2.841e+02, threshold=3.725e+02, percent-clipped=0.0 2023-10-11 17:57:45,304 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=786664.6666666666, ans=0.1 2023-10-11 17:58:09,518 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=786758.0, ans=0.125 2023-10-11 17:58:12,946 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=786804.6666666666, ans=0.125 2023-10-11 17:58:32,049 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=786851.3333333334, ans=0.125 2023-10-11 17:58:39,720 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=786898.0, ans=0.0 2023-10-11 17:58:50,718 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=786944.6666666666, ans=0.125 2023-10-11 17:58:56,556 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.05 vs. limit=22.5 2023-10-11 17:59:00,632 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=786991.3333333334, ans=0.0 2023-10-11 17:59:23,347 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.415e+02 1.740e+02 1.867e+02 2.029e+02 2.533e+02, threshold=3.734e+02, percent-clipped=0.0 2023-10-11 17:59:37,232 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=787131.3333333334, ans=0.125 2023-10-11 17:59:49,131 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=787178.0, ans=0.0 2023-10-11 17:59:52,733 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.89 vs. limit=15.0 2023-10-11 18:00:01,806 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=787224.6666666666, ans=0.2 2023-10-11 18:00:21,378 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=787271.3333333334, ans=0.125 2023-10-11 18:00:21,388 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=787271.3333333334, ans=0.0 2023-10-11 18:00:30,732 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=787318.0, ans=0.125 2023-10-11 18:00:38,806 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=787364.6666666666, ans=0.125 2023-10-11 18:00:38,929 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.64 vs. 
limit=22.5 2023-10-11 18:00:48,714 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=787411.3333333334, ans=0.125 2023-10-11 18:00:54,105 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=787458.0, ans=0.125 2023-10-11 18:01:09,663 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=787504.6666666666, ans=0.125 2023-10-11 18:01:19,019 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 18:01:23,610 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.348e+02 1.645e+02 1.780e+02 1.964e+02 2.436e+02, threshold=3.560e+02, percent-clipped=0.0 2023-10-11 18:01:36,511 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.05 vs. limit=12.0 2023-10-11 18:02:17,496 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=787738.0, ans=0.2 2023-10-11 18:02:28,778 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=787784.6666666666, ans=0.125 2023-10-11 18:02:58,635 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=787924.6666666666, ans=0.1 2023-10-11 18:03:16,755 INFO [train.py:1031] (3/4) Epoch 13, batch 5000, loss[loss=0.1938, simple_loss=0.2841, pruned_loss=0.0518, over 16855.00 frames. ], tot_loss[loss=0.2, simple_loss=0.2888, pruned_loss=0.05556, over 30127861.30 frames. ], batch size: 72, lr: 2.73e-03, grad_scale: 32.0 2023-10-11 18:03:19,530 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.453e+02 1.736e+02 1.980e+02 2.174e+02 3.063e+02, threshold=3.961e+02, percent-clipped=0.0 2023-10-11 18:03:26,134 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=788018.0, ans=0.0 2023-10-11 18:03:30,638 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=788064.6666666666, ans=0.0 2023-10-11 18:03:38,707 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.93 vs. limit=15.0 2023-10-11 18:03:42,573 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.76 vs. limit=6.0 2023-10-11 18:03:45,294 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=788111.3333333334, ans=0.125 2023-10-11 18:04:17,504 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=788251.3333333334, ans=0.2 2023-10-11 18:04:25,244 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=788298.0, ans=0.0 2023-10-11 18:04:46,046 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=788391.3333333334, ans=0.2 2023-10-11 18:05:01,423 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.38 vs. 
limit=12.0 2023-10-11 18:05:04,982 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=788438.0, ans=0.0 2023-10-11 18:05:14,880 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.58 vs. limit=15.0 2023-10-11 18:05:15,316 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.381e+02 1.725e+02 1.868e+02 2.180e+02 2.976e+02, threshold=3.736e+02, percent-clipped=0.0 2023-10-11 18:05:17,074 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.64 vs. limit=8.0 2023-10-11 18:05:20,392 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=788484.6666666666, ans=0.125 2023-10-11 18:05:21,085 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=788484.6666666666, ans=0.125 2023-10-11 18:05:23,272 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=788531.3333333334, ans=0.125 2023-10-11 18:05:25,814 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=788531.3333333334, ans=0.125 2023-10-11 18:05:27,059 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.71 vs. limit=15.0 2023-10-11 18:05:34,495 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=788578.0, ans=0.0 2023-10-11 18:05:35,297 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=788578.0, ans=0.0 2023-10-11 18:05:42,924 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.67 vs. limit=15.0 2023-10-11 18:05:47,387 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 18:05:50,471 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=788624.6666666666, ans=0.125 2023-10-11 18:05:55,437 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=788671.3333333334, ans=0.0 2023-10-11 18:06:18,441 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.94 vs. limit=15.0 2023-10-11 18:06:37,244 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=9.90 vs. limit=12.0 2023-10-11 18:06:45,009 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.51 vs. limit=22.5 2023-10-11 18:06:59,460 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.15 vs. 
limit=22.5 2023-10-11 18:07:06,195 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.478e+02 1.676e+02 1.804e+02 1.948e+02 2.868e+02, threshold=3.608e+02, percent-clipped=0.0 2023-10-11 18:07:09,001 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=788951.3333333334, ans=0.1 2023-10-11 18:07:17,346 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=788998.0, ans=0.125 2023-10-11 18:07:24,324 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=789044.6666666666, ans=0.1 2023-10-11 18:07:34,105 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=789091.3333333334, ans=0.5 2023-10-11 18:07:34,124 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=789091.3333333334, ans=0.125 2023-10-11 18:07:36,454 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=789091.3333333334, ans=0.0 2023-10-11 18:07:48,766 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=789138.0, ans=0.0 2023-10-11 18:08:05,395 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=789184.6666666666, ans=0.125 2023-10-11 18:08:17,043 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=789231.3333333334, ans=0.125 2023-10-11 18:08:25,733 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 18:08:27,600 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 18:08:33,367 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=789324.6666666666, ans=0.125 2023-10-11 18:08:38,968 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=789324.6666666666, ans=0.125 2023-10-11 18:08:44,938 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.04 vs. limit=15.0 2023-10-11 18:08:55,301 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=789418.0, ans=0.0 2023-10-11 18:08:59,295 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.430e+02 1.705e+02 1.859e+02 2.067e+02 3.041e+02, threshold=3.719e+02, percent-clipped=0.0 2023-10-11 18:09:52,104 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.76 vs. limit=22.5 2023-10-11 18:10:13,591 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=789698.0, ans=0.2 2023-10-11 18:10:18,418 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.00 vs. 
limit=22.5 2023-10-11 18:10:22,804 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.04 vs. limit=15.0 2023-10-11 18:10:30,570 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten.whitening_limit, batch_count=789791.3333333334, ans=15.0 2023-10-11 18:10:33,616 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=789791.3333333334, ans=0.05 2023-10-11 18:10:52,116 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.252e+02 1.588e+02 1.755e+02 2.001e+02 3.148e+02, threshold=3.511e+02, percent-clipped=0.0 2023-10-11 18:10:57,708 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=789884.6666666666, ans=0.125 2023-10-11 18:11:02,200 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=789931.3333333334, ans=0.0 2023-10-11 18:11:10,826 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=789978.0, ans=0.0 2023-10-11 18:11:17,836 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=789978.0, ans=0.2 2023-10-11 18:11:28,983 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.35 vs. limit=15.0 2023-10-11 18:11:48,655 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=790118.0, ans=0.1 2023-10-11 18:11:58,988 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=790164.6666666666, ans=0.125 2023-10-11 18:12:15,261 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=790258.0, ans=0.1 2023-10-11 18:12:17,596 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=790258.0, ans=0.2 2023-10-11 18:12:34,410 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=790304.6666666666, ans=0.125 2023-10-11 18:12:35,874 INFO [train.py:1031] (3/4) Epoch 13, batch 5500, loss[loss=0.187, simple_loss=0.2823, pruned_loss=0.04579, over 16864.00 frames. ], tot_loss[loss=0.1998, simple_loss=0.2886, pruned_loss=0.05545, over 30723025.58 frames. ], batch size: 138, lr: 2.73e-03, grad_scale: 16.0 2023-10-11 18:12:39,875 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.472e+02 1.709e+02 1.875e+02 2.088e+02 3.135e+02, threshold=3.751e+02, percent-clipped=0.0 2023-10-11 18:13:04,418 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=790444.6666666666, ans=0.125 2023-10-11 18:13:32,590 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.57 vs. 
limit=15.0 2023-10-11 18:13:54,842 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=790678.0, ans=0.0 2023-10-11 18:13:57,345 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=790678.0, ans=0.125 2023-10-11 18:13:59,105 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=790678.0, ans=0.1 2023-10-11 18:14:00,192 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=790678.0, ans=0.125 2023-10-11 18:14:02,133 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=790678.0, ans=0.1 2023-10-11 18:14:02,221 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=790678.0, ans=0.125 2023-10-11 18:14:24,680 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.94 vs. limit=15.0 2023-10-11 18:14:28,776 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.379e+02 1.677e+02 1.810e+02 2.036e+02 3.466e+02, threshold=3.621e+02, percent-clipped=0.0 2023-10-11 18:14:39,713 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=790864.6666666666, ans=0.0 2023-10-11 18:14:39,789 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=790864.6666666666, ans=0.0 2023-10-11 18:14:50,198 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=790911.3333333334, ans=0.1 2023-10-11 18:14:54,352 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=790911.3333333334, ans=0.125 2023-10-11 18:15:05,060 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=790958.0, ans=0.125 2023-10-11 18:15:18,072 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=791004.6666666666, ans=0.125 2023-10-11 18:15:19,555 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=791004.6666666666, ans=0.2 2023-10-11 18:16:05,948 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=791191.3333333334, ans=0.0 2023-10-11 18:16:10,044 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=791238.0, ans=0.125 2023-10-11 18:16:25,305 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.389e+02 1.740e+02 1.953e+02 2.271e+02 2.998e+02, threshold=3.907e+02, percent-clipped=0.0 2023-10-11 18:16:35,158 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.51 vs. limit=15.0 2023-10-11 18:16:46,039 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.86 vs. 
limit=15.0 2023-10-11 18:16:46,138 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.40 vs. limit=10.0 2023-10-11 18:17:06,712 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=791471.3333333334, ans=0.0 2023-10-11 18:17:13,002 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=791471.3333333334, ans=0.125 2023-10-11 18:17:15,089 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=791518.0, ans=0.0 2023-10-11 18:17:19,478 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=791518.0, ans=0.0 2023-10-11 18:17:39,043 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.94 vs. limit=15.0 2023-10-11 18:17:50,979 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=791658.0, ans=0.125 2023-10-11 18:17:57,865 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=791658.0, ans=0.1 2023-10-11 18:18:14,234 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.89 vs. limit=22.5 2023-10-11 18:18:17,074 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.395e+02 1.672e+02 1.808e+02 1.993e+02 3.066e+02, threshold=3.617e+02, percent-clipped=0.0 2023-10-11 18:18:20,036 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=791751.3333333334, ans=0.1 2023-10-11 18:18:42,699 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=791844.6666666666, ans=0.0 2023-10-11 18:18:52,297 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=791891.3333333334, ans=0.0 2023-10-11 18:19:15,528 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=791984.6666666666, ans=0.2 2023-10-11 18:19:19,324 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=791984.6666666666, ans=0.125 2023-10-11 18:19:33,734 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.04 vs. 
limit=10.0 2023-10-11 18:19:44,015 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=792124.6666666666, ans=0.125 2023-10-11 18:19:51,170 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=792124.6666666666, ans=0.2 2023-10-11 18:19:58,233 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=792171.3333333334, ans=0.125 2023-10-11 18:20:11,211 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.445e+02 1.658e+02 1.820e+02 2.036e+02 3.402e+02, threshold=3.641e+02, percent-clipped=0.0 2023-10-11 18:20:15,889 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=792264.6666666666, ans=0.1 2023-10-11 18:20:47,835 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=792358.0, ans=0.1 2023-10-11 18:20:48,897 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=792358.0, ans=0.07 2023-10-11 18:20:53,813 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=792404.6666666666, ans=0.2 2023-10-11 18:21:05,785 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.71 vs. limit=10.0 2023-10-11 18:21:06,496 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=792451.3333333334, ans=0.2 2023-10-11 18:21:30,017 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=792544.6666666666, ans=0.125 2023-10-11 18:21:45,471 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=792591.3333333334, ans=0.125 2023-10-11 18:21:59,066 INFO [train.py:1031] (3/4) Epoch 13, batch 6000, loss[loss=0.2581, simple_loss=0.3209, pruned_loss=0.09763, over 15653.00 frames. ], tot_loss[loss=0.2002, simple_loss=0.289, pruned_loss=0.05572, over 31173900.12 frames. ], batch size: 350, lr: 2.72e-03, grad_scale: 32.0 2023-10-11 18:22:01,600 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=792684.6666666666, ans=0.125 2023-10-11 18:22:04,835 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.332e+02 1.726e+02 1.929e+02 2.173e+02 3.653e+02, threshold=3.859e+02, percent-clipped=1.0 2023-10-11 18:22:05,357 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.61 vs. 
limit=6.0 2023-10-11 18:22:14,645 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=792731.3333333334, ans=0.1 2023-10-11 18:22:21,790 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=792778.0, ans=0.125 2023-10-11 18:22:22,645 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=792778.0, ans=0.0 2023-10-11 18:22:26,028 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=792778.0, ans=0.125 2023-10-11 18:22:28,360 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=792778.0, ans=0.125 2023-10-11 18:22:41,460 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=792824.6666666666, ans=0.1 2023-10-11 18:23:09,563 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.24 vs. limit=22.5 2023-10-11 18:23:11,111 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=792964.6666666666, ans=0.125 2023-10-11 18:23:14,579 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=792964.6666666666, ans=0.125 2023-10-11 18:23:30,928 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=793058.0, ans=0.125 2023-10-11 18:23:40,879 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=793104.6666666666, ans=0.5 2023-10-11 18:23:40,918 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=793104.6666666666, ans=0.125 2023-10-11 18:23:45,601 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.12 vs. limit=15.0 2023-10-11 18:23:54,265 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.447e+02 1.705e+02 1.851e+02 2.019e+02 3.224e+02, threshold=3.701e+02, percent-clipped=0.0 2023-10-11 18:24:00,175 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=793198.0, ans=0.125 2023-10-11 18:24:14,250 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=793244.6666666666, ans=0.125 2023-10-11 18:24:21,387 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=793291.3333333334, ans=0.125 2023-10-11 18:24:27,704 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=793291.3333333334, ans=0.09899494936611666 2023-10-11 18:24:33,651 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=13.62 vs. 
limit=22.5 2023-10-11 18:24:34,134 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=793338.0, ans=10.0 2023-10-11 18:24:36,951 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=793338.0, ans=0.125 2023-10-11 18:24:39,886 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.77 vs. limit=15.0 2023-10-11 18:24:44,128 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=793384.6666666666, ans=0.1 2023-10-11 18:24:47,970 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=793384.6666666666, ans=0.125 2023-10-11 18:24:56,031 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=793431.3333333334, ans=0.125 2023-10-11 18:24:56,117 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=793431.3333333334, ans=0.125 2023-10-11 18:24:57,201 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=793431.3333333334, ans=0.2 2023-10-11 18:25:09,758 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.76 vs. limit=6.0 2023-10-11 18:25:21,366 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=793524.6666666666, ans=0.125 2023-10-11 18:25:43,140 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.388e+02 1.776e+02 1.920e+02 2.180e+02 2.938e+02, threshold=3.840e+02, percent-clipped=0.0 2023-10-11 18:25:50,172 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=793664.6666666666, ans=0.1 2023-10-11 18:26:08,378 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=793711.3333333334, ans=0.0 2023-10-11 18:26:21,794 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=793758.0, ans=0.125 2023-10-11 18:26:24,608 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=793804.6666666666, ans=0.125 2023-10-11 18:26:41,712 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.31 vs. limit=15.0 2023-10-11 18:26:48,263 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=793898.0, ans=0.035 2023-10-11 18:27:14,270 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.10 vs. 
limit=22.5 2023-10-11 18:27:19,630 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=794038.0, ans=0.125 2023-10-11 18:27:25,136 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=794038.0, ans=0.0 2023-10-11 18:27:36,064 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.362e+02 1.730e+02 2.013e+02 2.258e+02 3.283e+02, threshold=4.027e+02, percent-clipped=0.0 2023-10-11 18:27:57,215 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.64 vs. limit=15.0 2023-10-11 18:28:06,537 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=794224.6666666666, ans=0.2 2023-10-11 18:28:06,932 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.60 vs. limit=22.5 2023-10-11 18:28:49,444 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=794364.6666666666, ans=0.125 2023-10-11 18:29:27,812 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=794504.6666666666, ans=0.0 2023-10-11 18:29:37,079 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.406e+02 1.694e+02 1.898e+02 2.109e+02 3.493e+02, threshold=3.797e+02, percent-clipped=0.0 2023-10-11 18:30:14,911 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 18:31:00,546 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=794924.6666666666, ans=0.1 2023-10-11 18:31:12,343 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=794971.3333333334, ans=0.125 2023-10-11 18:31:20,530 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=794971.3333333334, ans=0.2 2023-10-11 18:31:23,199 INFO [train.py:1031] (3/4) Epoch 13, batch 6500, loss[loss=0.2025, simple_loss=0.2687, pruned_loss=0.06817, over 12410.00 frames. ], tot_loss[loss=0.2004, simple_loss=0.2892, pruned_loss=0.05575, over 31506059.61 frames. ], batch size: 440, lr: 2.72e-03, grad_scale: 32.0 2023-10-11 18:31:28,504 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=795018.0, ans=0.125 2023-10-11 18:31:30,465 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.485e+02 1.734e+02 1.911e+02 2.094e+02 2.626e+02, threshold=3.822e+02, percent-clipped=0.0 2023-10-11 18:31:37,182 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.85 vs. limit=22.5 2023-10-11 18:31:46,273 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.31 vs. limit=15.0 2023-10-11 18:32:22,675 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.60 vs. 
limit=6.0 2023-10-11 18:32:25,742 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=795204.6666666666, ans=0.125 2023-10-11 18:32:28,082 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.78 vs. limit=22.5 2023-10-11 18:32:30,199 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.75 vs. limit=15.0 2023-10-11 18:32:32,230 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=795251.3333333334, ans=0.125 2023-10-11 18:32:51,441 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=795298.0, ans=0.125 2023-10-11 18:33:05,817 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.74 vs. limit=22.5 2023-10-11 18:33:09,802 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=795391.3333333334, ans=0.0 2023-10-11 18:33:20,113 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=795438.0, ans=0.125 2023-10-11 18:33:23,582 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=795438.0, ans=0.125 2023-10-11 18:33:25,408 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=795438.0, ans=0.0 2023-10-11 18:33:26,305 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=795484.6666666666, ans=0.0 2023-10-11 18:33:33,384 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.466e+02 1.734e+02 1.871e+02 2.130e+02 2.932e+02, threshold=3.742e+02, percent-clipped=0.0 2023-10-11 18:33:56,396 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=795578.0, ans=0.2 2023-10-11 18:34:01,202 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=795624.6666666666, ans=0.1 2023-10-11 18:34:22,688 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=795718.0, ans=10.0 2023-10-11 18:34:30,787 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=795764.6666666666, ans=0.95 2023-10-11 18:34:43,080 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=795811.3333333334, ans=0.0 2023-10-11 18:34:48,834 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=795811.3333333334, ans=0.0 2023-10-11 18:35:01,513 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=795858.0, ans=0.125 2023-10-11 18:35:02,409 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=795858.0, ans=0.125 2023-10-11 18:35:11,927 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=795904.6666666666, ans=0.035 2023-10-11 18:35:15,214 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=795951.3333333334, ans=0.125 2023-10-11 18:35:20,904 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=795951.3333333334, ans=0.0 2023-10-11 18:35:21,559 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.437e+02 1.684e+02 1.938e+02 2.297e+02 3.586e+02, threshold=3.875e+02, percent-clipped=0.0 2023-10-11 18:35:25,279 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 18:35:32,838 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=795998.0, ans=0.125 2023-10-11 18:35:37,239 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=796044.6666666666, ans=0.0 2023-10-11 18:36:10,796 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=796138.0, ans=0.125 2023-10-11 18:36:30,688 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.29 vs. limit=10.0 2023-10-11 18:36:34,401 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.60 vs. limit=22.5 2023-10-11 18:37:25,010 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.83 vs. limit=15.0 2023-10-11 18:37:27,439 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.218e+02 1.607e+02 1.843e+02 2.150e+02 3.395e+02, threshold=3.685e+02, percent-clipped=0.0 2023-10-11 18:38:03,371 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=796558.0, ans=0.05 2023-10-11 18:38:05,877 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=796558.0, ans=0.125 2023-10-11 18:38:10,683 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=796558.0, ans=0.125 2023-10-11 18:38:11,647 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=796558.0, ans=0.05 2023-10-11 18:38:27,438 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.90 vs. 
limit=15.0 2023-10-11 18:38:44,443 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=796698.0, ans=0.2 2023-10-11 18:38:50,929 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=796744.6666666666, ans=0.1 2023-10-11 18:38:54,971 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=796744.6666666666, ans=0.0 2023-10-11 18:39:05,246 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=796791.3333333334, ans=0.0 2023-10-11 18:39:28,224 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.392e+02 1.621e+02 1.793e+02 2.030e+02 2.953e+02, threshold=3.585e+02, percent-clipped=0.0 2023-10-11 18:39:30,309 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.35 vs. limit=15.0 2023-10-11 18:39:47,095 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=796978.0, ans=0.0 2023-10-11 18:39:52,803 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=796978.0, ans=0.2 2023-10-11 18:40:09,146 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.81 vs. limit=22.5 2023-10-11 18:40:09,734 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=797071.3333333334, ans=0.125 2023-10-11 18:40:12,578 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.61 vs. limit=15.0 2023-10-11 18:40:19,723 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=797118.0, ans=0.125 2023-10-11 18:40:31,749 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=797164.6666666666, ans=0.2 2023-10-11 18:40:40,042 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=797211.3333333334, ans=0.125 2023-10-11 18:40:40,430 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.70 vs. limit=22.5 2023-10-11 18:40:44,789 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.18 vs. limit=10.0 2023-10-11 18:40:49,721 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=797258.0, ans=0.0 2023-10-11 18:40:50,858 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.42 vs. limit=15.0 2023-10-11 18:41:00,199 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.51 vs. 
limit=10.0 2023-10-11 18:41:07,255 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=797304.6666666666, ans=0.0 2023-10-11 18:41:08,962 INFO [train.py:1031] (3/4) Epoch 13, batch 7000, loss[loss=0.1951, simple_loss=0.2889, pruned_loss=0.05065, over 16913.00 frames. ], tot_loss[loss=0.2005, simple_loss=0.2897, pruned_loss=0.05568, over 31804107.40 frames. ], batch size: 93, lr: 2.72e-03, grad_scale: 32.0 2023-10-11 18:41:15,188 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.471e+02 1.765e+02 1.918e+02 2.132e+02 3.036e+02, threshold=3.836e+02, percent-clipped=0.0 2023-10-11 18:41:19,242 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=797398.0, ans=0.125 2023-10-11 18:41:19,274 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=797398.0, ans=0.125 2023-10-11 18:41:28,344 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=797398.0, ans=0.2 2023-10-11 18:41:43,769 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=797444.6666666666, ans=0.125 2023-10-11 18:41:48,283 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=16.10 vs. limit=22.5 2023-10-11 18:41:53,518 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=797491.3333333334, ans=0.1 2023-10-11 18:41:58,944 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.96 vs. 
limit=12.0 2023-10-11 18:42:16,801 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=797584.6666666666, ans=0.0 2023-10-11 18:42:29,150 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=797678.0, ans=0.125 2023-10-11 18:42:32,722 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=797678.0, ans=0.2 2023-10-11 18:42:36,336 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=797678.0, ans=0.0 2023-10-11 18:42:41,233 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=797724.6666666666, ans=0.2 2023-10-11 18:42:42,059 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=797724.6666666666, ans=0.125 2023-10-11 18:42:42,193 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=797724.6666666666, ans=0.125 2023-10-11 18:42:44,084 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=2.324e-02 2023-10-11 18:42:47,482 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=797724.6666666666, ans=0.0 2023-10-11 18:42:58,976 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=797818.0, ans=0.0 2023-10-11 18:43:05,750 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.503e+02 1.737e+02 1.824e+02 2.090e+02 3.041e+02, threshold=3.649e+02, percent-clipped=0.0 2023-10-11 18:43:13,855 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.22 vs. limit=15.0 2023-10-11 18:43:16,887 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.94 vs. limit=12.0 2023-10-11 18:43:30,161 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=797911.3333333334, ans=0.125 2023-10-11 18:44:31,160 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=798191.3333333334, ans=0.1 2023-10-11 18:44:34,823 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=798191.3333333334, ans=0.2 2023-10-11 18:44:48,165 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.80 vs. 
limit=15.0 2023-10-11 18:44:59,302 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.325e+02 1.711e+02 1.880e+02 2.094e+02 3.305e+02, threshold=3.760e+02, percent-clipped=0.0 2023-10-11 18:45:39,842 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=798424.6666666666, ans=0.1 2023-10-11 18:46:02,493 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=798518.0, ans=0.125 2023-10-11 18:46:19,796 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=798564.6666666666, ans=0.2 2023-10-11 18:46:27,267 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=798611.3333333334, ans=0.125 2023-10-11 18:46:33,104 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=798611.3333333334, ans=10.0 2023-10-11 18:46:36,062 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.07 vs. limit=22.5 2023-10-11 18:46:38,264 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.83 vs. limit=10.0 2023-10-11 18:46:42,047 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=798658.0, ans=0.0 2023-10-11 18:46:47,722 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=798704.6666666666, ans=0.2 2023-10-11 18:46:53,510 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=798704.6666666666, ans=0.1 2023-10-11 18:46:53,530 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=798704.6666666666, ans=0.125 2023-10-11 18:47:02,904 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.353e+02 1.718e+02 1.864e+02 2.045e+02 2.728e+02, threshold=3.727e+02, percent-clipped=0.0 2023-10-11 18:47:05,100 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=798751.3333333334, ans=0.125 2023-10-11 18:47:07,020 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=798798.0, ans=0.125 2023-10-11 18:47:16,887 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=798798.0, ans=0.125 2023-10-11 18:47:56,233 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=798984.6666666666, ans=0.05 2023-10-11 18:47:57,419 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=798984.6666666666, ans=0.125 2023-10-11 18:48:17,616 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=799078.0, ans=0.1 2023-10-11 18:48:36,086 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.00 vs. 
limit=15.0 2023-10-11 18:48:44,754 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=799171.3333333334, ans=0.125 2023-10-11 18:48:56,619 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.324e+02 1.665e+02 1.833e+02 2.031e+02 2.802e+02, threshold=3.666e+02, percent-clipped=0.0 2023-10-11 18:48:58,156 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.58 vs. limit=15.0 2023-10-11 18:49:07,325 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.02 vs. limit=10.0 2023-10-11 18:49:12,496 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=799311.3333333334, ans=0.09899494936611666 2023-10-11 18:49:19,959 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=799311.3333333334, ans=0.0 2023-10-11 18:49:43,132 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=799451.3333333334, ans=0.125 2023-10-11 18:49:44,850 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=799451.3333333334, ans=0.125 2023-10-11 18:49:52,092 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=799451.3333333334, ans=0.125 2023-10-11 18:49:57,604 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=799498.0, ans=0.125 2023-10-11 18:50:08,838 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=799544.6666666666, ans=0.5 2023-10-11 18:50:12,101 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.96 vs. limit=15.0 2023-10-11 18:50:37,135 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.63 vs. limit=6.0 2023-10-11 18:50:39,619 INFO [train.py:1031] (3/4) Epoch 13, batch 7500, loss[loss=0.2107, simple_loss=0.3017, pruned_loss=0.05978, over 16889.00 frames. ], tot_loss[loss=0.2006, simple_loss=0.2897, pruned_loss=0.05573, over 32035929.62 frames. 
], batch size: 130, lr: 2.71e-03, grad_scale: 32.0 2023-10-11 18:50:45,754 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.456e+02 1.793e+02 1.960e+02 2.204e+02 2.928e+02, threshold=3.920e+02, percent-clipped=0.0 2023-10-11 18:50:51,807 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=799731.3333333334, ans=0.2 2023-10-11 18:51:04,796 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=799778.0, ans=0.125 2023-10-11 18:51:06,828 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=799778.0, ans=0.0 2023-10-11 18:51:16,802 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=799824.6666666666, ans=0.1 2023-10-11 18:51:26,273 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=799871.3333333334, ans=0.125 2023-10-11 18:51:29,162 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.29 vs. limit=15.0 2023-10-11 18:51:36,792 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=799918.0, ans=0.2 2023-10-11 18:51:46,797 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=799964.6666666666, ans=0.2 2023-10-11 18:52:07,560 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=800058.0, ans=0.0 2023-10-11 18:52:30,280 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=800151.3333333334, ans=0.125 2023-10-11 18:52:35,218 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.370e+02 1.689e+02 1.840e+02 2.124e+02 2.979e+02, threshold=3.680e+02, percent-clipped=0.0 2023-10-11 18:52:57,139 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=800244.6666666666, ans=0.125 2023-10-11 18:53:14,734 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=800291.3333333334, ans=0.0 2023-10-11 18:54:04,613 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=800478.0, ans=0.0 2023-10-11 18:54:09,367 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.15 vs. limit=12.0 2023-10-11 18:54:17,021 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.99 vs. 
limit=15.0 2023-10-11 18:54:17,497 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=800524.6666666666, ans=0.0 2023-10-11 18:54:39,535 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.314e+02 1.724e+02 1.841e+02 2.058e+02 3.080e+02, threshold=3.682e+02, percent-clipped=0.0 2023-10-11 18:54:45,574 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=800664.6666666666, ans=0.1 2023-10-11 18:54:46,493 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-11 18:54:48,624 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=800664.6666666666, ans=0.1 2023-10-11 18:54:59,484 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=800711.3333333334, ans=0.125 2023-10-11 18:55:02,174 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=800711.3333333334, ans=0.2 2023-10-11 18:55:15,459 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=800804.6666666666, ans=0.1 2023-10-11 18:55:17,036 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=800804.6666666666, ans=0.1 2023-10-11 18:55:25,942 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=800804.6666666666, ans=0.125 2023-10-11 18:55:31,361 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=800851.3333333334, ans=0.125 2023-10-11 18:55:32,355 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=800851.3333333334, ans=0.1 2023-10-11 18:55:33,419 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=800851.3333333334, ans=0.0 2023-10-11 18:55:41,215 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=800898.0, ans=0.2 2023-10-11 18:55:43,804 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=800898.0, ans=0.0 2023-10-11 18:55:49,091 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.73 vs. 
limit=15.0 2023-10-11 18:55:50,149 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn1.whiten.whitening_limit, batch_count=800944.6666666666, ans=22.5 2023-10-11 18:55:59,554 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=800991.3333333334, ans=0.0 2023-10-11 18:56:27,645 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.435e+02 1.691e+02 1.939e+02 2.196e+02 3.394e+02, threshold=3.878e+02, percent-clipped=0.0 2023-10-11 18:56:49,794 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=801178.0, ans=0.125 2023-10-11 18:57:01,583 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=801224.6666666666, ans=0.0 2023-10-11 18:57:07,657 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=801224.6666666666, ans=0.1 2023-10-11 18:57:14,950 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=801271.3333333334, ans=0.0 2023-10-11 18:57:23,899 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=801318.0, ans=0.2 2023-10-11 18:57:42,064 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=801364.6666666666, ans=0.2 2023-10-11 18:57:45,316 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.64 vs. limit=10.0 2023-10-11 18:57:49,279 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.65 vs. limit=15.0 2023-10-11 18:57:50,855 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=801411.3333333334, ans=0.1 2023-10-11 18:57:53,876 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=801458.0, ans=0.2 2023-10-11 18:58:17,993 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=801551.3333333334, ans=0.125 2023-10-11 18:58:27,066 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.351e+02 1.689e+02 1.833e+02 2.023e+02 2.910e+02, threshold=3.665e+02, percent-clipped=0.0 2023-10-11 18:58:30,030 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.37 vs. limit=12.0 2023-10-11 18:58:35,382 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=801598.0, ans=0.0 2023-10-11 18:58:35,680 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.25 vs. 
limit=12.0 2023-10-11 18:59:02,116 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff3.min_abs, batch_count=801691.3333333334, ans=0.2 2023-10-11 18:59:02,126 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=801691.3333333334, ans=0.125 2023-10-11 18:59:04,726 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.92 vs. limit=15.0 2023-10-11 18:59:06,327 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=801738.0, ans=0.04949747468305833 2023-10-11 18:59:33,387 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.48 vs. limit=15.0 2023-10-11 18:59:37,842 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=801831.3333333334, ans=0.0 2023-10-11 18:59:46,306 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=801878.0, ans=0.125 2023-10-11 18:59:50,866 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=801878.0, ans=0.125 2023-10-11 18:59:59,757 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.01 vs. limit=15.0 2023-10-11 19:00:01,884 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=801924.6666666666, ans=15.0 2023-10-11 19:00:07,534 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=801971.3333333334, ans=0.05 2023-10-11 19:00:14,148 INFO [train.py:1031] (3/4) Epoch 13, batch 8000, loss[loss=0.2073, simple_loss=0.2772, pruned_loss=0.06874, over 15663.00 frames. ], tot_loss[loss=0.1996, simple_loss=0.289, pruned_loss=0.0551, over 32212185.55 frames. ], batch size: 350, lr: 2.71e-03, grad_scale: 32.0 2023-10-11 19:00:21,532 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.297e+02 1.650e+02 1.768e+02 2.048e+02 3.506e+02, threshold=3.537e+02, percent-clipped=0.0 2023-10-11 19:00:26,496 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 19:00:28,794 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=802064.6666666666, ans=0.0 2023-10-11 19:00:29,533 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=802064.6666666666, ans=0.2 2023-10-11 19:00:44,969 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=802158.0, ans=0.0 2023-10-11 19:00:51,430 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=802158.0, ans=0.0 2023-10-11 19:01:01,018 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.76 vs. 
limit=22.5 2023-10-11 19:01:01,591 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=802204.6666666666, ans=0.125 2023-10-11 19:01:11,067 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=802251.3333333334, ans=0.125 2023-10-11 19:01:20,523 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=802298.0, ans=0.0 2023-10-11 19:01:36,858 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.60 vs. limit=12.0 2023-10-11 19:01:53,646 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=802438.0, ans=0.1 2023-10-11 19:02:03,669 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=802484.6666666666, ans=0.0 2023-10-11 19:02:04,726 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=802484.6666666666, ans=0.5 2023-10-11 19:02:05,981 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.405e+02 1.652e+02 1.803e+02 2.042e+02 2.927e+02, threshold=3.605e+02, percent-clipped=0.0 2023-10-11 19:02:07,374 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=802484.6666666666, ans=0.2 2023-10-11 19:02:27,022 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=802578.0, ans=0.1 2023-10-11 19:03:32,223 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=20.63 vs. limit=22.5 2023-10-11 19:03:38,812 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=802811.3333333334, ans=0.0 2023-10-11 19:03:53,242 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=802858.0, ans=0.05 2023-10-11 19:04:13,657 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.405e+02 1.620e+02 1.804e+02 2.092e+02 2.694e+02, threshold=3.608e+02, percent-clipped=0.0 2023-10-11 19:04:13,988 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-11 19:04:14,930 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=802951.3333333334, ans=0.2 2023-10-11 19:04:23,842 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=802998.0, ans=0.125 2023-10-11 19:04:43,021 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=803091.3333333334, ans=0.125 2023-10-11 19:04:45,188 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=803091.3333333334, ans=0.125 2023-10-11 19:05:03,950 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.23 vs. 
limit=12.0 2023-10-11 19:05:07,869 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=6.86 vs. limit=15.0 2023-10-11 19:05:17,045 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=803231.3333333334, ans=0.0 2023-10-11 19:05:35,774 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=803324.6666666666, ans=10.0 2023-10-11 19:05:36,630 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=803324.6666666666, ans=0.1 2023-10-11 19:05:41,489 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=803324.6666666666, ans=0.0 2023-10-11 19:05:42,584 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=803371.3333333334, ans=15.0 2023-10-11 19:06:02,748 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=803418.0, ans=0.125 2023-10-11 19:06:04,337 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.353e+02 1.709e+02 1.898e+02 2.148e+02 2.911e+02, threshold=3.796e+02, percent-clipped=0.0 2023-10-11 19:06:12,046 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=803464.6666666666, ans=0.125 2023-10-11 19:06:22,224 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=803511.3333333334, ans=0.125 2023-10-11 19:06:30,558 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=803558.0, ans=0.125 2023-10-11 19:06:50,019 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=803604.6666666666, ans=0.125 2023-10-11 19:06:54,240 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=803651.3333333334, ans=0.2 2023-10-11 19:06:58,115 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=803651.3333333334, ans=0.125 2023-10-11 19:07:03,715 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=803698.0, ans=0.0 2023-10-11 19:07:12,090 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=803698.0, ans=0.0 2023-10-11 19:07:14,259 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=803698.0, ans=0.2 2023-10-11 19:07:16,462 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2.whitening_limit, batch_count=803744.6666666666, ans=15.0 2023-10-11 19:07:17,059 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=803744.6666666666, ans=0.0 2023-10-11 19:07:32,941 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.46 vs. 
limit=15.0 2023-10-11 19:08:00,145 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.417e+02 1.699e+02 1.900e+02 2.073e+02 4.054e+02, threshold=3.800e+02, percent-clipped=1.0 2023-10-11 19:08:09,647 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=803931.3333333334, ans=0.125 2023-10-11 19:08:21,063 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.60 vs. limit=15.0 2023-10-11 19:08:32,326 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=804024.6666666666, ans=0.0 2023-10-11 19:08:45,044 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=804071.3333333334, ans=0.125 2023-10-11 19:08:46,579 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 19:08:48,669 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=804118.0, ans=0.2 2023-10-11 19:08:53,883 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=804118.0, ans=0.2 2023-10-11 19:09:14,196 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=804211.3333333334, ans=0.125 2023-10-11 19:09:33,051 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=804258.0, ans=0.125 2023-10-11 19:09:50,950 INFO [train.py:1031] (3/4) Epoch 13, batch 8500, loss[loss=0.2183, simple_loss=0.302, pruned_loss=0.06734, over 16504.00 frames. ], tot_loss[loss=0.1996, simple_loss=0.2892, pruned_loss=0.055, over 32355916.87 frames. ], batch size: 266, lr: 2.70e-03, grad_scale: 32.0 2023-10-11 19:09:59,773 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.403e+02 1.731e+02 1.912e+02 2.111e+02 2.937e+02, threshold=3.824e+02, percent-clipped=0.0 2023-10-11 19:10:00,147 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff2.min_abs, batch_count=804351.3333333334, ans=0.1 2023-10-11 19:10:17,945 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=804444.6666666666, ans=0.04949747468305833 2023-10-11 19:10:25,448 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=804491.3333333334, ans=0.0 2023-10-11 19:10:27,280 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.34 vs. 
limit=10.0 2023-10-11 19:10:44,641 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=804538.0, ans=0.125 2023-10-11 19:11:06,896 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=804631.3333333334, ans=0.0 2023-10-11 19:11:16,914 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=804678.0, ans=0.0 2023-10-11 19:11:57,065 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=804818.0, ans=0.125 2023-10-11 19:12:02,140 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.478e+02 1.776e+02 2.022e+02 2.289e+02 3.942e+02, threshold=4.044e+02, percent-clipped=1.0 2023-10-11 19:12:21,812 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=804911.3333333334, ans=0.1 2023-10-11 19:12:29,636 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.65 vs. limit=10.0 2023-10-11 19:12:34,490 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=804958.0, ans=0.2 2023-10-11 19:12:40,941 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=805004.6666666666, ans=0.2 2023-10-11 19:12:49,573 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.29 vs. limit=10.0 2023-10-11 19:13:19,798 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.18 vs. 
limit=15.0 2023-10-11 19:13:31,043 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=805191.3333333334, ans=0.1 2023-10-11 19:13:47,460 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=805238.0, ans=0.125 2023-10-11 19:14:05,489 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.308e+02 1.566e+02 1.709e+02 1.961e+02 2.849e+02, threshold=3.417e+02, percent-clipped=0.0 2023-10-11 19:14:26,941 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.min_positive, batch_count=805378.0, ans=0.025 2023-10-11 19:14:38,189 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=805424.6666666666, ans=0.125 2023-10-11 19:14:51,272 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-11 19:15:10,733 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=805564.6666666666, ans=0.0 2023-10-11 19:15:15,727 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 19:15:15,801 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=805564.6666666666, ans=0.125 2023-10-11 19:15:32,180 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=805611.3333333334, ans=0.0 2023-10-11 19:15:37,199 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=805658.0, ans=10.0 2023-10-11 19:15:44,631 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=805658.0, ans=0.0 2023-10-11 19:15:49,801 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=805704.6666666666, ans=0.1 2023-10-11 19:15:51,732 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=805704.6666666666, ans=0.1 2023-10-11 19:15:54,476 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=805704.6666666666, ans=0.125 2023-10-11 19:15:58,047 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 19:16:08,883 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.363e+02 1.635e+02 1.806e+02 2.047e+02 3.085e+02, threshold=3.611e+02, percent-clipped=0.0 2023-10-11 19:16:16,693 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=805798.0, ans=0.0 2023-10-11 19:16:49,634 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=805938.0, ans=0.0 2023-10-11 19:17:04,166 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=806031.3333333334, ans=0.0 2023-10-11 19:17:07,391 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=806031.3333333334, ans=0.0 2023-10-11 19:17:19,605 INFO 
[scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=806078.0, ans=10.0 2023-10-11 19:17:19,705 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=806078.0, ans=0.125 2023-10-11 19:17:52,022 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=806218.0, ans=0.0 2023-10-11 19:17:57,526 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.389e+02 1.688e+02 1.885e+02 2.093e+02 3.354e+02, threshold=3.771e+02, percent-clipped=0.0 2023-10-11 19:18:06,664 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=806264.6666666666, ans=0.125 2023-10-11 19:18:11,426 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.23 vs. limit=12.0 2023-10-11 19:18:16,788 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=806311.3333333334, ans=0.125 2023-10-11 19:18:45,279 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 19:18:46,222 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=806451.3333333334, ans=0.125 2023-10-11 19:19:18,157 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 19:19:39,333 INFO [train.py:1031] (3/4) Epoch 13, batch 9000, loss[loss=0.1945, simple_loss=0.2763, pruned_loss=0.05637, over 16324.00 frames. ], tot_loss[loss=0.199, simple_loss=0.2885, pruned_loss=0.05472, over 32447768.41 frames. ], batch size: 44, lr: 2.70e-03, grad_scale: 32.0 2023-10-11 19:19:41,765 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.42 vs. limit=6.0 2023-10-11 19:19:48,458 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.448e+02 1.703e+02 1.868e+02 2.088e+02 3.549e+02, threshold=3.736e+02, percent-clipped=0.0 2023-10-11 19:19:55,290 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=806731.3333333334, ans=0.5 2023-10-11 19:19:57,903 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.39 vs. limit=22.5 2023-10-11 19:19:58,836 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.50 vs. 
limit=12.0 2023-10-11 19:20:04,332 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=806778.0, ans=0.125 2023-10-11 19:20:09,723 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=806778.0, ans=0.125 2023-10-11 19:20:21,566 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=806824.6666666666, ans=0.125 2023-10-11 19:20:40,369 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=806918.0, ans=0.1 2023-10-11 19:20:59,506 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=807011.3333333334, ans=0.125 2023-10-11 19:21:04,964 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=807011.3333333334, ans=0.1 2023-10-11 19:21:11,467 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=807058.0, ans=0.125 2023-10-11 19:21:38,335 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.326e+02 1.616e+02 1.805e+02 2.034e+02 2.671e+02, threshold=3.610e+02, percent-clipped=0.0 2023-10-11 19:21:58,076 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.80 vs. limit=10.0 2023-10-11 19:22:15,929 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.71 vs. limit=15.0 2023-10-11 19:22:35,029 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=807431.3333333334, ans=0.2 2023-10-11 19:22:36,708 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=807431.3333333334, ans=0.0 2023-10-11 19:22:43,619 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=807431.3333333334, ans=0.125 2023-10-11 19:22:49,379 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=807478.0, ans=0.0 2023-10-11 19:22:58,604 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=807524.6666666666, ans=0.2 2023-10-11 19:23:14,862 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=807571.3333333334, ans=0.2 2023-10-11 19:23:23,530 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=807618.0, ans=0.125 2023-10-11 19:23:27,005 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.522e+02 1.775e+02 1.946e+02 2.169e+02 3.183e+02, threshold=3.891e+02, percent-clipped=0.0 2023-10-11 19:23:30,115 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=807664.6666666666, ans=0.2 2023-10-11 19:23:32,879 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.18 vs. 
limit=10.0 2023-10-11 19:23:48,451 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=807711.3333333334, ans=0.0 2023-10-11 19:24:19,279 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=807851.3333333334, ans=0.125 2023-10-11 19:24:21,003 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=807851.3333333334, ans=0.0 2023-10-11 19:24:24,945 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=807898.0, ans=0.0 2023-10-11 19:24:27,388 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=807898.0, ans=0.125 2023-10-11 19:24:29,551 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.64 vs. limit=15.0 2023-10-11 19:24:32,316 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.00 vs. limit=10.0 2023-10-11 19:24:37,979 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=807944.6666666666, ans=0.0 2023-10-11 19:24:49,760 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=807991.3333333334, ans=0.0 2023-10-11 19:24:54,795 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=808038.0, ans=0.1 2023-10-11 19:25:13,912 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.487e+02 1.746e+02 1.947e+02 2.206e+02 3.240e+02, threshold=3.894e+02, percent-clipped=0.0 2023-10-11 19:25:30,650 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=808178.0, ans=0.125 2023-10-11 19:25:36,195 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=808178.0, ans=0.0 2023-10-11 19:25:46,349 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.77 vs. limit=15.0 2023-10-11 19:26:00,011 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=808271.3333333334, ans=0.0 2023-10-11 19:26:19,747 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=808364.6666666666, ans=0.1 2023-10-11 19:26:45,358 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=808458.0, ans=0.125 2023-10-11 19:27:02,715 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.27 vs. 
limit=22.5 2023-10-11 19:27:09,192 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=808551.3333333334, ans=0.125 2023-10-11 19:27:14,882 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.558e+02 1.891e+02 2.171e+02 2.514e+02 3.430e+02, threshold=4.342e+02, percent-clipped=0.0 2023-10-11 19:27:27,787 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=808644.6666666666, ans=0.0 2023-10-11 19:27:36,372 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.56 vs. limit=15.0 2023-10-11 19:27:37,855 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=808644.6666666666, ans=0.125 2023-10-11 19:27:46,482 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=808691.3333333334, ans=0.125 2023-10-11 19:27:57,548 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=808738.0, ans=0.125 2023-10-11 19:27:57,660 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.45 vs. limit=15.0 2023-10-11 19:27:59,345 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=808738.0, ans=0.0 2023-10-11 19:27:59,391 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=808738.0, ans=0.125 2023-10-11 19:28:14,506 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=808784.6666666666, ans=0.0 2023-10-11 19:28:16,221 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=808831.3333333334, ans=0.0 2023-10-11 19:28:16,322 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=808831.3333333334, ans=0.0 2023-10-11 19:28:25,783 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=808831.3333333334, ans=0.09899494936611666 2023-10-11 19:28:31,770 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=808878.0, ans=0.125 2023-10-11 19:28:35,418 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=808878.0, ans=0.1 2023-10-11 19:28:35,612 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.68 vs. limit=22.5 2023-10-11 19:29:06,697 INFO [train.py:1031] (3/4) Epoch 13, batch 9500, loss[loss=0.1838, simple_loss=0.2751, pruned_loss=0.0462, over 16864.00 frames. ], tot_loss[loss=0.1996, simple_loss=0.2893, pruned_loss=0.05499, over 32534889.60 frames. 
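The recurring optim.py:471 records summarize gradient clipping over a recent window of batches. With Clipping_scale=2.0, the printed threshold is consistently twice the logged median grad-norm (in the record above, quartiles 1.558e+02 1.891e+02 2.171e+02 2.514e+02 3.430e+02 give threshold=4.342e+02, which is 2.0 x 2.171e+02), and percent-clipped is the share of batches in the window whose norm exceeded that threshold. A minimal sketch of this bookkeeping, assuming grad_norms is a buffer of recent per-batch gradient norms (a hypothetical buffer, not the optimizer's actual state):

```python
import torch

def clipping_report(grad_norms: torch.Tensor, clipping_scale: float = 2.0) -> None:
    # Five-point summary (min, 25%, median, 75%, max) of recent per-batch
    # gradient norms, matching the "grad-norm quartiles" field in the log.
    q = torch.quantile(grad_norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    threshold = float(clipping_scale * q[2])   # threshold = scale * median
    percent_clipped = 100.0 * (grad_norms > threshold).float().mean().item()
    quartiles = " ".join(f"{v:.3e}" for v in q.tolist())
    print(f"Clipping_scale={clipping_scale}, grad-norm quartiles {quartiles}, "
          f"threshold={threshold:.3e}, percent-clipped={percent_clipped:.1f}")

# Made-up window whose median of 2.171e+02 reproduces threshold=4.342e+02.
clipping_report(torch.tensor([155.8, 189.1, 217.1, 251.4, 343.0]))
```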
], batch size: 72, lr: 2.70e-03, grad_scale: 32.0 2023-10-11 19:29:10,664 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=809018.0, ans=0.125 2023-10-11 19:29:15,196 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.453e+02 1.709e+02 1.826e+02 1.980e+02 2.687e+02, threshold=3.652e+02, percent-clipped=0.0 2023-10-11 19:29:23,115 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=809064.6666666666, ans=0.125 2023-10-11 19:29:33,331 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=809111.3333333334, ans=0.0 2023-10-11 19:29:37,741 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=3.75 vs. limit=12.0 2023-10-11 19:29:38,528 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=809158.0, ans=0.1 2023-10-11 19:29:41,987 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=809158.0, ans=0.125 2023-10-11 19:30:06,465 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=809251.3333333334, ans=0.1 2023-10-11 19:30:11,359 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=809251.3333333334, ans=0.0 2023-10-11 19:30:13,296 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=809298.0, ans=0.125 2023-10-11 19:30:25,496 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=809344.6666666666, ans=0.125 2023-10-11 19:30:33,704 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=809344.6666666666, ans=0.5 2023-10-11 19:30:38,013 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=809391.3333333334, ans=0.1 2023-10-11 19:31:03,121 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=809484.6666666666, ans=0.0 2023-10-11 19:31:08,451 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.359e+02 1.694e+02 1.835e+02 2.088e+02 3.171e+02, threshold=3.669e+02, percent-clipped=0.0 2023-10-11 19:31:16,195 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.75 vs. limit=6.0 2023-10-11 19:31:33,837 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=809624.6666666666, ans=0.2 2023-10-11 19:32:08,704 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.51 vs. limit=22.5 2023-10-11 19:32:09,707 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=809764.6666666666, ans=0.0 2023-10-11 19:32:22,029 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.23 vs. 
limit=6.0 2023-10-11 19:32:27,119 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=809811.3333333334, ans=0.0 2023-10-11 19:32:32,182 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=809858.0, ans=0.0 2023-10-11 19:32:41,120 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=809858.0, ans=0.125 2023-10-11 19:32:43,332 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.65 vs. limit=15.0 2023-10-11 19:32:53,784 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=809951.3333333334, ans=0.125 2023-10-11 19:33:03,133 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.391e+02 1.719e+02 1.977e+02 2.268e+02 3.748e+02, threshold=3.954e+02, percent-clipped=1.0 2023-10-11 19:33:09,800 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=809998.0, ans=0.1 2023-10-11 19:33:30,345 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=810091.3333333334, ans=0.125 2023-10-11 19:33:32,322 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=810091.3333333334, ans=0.125 2023-10-11 19:33:48,463 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=810184.6666666666, ans=0.0 2023-10-11 19:33:52,279 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=810184.6666666666, ans=0.0 2023-10-11 19:34:06,164 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=810231.3333333334, ans=0.125 2023-10-11 19:34:22,927 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=810324.6666666666, ans=0.1 2023-10-11 19:34:23,043 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=810324.6666666666, ans=0.0 2023-10-11 19:34:54,792 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.443e+02 1.763e+02 1.991e+02 2.263e+02 3.655e+02, threshold=3.981e+02, percent-clipped=0.0 2023-10-11 19:35:32,995 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=810604.6666666666, ans=0.0 2023-10-11 19:36:05,084 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=810744.6666666666, ans=0.0 2023-10-11 19:36:09,875 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=810744.6666666666, ans=0.2 2023-10-11 19:36:11,606 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.94 vs. 
limit=15.0 2023-10-11 19:36:15,859 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=810791.3333333334, ans=0.125 2023-10-11 19:36:17,771 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=810791.3333333334, ans=0.125 2023-10-11 19:36:34,075 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.00 vs. limit=15.0 2023-10-11 19:36:37,324 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 19:36:47,870 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.400e+02 1.678e+02 1.805e+02 1.974e+02 2.361e+02, threshold=3.609e+02, percent-clipped=0.0 2023-10-11 19:37:01,971 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=810978.0, ans=0.07 2023-10-11 19:37:07,844 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=810978.0, ans=0.05 2023-10-11 19:37:11,138 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.12 vs. limit=15.0 2023-10-11 19:37:19,477 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=811024.6666666666, ans=0.1 2023-10-11 19:37:19,790 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.98 vs. limit=15.0 2023-10-11 19:37:33,376 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=811071.3333333334, ans=0.1 2023-10-11 19:37:48,672 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 19:37:57,395 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=811211.3333333334, ans=0.0 2023-10-11 19:37:58,668 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=811211.3333333334, ans=0.125 2023-10-11 19:38:01,586 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=811211.3333333334, ans=0.2 2023-10-11 19:38:02,607 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=811211.3333333334, ans=0.1 2023-10-11 19:38:15,554 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=811258.0, ans=0.07 2023-10-11 19:38:16,563 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=811258.0, ans=0.125 2023-10-11 19:38:28,932 INFO [train.py:1031] (3/4) Epoch 13, batch 10000, loss[loss=0.1791, simple_loss=0.2686, pruned_loss=0.04486, over 16655.00 frames. ], tot_loss[loss=0.1988, simple_loss=0.2883, pruned_loss=0.0547, over 32548068.27 frames. 
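Every bracketed loss triple in this section obeys one linear relation: loss = 0.5 * simple_loss + pruned_loss. The tot_loss just above gives 0.5 x 0.2883 + 0.0547 = 0.1988, and the per-batch triple gives 0.5 x 0.2686 + 0.04486 = 0.1791, both matching the printed values. tot_loss is the same quantity averaged with the frame counts shown after each value; the fractional "over N frames" totals suggest a decayed window rather than a plain cumulative sum. A sketch of the arithmetic under those assumptions (the 0.5 weight is read off the logged numbers; the decay constant below is a hypothetical choice, not the trainer's actual value):

```python
def combined_loss(simple_loss: float, pruned_loss: float,
                  simple_loss_scale: float = 0.5) -> float:
    # Per-batch "loss" as printed in the log: a scaled simple transducer
    # loss plus the pruned transducer loss.
    return simple_loss_scale * simple_loss + pruned_loss

class TotLoss:
    """Frame-weighted running average with an (assumed) decay factor,
    which would explain the slowly moving tot_loss and the fractional
    'over N frames' totals in the records above."""
    def __init__(self, decay: float = 0.9995) -> None:
        self.decay = decay
        self.loss_sum = 0.0
        self.frames = 0.0

    def update(self, batch_loss: float, num_frames: int) -> float:
        self.loss_sum = self.decay * self.loss_sum + batch_loss * num_frames
        self.frames = self.decay * self.frames + num_frames
        return self.loss_sum / self.frames

print(combined_loss(0.2883, 0.0547))  # -> 0.19885, matching tot_loss above
```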
], batch size: 56, lr: 2.69e-03, grad_scale: 32.0 2023-10-11 19:38:30,088 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=811351.3333333334, ans=0.125 2023-10-11 19:38:30,220 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=811351.3333333334, ans=0.2 2023-10-11 19:38:32,051 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=811351.3333333334, ans=0.2 2023-10-11 19:38:37,990 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.370e+02 1.719e+02 1.922e+02 2.121e+02 2.805e+02, threshold=3.844e+02, percent-clipped=0.0 2023-10-11 19:38:40,938 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=811398.0, ans=0.125 2023-10-11 19:38:51,937 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=811444.6666666666, ans=0.0 2023-10-11 19:38:52,304 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.59 vs. limit=10.0 2023-10-11 19:39:03,743 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=811491.3333333334, ans=0.0 2023-10-11 19:39:08,025 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=811491.3333333334, ans=0.125 2023-10-11 19:39:14,120 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.29 vs. limit=10.0 2023-10-11 19:39:34,614 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.28 vs. limit=8.0 2023-10-11 19:39:45,076 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=811678.0, ans=0.0 2023-10-11 19:40:00,737 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=811724.6666666666, ans=0.125 2023-10-11 19:40:01,972 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=811724.6666666666, ans=0.1 2023-10-11 19:40:08,097 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.min_positive, batch_count=811771.3333333334, ans=0.025 2023-10-11 19:40:10,949 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=811771.3333333334, ans=0.0 2023-10-11 19:40:13,146 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.33 vs. 
limit=22.5 2023-10-11 19:40:17,582 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=811818.0, ans=0.0 2023-10-11 19:40:24,749 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=811818.0, ans=10.0 2023-10-11 19:40:24,752 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=811818.0, ans=0.07 2023-10-11 19:40:26,883 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.44 vs. limit=10.0 2023-10-11 19:40:29,947 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.354e+02 1.678e+02 1.841e+02 2.048e+02 3.171e+02, threshold=3.681e+02, percent-clipped=0.0 2023-10-11 19:40:38,553 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.21 vs. limit=15.0 2023-10-11 19:40:41,765 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=811911.3333333334, ans=0.125 2023-10-11 19:40:48,617 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=811911.3333333334, ans=0.125 2023-10-11 19:40:55,771 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.89 vs. limit=15.0 2023-10-11 19:40:59,908 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=811958.0, ans=0.125 2023-10-11 19:41:23,767 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=812051.3333333334, ans=0.0 2023-10-11 19:41:30,306 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=812098.0, ans=0.0 2023-10-11 19:41:35,169 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=812098.0, ans=0.0 2023-10-11 19:41:43,617 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=812144.6666666666, ans=0.0 2023-10-11 19:41:50,662 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=812191.3333333334, ans=0.07 2023-10-11 19:42:12,390 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=812284.6666666666, ans=0.0 2023-10-11 19:42:27,095 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.334e+02 1.714e+02 1.932e+02 2.217e+02 2.985e+02, threshold=3.864e+02, percent-clipped=0.0 2023-10-11 19:42:43,566 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.77 vs. 
limit=6.0 2023-10-11 19:42:47,589 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=812378.0, ans=0.125 2023-10-11 19:42:48,683 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=812378.0, ans=0.2 2023-10-11 19:43:10,604 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.91 vs. limit=15.0 2023-10-11 19:43:25,516 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.46 vs. limit=15.0 2023-10-11 19:43:27,064 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=812564.6666666666, ans=0.0 2023-10-11 19:43:50,610 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=812658.0, ans=0.125 2023-10-11 19:43:52,029 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.88 vs. limit=15.0 2023-10-11 19:43:52,696 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=812658.0, ans=0.0 2023-10-11 19:44:03,791 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 19:44:09,492 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.05 vs. limit=15.0 2023-10-11 19:44:16,191 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.14 vs. limit=22.5 2023-10-11 19:44:18,460 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=812751.3333333334, ans=0.2 2023-10-11 19:44:23,524 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.429e+02 1.672e+02 1.839e+02 2.092e+02 3.152e+02, threshold=3.678e+02, percent-clipped=0.0 2023-10-11 19:44:47,022 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.66 vs. limit=15.0 2023-10-11 19:45:23,384 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=813031.3333333334, ans=0.1 2023-10-11 19:45:38,867 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=813078.0, ans=0.0 2023-10-11 19:45:58,044 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.93 vs. 
limit=15.0 2023-10-11 19:46:00,650 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=813171.3333333334, ans=0.125 2023-10-11 19:46:19,792 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.363e+02 1.638e+02 1.760e+02 2.053e+02 3.312e+02, threshold=3.521e+02, percent-clipped=0.0 2023-10-11 19:46:31,542 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=813311.3333333334, ans=0.125 2023-10-11 19:46:57,336 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=813404.6666666666, ans=10.0 2023-10-11 19:47:01,960 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=813404.6666666666, ans=0.0 2023-10-11 19:47:16,343 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.22 vs. limit=15.0 2023-10-11 19:47:17,742 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=813498.0, ans=0.125 2023-10-11 19:47:29,707 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=813544.6666666666, ans=0.125 2023-10-11 19:47:48,097 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=813591.3333333334, ans=0.0 2023-10-11 19:47:48,126 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=813591.3333333334, ans=0.0 2023-10-11 19:47:58,586 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=813638.0, ans=0.125 2023-10-11 19:48:01,113 INFO [train.py:1031] (3/4) Epoch 13, batch 10500, loss[loss=0.1918, simple_loss=0.2883, pruned_loss=0.04763, over 16901.00 frames. ], tot_loss[loss=0.1993, simple_loss=0.2889, pruned_loss=0.05483, over 32612586.61 frames. ], batch size: 82, lr: 2.69e-03, grad_scale: 32.0 2023-10-11 19:48:08,907 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=813684.6666666666, ans=0.125 2023-10-11 19:48:09,888 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=813684.6666666666, ans=0.1 2023-10-11 19:48:11,507 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.493e+02 1.723e+02 1.882e+02 2.233e+02 3.328e+02, threshold=3.764e+02, percent-clipped=0.0 2023-10-11 19:48:21,681 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.08 vs. 
limit=15.0 2023-10-11 19:48:32,597 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=813824.6666666666, ans=0.0 2023-10-11 19:48:51,822 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=813871.3333333334, ans=0.125 2023-10-11 19:48:55,691 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=813918.0, ans=0.125 2023-10-11 19:49:42,999 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=814058.0, ans=0.0 2023-10-11 19:49:44,184 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.91 vs. limit=10.0 2023-10-11 19:50:05,720 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=814151.3333333334, ans=0.1 2023-10-11 19:50:07,713 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=814151.3333333334, ans=0.0 2023-10-11 19:50:09,291 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=814151.3333333334, ans=0.125 2023-10-11 19:50:16,059 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.400e+02 1.703e+02 1.945e+02 2.213e+02 3.175e+02, threshold=3.890e+02, percent-clipped=0.0 2023-10-11 19:50:21,366 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=814198.0, ans=0.125 2023-10-11 19:50:23,820 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=814198.0, ans=0.125 2023-10-11 19:50:23,843 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=814198.0, ans=0.125 2023-10-11 19:50:24,178 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.79 vs. limit=22.5 2023-10-11 19:50:38,932 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=814291.3333333334, ans=0.125 2023-10-11 19:50:53,728 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=814338.0, ans=0.0 2023-10-11 19:50:58,778 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=814338.0, ans=0.2 2023-10-11 19:51:04,139 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=814384.6666666666, ans=0.0 2023-10-11 19:51:07,632 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=814384.6666666666, ans=0.125 2023-10-11 19:51:14,697 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.99 vs. limit=10.0 2023-10-11 19:51:29,301 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.64 vs. 
limit=12.0 2023-10-11 19:51:40,511 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=814524.6666666666, ans=0.0 2023-10-11 19:51:44,239 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=814524.6666666666, ans=0.2 2023-10-11 19:51:46,277 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=814524.6666666666, ans=0.2 2023-10-11 19:52:12,304 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.288e+02 1.712e+02 1.994e+02 2.267e+02 2.917e+02, threshold=3.989e+02, percent-clipped=0.0 2023-10-11 19:52:12,640 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=814664.6666666666, ans=0.0 2023-10-11 19:52:21,166 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=814664.6666666666, ans=0.1 2023-10-11 19:52:23,765 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=814711.3333333334, ans=0.125 2023-10-11 19:52:44,020 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=814758.0, ans=0.1 2023-10-11 19:52:52,027 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.25 vs. limit=15.0 2023-10-11 19:52:52,969 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=814804.6666666666, ans=0.2 2023-10-11 19:53:01,433 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=814851.3333333334, ans=0.035 2023-10-11 19:53:01,438 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=814851.3333333334, ans=0.0 2023-10-11 19:53:10,246 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=814898.0, ans=0.025 2023-10-11 19:53:15,292 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.32 vs. 
limit=15.0 2023-10-11 19:53:23,999 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=814944.6666666666, ans=0.125 2023-10-11 19:53:33,031 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=814991.3333333334, ans=0.125 2023-10-11 19:53:34,060 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=814991.3333333334, ans=0.125 2023-10-11 19:53:43,118 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=814991.3333333334, ans=0.125 2023-10-11 19:53:56,265 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=815084.6666666666, ans=0.125 2023-10-11 19:54:07,877 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.419e+02 1.801e+02 2.122e+02 2.356e+02 4.342e+02, threshold=4.243e+02, percent-clipped=1.0 2023-10-11 19:54:15,070 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=815131.3333333334, ans=0.0 2023-10-11 19:54:33,191 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.19 vs. limit=15.0 2023-10-11 19:54:39,453 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=815224.6666666666, ans=0.1 2023-10-11 19:54:40,268 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=815271.3333333334, ans=0.125 2023-10-11 19:54:48,870 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=815271.3333333334, ans=0.0 2023-10-11 19:54:52,234 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=815318.0, ans=0.025 2023-10-11 19:54:55,293 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=815318.0, ans=0.0 2023-10-11 19:55:46,119 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=815504.6666666666, ans=0.125 2023-10-11 19:55:46,258 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.46 vs. 
limit=22.5 2023-10-11 19:55:48,666 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=815551.3333333334, ans=0.125 2023-10-11 19:55:50,199 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=815551.3333333334, ans=0.125 2023-10-11 19:55:54,114 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=815551.3333333334, ans=0.0 2023-10-11 19:55:54,206 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=815551.3333333334, ans=0.125 2023-10-11 19:55:55,446 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.41 vs. limit=15.0 2023-10-11 19:56:03,817 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.311e+02 1.624e+02 1.779e+02 1.972e+02 2.884e+02, threshold=3.557e+02, percent-clipped=0.0 2023-10-11 19:56:10,430 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.70 vs. limit=15.0 2023-10-11 19:56:31,039 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=815691.3333333334, ans=0.2 2023-10-11 19:56:44,163 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=815738.0, ans=0.125 2023-10-11 19:56:45,276 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.81 vs. limit=15.0 2023-10-11 19:56:45,901 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=815738.0, ans=0.125 2023-10-11 19:56:55,604 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=815784.6666666666, ans=0.125 2023-10-11 19:56:56,397 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=815784.6666666666, ans=0.125 2023-10-11 19:57:09,964 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=815831.3333333334, ans=0.125 2023-10-11 19:57:15,327 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.min_positive, batch_count=815878.0, ans=0.05 2023-10-11 19:57:29,910 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=815924.6666666666, ans=0.125 2023-10-11 19:57:36,046 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=815971.3333333334, ans=0.09899494936611666 2023-10-11 19:57:45,398 INFO [train.py:1031] (3/4) Epoch 13, batch 11000, loss[loss=0.2204, simple_loss=0.3012, pruned_loss=0.06978, over 16619.00 frames. ], tot_loss[loss=0.1993, simple_loss=0.2888, pruned_loss=0.05483, over 32658219.37 frames. 
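Almost every scaling.py:199 record is a ScheduledFloat read-out: a module hyperparameter (a dropout_p, a skip_rate, a bypass scale_min, a balancer bound) whose current value ans is looked up from the global batch_count. A minimal sketch, assuming piecewise-linear interpolation between (batch_count, value) breakpoints; the interpolation rule is an assumption about scaling.py, and the breakpoints below are illustrative, not the recipe's actual schedule:

```python
import bisect

class ScheduledFloat:
    """A value that changes with training progress: linear interpolation
    between (batch_count, value) breakpoints, clamped at both ends."""
    def __init__(self, *points: tuple) -> None:
        self.xs = [p[0] for p in points]
        self.ys = [p[1] for p in points]

    def value(self, batch_count: float) -> float:
        if batch_count <= self.xs[0]:
            return self.ys[0]
        if batch_count >= self.xs[-1]:
            return self.ys[-1]
        i = bisect.bisect_right(self.xs, batch_count)
        x0, x1 = self.xs[i - 1], self.xs[i]
        y0, y1 = self.ys[i - 1], self.ys[i]
        return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

# Illustrative only: a skip rate annealed from 0.2 to 0.0 over the first
# 4000 batches and held there (cf. the skip_rate ans=0.0 entries above).
conv_skip_rate = ScheduledFloat((0.0, 0.2), (4000.0, 0.0))
print(conv_skip_rate.value(820_000.0))  # -> 0.0
```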
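The scaling.py:979 Whitening records compare a measured statistic of a module's output against a per-module limit (e.g. metric=12.70 vs. limit=15.0 above); the module presumably nudges activations back toward a whiter channel covariance once the metric exceeds its limit. One plausible such statistic, offered as an illustrative proxy and not scaling.py's exact formula: it equals 1.0 for a perfectly white covariance and grows as variance concentrates in fewer directions.

```python
import torch

def whitening_metric(x: torch.Tensor) -> float:
    """Illustrative whiteness proxy for activations x of shape
    (num_frames, num_channels): the ratio of the mean squared eigenvalue
    of the channel covariance to the squared mean eigenvalue. By
    Cauchy-Schwarz it is >= 1.0, with equality exactly when the
    covariance is a multiple of the identity ('white')."""
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.T @ x) / x.shape[0]
    eigs = torch.linalg.eigvalsh(cov)  # real eigenvalues, ascending order
    return float((eigs ** 2).mean() / eigs.mean() ** 2)

# White noise scores close to 1.0; low-rank activations score far higher.
print(whitening_metric(torch.randn(4000, 192)))
```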
], batch size: 56, lr: 2.68e-03, grad_scale: 32.0 2023-10-11 19:57:56,159 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.482e+02 1.753e+02 1.916e+02 2.204e+02 3.323e+02, threshold=3.832e+02, percent-clipped=0.0 2023-10-11 19:58:27,893 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=816204.6666666666, ans=0.2 2023-10-11 19:58:29,449 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=816204.6666666666, ans=0.0 2023-10-11 19:58:35,604 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=816204.6666666666, ans=0.0 2023-10-11 19:59:04,263 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=816344.6666666666, ans=0.0 2023-10-11 19:59:12,232 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=816344.6666666666, ans=0.04949747468305833 2023-10-11 19:59:29,577 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=816438.0, ans=0.125 2023-10-11 19:59:31,356 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=816438.0, ans=0.125 2023-10-11 19:59:34,727 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.40 vs. limit=22.5 2023-10-11 19:59:54,265 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.470e+02 1.665e+02 1.857e+02 2.037e+02 2.752e+02, threshold=3.714e+02, percent-clipped=0.0 2023-10-11 20:00:09,409 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=816578.0, ans=0.1 2023-10-11 20:00:11,324 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=816578.0, ans=0.025 2023-10-11 20:00:37,842 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=816671.3333333334, ans=0.125 2023-10-11 20:00:38,086 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.56 vs. limit=15.0 2023-10-11 20:00:48,116 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=816718.0, ans=0.1 2023-10-11 20:01:00,971 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=816764.6666666666, ans=0.125 2023-10-11 20:01:10,530 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=816811.3333333334, ans=0.09899494936611666 2023-10-11 20:01:12,354 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.09 vs. limit=15.0 2023-10-11 20:01:15,857 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=9.123e-02 2023-10-11 20:01:20,200 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.13 vs. 
limit=15.0 2023-10-11 20:01:36,725 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=816904.6666666666, ans=0.07 2023-10-11 20:01:36,804 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=816904.6666666666, ans=0.0 2023-10-11 20:01:52,961 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.302e+02 1.619e+02 1.826e+02 2.052e+02 2.830e+02, threshold=3.652e+02, percent-clipped=0.0 2023-10-11 20:02:09,645 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=817044.6666666666, ans=0.125 2023-10-11 20:02:33,230 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=817138.0, ans=0.2 2023-10-11 20:03:09,343 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=817278.0, ans=0.1 2023-10-11 20:03:44,632 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.03 vs. limit=22.5 2023-10-11 20:03:51,728 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.403e+02 1.668e+02 1.823e+02 2.086e+02 3.030e+02, threshold=3.646e+02, percent-clipped=0.0 2023-10-11 20:04:17,032 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.31 vs. limit=22.5 2023-10-11 20:05:00,530 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.10 vs. limit=15.0 2023-10-11 20:05:02,428 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=817744.6666666666, ans=0.125 2023-10-11 20:05:07,550 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.39 vs. limit=15.0 2023-10-11 20:05:11,445 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.39 vs. limit=15.0 2023-10-11 20:05:25,111 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.10 vs. limit=15.0 2023-10-11 20:05:31,519 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=817838.0, ans=0.1 2023-10-11 20:05:45,529 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.444e+02 1.750e+02 1.871e+02 2.135e+02 2.722e+02, threshold=3.741e+02, percent-clipped=0.0 2023-10-11 20:05:55,300 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=817978.0, ans=0.125 2023-10-11 20:06:11,753 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=818024.6666666666, ans=0.2 2023-10-11 20:06:38,510 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.09 vs. 
limit=15.0 2023-10-11 20:06:45,720 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=818164.6666666666, ans=0.125 2023-10-11 20:06:52,726 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=818164.6666666666, ans=0.2 2023-10-11 20:07:00,966 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=818211.3333333334, ans=0.125 2023-10-11 20:07:03,752 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=818211.3333333334, ans=0.125 2023-10-11 20:07:21,836 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=818304.6666666666, ans=0.125 2023-10-11 20:07:30,165 INFO [train.py:1031] (3/4) Epoch 13, batch 11500, loss[loss=0.2146, simple_loss=0.3091, pruned_loss=0.06009, over 16588.00 frames. ], tot_loss[loss=0.1992, simple_loss=0.2888, pruned_loss=0.05476, over 32719835.82 frames. ], batch size: 219, lr: 2.68e-03, grad_scale: 32.0 2023-10-11 20:07:35,622 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=818351.3333333334, ans=0.125 2023-10-11 20:07:40,669 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.398e+02 1.851e+02 2.098e+02 2.486e+02 3.679e+02, threshold=4.195e+02, percent-clipped=0.0 2023-10-11 20:07:44,466 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=818398.0, ans=0.125 2023-10-11 20:07:51,155 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.30 vs. limit=12.0 2023-10-11 20:08:32,275 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=818584.6666666666, ans=0.5 2023-10-11 20:08:35,477 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=818584.6666666666, ans=0.5 2023-10-11 20:08:35,605 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=818584.6666666666, ans=0.125 2023-10-11 20:08:40,850 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=818631.3333333334, ans=0.125 2023-10-11 20:08:40,959 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.63 vs. limit=15.0 2023-10-11 20:08:48,011 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.64 vs. limit=15.0 2023-10-11 20:08:59,465 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.63 vs. limit=15.0 2023-10-11 20:09:10,704 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.82 vs. 
limit=22.5 2023-10-11 20:09:14,538 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=818724.6666666666, ans=0.07 2023-10-11 20:09:16,393 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=818724.6666666666, ans=0.0 2023-10-11 20:09:25,316 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=818771.3333333334, ans=0.09899494936611666 2023-10-11 20:09:30,809 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=818771.3333333334, ans=0.125 2023-10-11 20:09:45,057 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.366e+02 1.646e+02 1.816e+02 2.069e+02 3.193e+02, threshold=3.633e+02, percent-clipped=0.0 2023-10-11 20:09:55,321 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=818911.3333333334, ans=0.1 2023-10-11 20:10:02,047 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=818911.3333333334, ans=0.125 2023-10-11 20:10:13,220 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 20:10:32,155 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=819051.3333333334, ans=0.125 2023-10-11 20:10:56,633 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.57 vs. limit=12.0 2023-10-11 20:11:25,814 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=819284.6666666666, ans=0.0 2023-10-11 20:11:33,940 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.286e+02 1.663e+02 1.794e+02 2.026e+02 2.722e+02, threshold=3.588e+02, percent-clipped=0.0 2023-10-11 20:11:35,249 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=819331.3333333334, ans=0.125 2023-10-11 20:11:36,775 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=819331.3333333334, ans=0.0 2023-10-11 20:11:47,327 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=819378.0, ans=0.1 2023-10-11 20:11:53,111 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=819378.0, ans=0.1 2023-10-11 20:11:58,473 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=819424.6666666666, ans=0.125 2023-10-11 20:11:58,487 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=819424.6666666666, ans=0.125 2023-10-11 20:11:58,505 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=819424.6666666666, ans=0.125 2023-10-11 20:12:07,250 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=819471.3333333334, ans=0.1 2023-10-11 20:12:13,772 INFO [scaling.py:199] (3/4) 
ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=819471.3333333334, ans=0.0 2023-10-11 20:12:41,081 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=819564.6666666666, ans=0.0 2023-10-11 20:13:10,292 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=819658.0, ans=0.0 2023-10-11 20:13:37,185 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.416e+02 1.639e+02 1.828e+02 2.073e+02 2.770e+02, threshold=3.657e+02, percent-clipped=0.0 2023-10-11 20:13:37,837 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.86 vs. limit=10.0 2023-10-11 20:14:04,085 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=819891.3333333334, ans=0.125 2023-10-11 20:14:07,278 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=819891.3333333334, ans=0.1 2023-10-11 20:14:08,418 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=819891.3333333334, ans=0.0 2023-10-11 20:14:35,729 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=819984.6666666666, ans=0.2 2023-10-11 20:14:35,798 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=819984.6666666666, ans=0.2 2023-10-11 20:15:01,911 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=820124.6666666666, ans=0.125 2023-10-11 20:15:26,070 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=820171.3333333334, ans=0.125 2023-10-11 20:15:27,139 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=820218.0, ans=0.5 2023-10-11 20:15:39,334 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.371e+02 1.751e+02 1.878e+02 2.168e+02 3.461e+02, threshold=3.757e+02, percent-clipped=0.0 2023-10-11 20:15:39,756 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=820264.6666666666, ans=0.125 2023-10-11 20:16:06,415 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=820358.0, ans=0.1 2023-10-11 20:16:10,354 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=820358.0, ans=0.2 2023-10-11 20:16:17,745 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=820404.6666666666, ans=0.2 2023-10-11 20:16:21,409 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=820404.6666666666, ans=0.0 2023-10-11 20:17:01,568 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=820591.3333333334, ans=0.0 2023-10-11 20:17:02,490 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=820591.3333333334, ans=0.125 2023-10-11 
20:17:12,954 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=820638.0, ans=0.025 2023-10-11 20:17:21,794 INFO [train.py:1031] (3/4) Epoch 13, batch 12000, loss[loss=0.2127, simple_loss=0.302, pruned_loss=0.06177, over 16863.00 frames. ], tot_loss[loss=0.1991, simple_loss=0.2891, pruned_loss=0.05456, over 32778188.82 frames. ], batch size: 77, lr: 2.68e-03, grad_scale: 32.0 2023-10-11 20:17:34,160 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.358e+02 1.652e+02 1.869e+02 2.166e+02 3.234e+02, threshold=3.739e+02, percent-clipped=0.0 2023-10-11 20:18:08,656 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=820871.3333333334, ans=0.125 2023-10-11 20:18:50,086 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=821011.3333333334, ans=0.05 2023-10-11 20:18:51,134 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=821011.3333333334, ans=0.1 2023-10-11 20:19:19,961 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=821151.3333333334, ans=10.0 2023-10-11 20:19:32,761 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.11 vs. limit=15.0 2023-10-11 20:19:35,635 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.49 vs. limit=15.0 2023-10-11 20:19:36,012 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.372e+02 1.593e+02 1.840e+02 2.000e+02 3.453e+02, threshold=3.680e+02, percent-clipped=0.0 2023-10-11 20:19:53,702 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=821244.6666666666, ans=0.125 2023-10-11 20:19:55,541 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=821244.6666666666, ans=0.125 2023-10-11 20:20:13,309 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=821338.0, ans=0.0 2023-10-11 20:20:13,481 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=821338.0, ans=0.2 2023-10-11 20:20:18,584 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.51 vs. 
limit=15.0 2023-10-11 20:20:23,899 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=821384.6666666666, ans=0.1 2023-10-11 20:20:29,341 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=821384.6666666666, ans=0.0 2023-10-11 20:20:43,160 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=821478.0, ans=0.125 2023-10-11 20:21:22,307 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=821618.0, ans=0.2 2023-10-11 20:21:26,267 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.334e+02 1.655e+02 1.843e+02 2.001e+02 2.814e+02, threshold=3.685e+02, percent-clipped=0.0 2023-10-11 20:21:30,139 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=821664.6666666666, ans=0.2 2023-10-11 20:21:33,213 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=821664.6666666666, ans=0.0 2023-10-11 20:21:36,571 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.61 vs. limit=10.0 2023-10-11 20:21:39,581 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=821711.3333333334, ans=0.125 2023-10-11 20:21:49,780 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=821758.0, ans=0.125 2023-10-11 20:21:54,633 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=821758.0, ans=0.125 2023-10-11 20:22:15,863 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=821851.3333333334, ans=0.1 2023-10-11 20:22:29,396 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=821898.0, ans=0.09899494936611666 2023-10-11 20:22:39,416 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=821944.6666666666, ans=0.0 2023-10-11 20:23:02,158 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=822038.0, ans=0.125 2023-10-11 20:23:05,451 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=822038.0, ans=0.0 2023-10-11 20:23:05,746 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.51 vs. limit=15.0 2023-10-11 20:23:06,665 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=822038.0, ans=0.0 2023-10-11 20:23:18,774 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.441e+02 1.759e+02 2.000e+02 2.189e+02 2.781e+02, threshold=4.000e+02, percent-clipped=0.0 2023-10-11 20:23:41,323 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=822224.6666666666, ans=0.125 2023-10-11 20:23:44,483 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.35 vs. 
limit=10.0 2023-10-11 20:23:46,369 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.min_positive, batch_count=822224.6666666666, ans=0.025 2023-10-11 20:24:48,025 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.90 vs. limit=12.0 2023-10-11 20:24:53,801 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=822504.6666666666, ans=0.125 2023-10-11 20:25:00,407 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=822504.6666666666, ans=0.0 2023-10-11 20:25:15,420 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=822551.3333333334, ans=0.2 2023-10-11 20:25:17,298 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.436e+02 1.712e+02 1.845e+02 2.064e+02 2.666e+02, threshold=3.690e+02, percent-clipped=0.0 2023-10-11 20:25:17,789 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=822598.0, ans=0.125 2023-10-11 20:25:22,580 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=822598.0, ans=0.0 2023-10-11 20:25:29,536 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.41 vs. limit=15.0 2023-10-11 20:25:41,743 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=822691.3333333334, ans=0.125 2023-10-11 20:26:22,463 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=822831.3333333334, ans=0.0 2023-10-11 20:26:32,950 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=822878.0, ans=0.125 2023-10-11 20:26:33,050 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=822878.0, ans=0.2 2023-10-11 20:26:37,820 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 20:26:49,685 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.41 vs. limit=22.5 2023-10-11 20:26:59,597 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=822971.3333333334, ans=0.0 2023-10-11 20:27:02,362 INFO [train.py:1031] (3/4) Epoch 13, batch 12500, loss[loss=0.1962, simple_loss=0.2896, pruned_loss=0.05138, over 16856.00 frames. ], tot_loss[loss=0.1988, simple_loss=0.2886, pruned_loss=0.05455, over 32762659.04 frames. 
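
The tot_loss triplet reported by train.py above (loss, simple_loss, pruned_loss) is consistent with a fixed linear combination once warm-up is over: 0.5 * 0.2886 + 0.05455 = 0.19885, i.e. the logged loss=0.1988 up to display rounding, and the same relation holds for every batch summary in this stretch of the run. A minimal sketch of that bookkeeping, assuming a constant simple-loss scale of 0.5 and a pruned-loss scale of 1.0 (the warm-up ramp the real recipe applies to these scales is omitted, and the scales here are inferred from the logged numbers, not taken from the training code):

    # Hedged sketch: how the logged `loss` appears to relate to `simple_loss`
    # and `pruned_loss` in the tot_loss summaries above.
    def combined_loss(simple_loss: float, pruned_loss: float,
                      simple_loss_scale: float = 0.5,
                      pruned_loss_scale: float = 1.0) -> float:
        return simple_loss_scale * simple_loss + pruned_loss_scale * pruned_loss

    # Reproduces the Epoch 13, batch 12500 running totals above:
    assert abs(combined_loss(0.2886, 0.05455) - 0.1988) < 1e-3
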
], batch size: 155, lr: 2.67e-03, grad_scale: 64.0 2023-10-11 20:27:14,122 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.406e+02 1.690e+02 1.886e+02 2.076e+02 3.499e+02, threshold=3.772e+02, percent-clipped=0.0 2023-10-11 20:27:24,151 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=823111.3333333334, ans=0.125 2023-10-11 20:27:25,048 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=823111.3333333334, ans=0.0 2023-10-11 20:28:40,658 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.34 vs. limit=15.0 2023-10-11 20:29:06,402 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.454e+02 1.659e+02 1.791e+02 2.035e+02 4.092e+02, threshold=3.582e+02, percent-clipped=1.0 2023-10-11 20:29:32,088 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=823624.6666666666, ans=10.0 2023-10-11 20:29:35,127 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=823624.6666666666, ans=0.0 2023-10-11 20:29:57,597 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=823718.0, ans=0.0 2023-10-11 20:30:08,157 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=823764.6666666666, ans=0.125 2023-10-11 20:30:25,790 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=3.97 vs. limit=12.0 2023-10-11 20:30:33,778 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=823858.0, ans=0.0 2023-10-11 20:30:37,429 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=823904.6666666666, ans=0.5 2023-10-11 20:31:01,418 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.82 vs. 
limit=15.0 2023-10-11 20:31:03,073 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.375e+02 1.656e+02 1.828e+02 2.081e+02 3.632e+02, threshold=3.655e+02, percent-clipped=1.0 2023-10-11 20:31:13,332 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=824044.6666666666, ans=0.125 2023-10-11 20:31:19,216 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=2.531e-03 2023-10-11 20:31:30,575 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=824091.3333333334, ans=0.125 2023-10-11 20:31:32,528 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=824091.3333333334, ans=0.125 2023-10-11 20:31:54,169 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=824184.6666666666, ans=0.125 2023-10-11 20:32:21,215 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=824324.6666666666, ans=0.1 2023-10-11 20:32:26,754 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=824324.6666666666, ans=0.2 2023-10-11 20:32:36,556 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=824371.3333333334, ans=0.0 2023-10-11 20:32:40,503 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=824371.3333333334, ans=0.125 2023-10-11 20:32:47,241 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=824418.0, ans=0.0 2023-10-11 20:32:51,216 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=824418.0, ans=0.04949747468305833 2023-10-11 20:32:55,042 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.415e+02 1.729e+02 1.860e+02 2.054e+02 3.071e+02, threshold=3.720e+02, percent-clipped=0.0 2023-10-11 20:33:02,874 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 20:33:10,337 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.01 vs. limit=15.0 2023-10-11 20:33:51,414 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.53 vs. limit=15.0 2023-10-11 20:34:05,600 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.64 vs. 
limit=6.0 2023-10-11 20:34:08,011 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=824744.6666666666, ans=0.025 2023-10-11 20:34:11,438 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=824791.3333333334, ans=0.025 2023-10-11 20:34:11,514 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=824791.3333333334, ans=0.125 2023-10-11 20:34:24,903 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 20:34:25,925 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=824838.0, ans=0.07 2023-10-11 20:34:38,539 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=824884.6666666666, ans=0.125 2023-10-11 20:34:48,636 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.326e+02 1.658e+02 1.901e+02 2.250e+02 3.227e+02, threshold=3.802e+02, percent-clipped=0.0 2023-10-11 20:34:50,165 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.49 vs. limit=22.5 2023-10-11 20:35:09,771 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=825024.6666666666, ans=0.125 2023-10-11 20:35:15,668 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=825024.6666666666, ans=0.1 2023-10-11 20:35:19,021 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=825071.3333333334, ans=0.1 2023-10-11 20:35:29,329 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.40 vs. limit=10.0 2023-10-11 20:35:35,186 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=825118.0, ans=0.05 2023-10-11 20:35:39,076 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=825164.6666666666, ans=0.125 2023-10-11 20:35:46,385 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=825164.6666666666, ans=0.1 2023-10-11 20:35:51,303 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 20:36:23,750 INFO [train.py:1031] (3/4) Epoch 13, batch 13000, loss[loss=0.2004, simple_loss=0.2925, pruned_loss=0.05418, over 16753.00 frames. ], tot_loss[loss=0.1992, simple_loss=0.2892, pruned_loss=0.05464, over 32773281.98 frames. 
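
Each optim.py "Clipping_scale=2.0, grad-norm quartiles ..." line lists five statistics of recent gradient norms (apparently min/25%/50%/75%/max), and in every entry in this section the reported threshold is exactly Clipping_scale times the median, e.g. 2.0 * 1.901e+02 = 3.802e+02 in the entry just above. A short sketch of that diagnostic, under the assumption that the threshold really is derived from the median this way; recent_norms is a hypothetical buffer of per-step gradient norms, not an actual variable from optim.py:

    import torch

    def clipping_report(recent_norms: torch.Tensor, clipping_scale: float = 2.0):
        # Five-point summary, in the order printed by the log lines.
        q = torch.quantile(
            recent_norms,
            torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0], dtype=recent_norms.dtype))
        # Threshold = clipping_scale * median, as the logged values suggest.
        threshold = clipping_scale * q[2].item()
        # Fraction of recent steps whose norm exceeded the threshold,
        # reported as "percent-clipped" in the log.
        percent_clipped = 100.0 * (recent_norms > threshold).float().mean().item()
        return q.tolist(), threshold, percent_clipped
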
], batch size: 202, lr: 2.67e-03, grad_scale: 16.0 2023-10-11 20:36:33,302 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=825351.3333333334, ans=0.0 2023-10-11 20:36:38,338 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.422e+02 1.701e+02 1.901e+02 2.210e+02 3.263e+02, threshold=3.802e+02, percent-clipped=0.0 2023-10-11 20:37:10,266 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=825491.3333333334, ans=0.125 2023-10-11 20:37:15,914 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=825538.0, ans=0.125 2023-10-11 20:37:33,193 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=825584.6666666666, ans=0.125 2023-10-11 20:37:51,303 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.04 vs. limit=22.5 2023-10-11 20:37:57,973 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=825678.0, ans=0.125 2023-10-11 20:38:07,391 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten.whitening_limit, batch_count=825724.6666666666, ans=15.0 2023-10-11 20:38:09,008 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=825724.6666666666, ans=0.2 2023-10-11 20:38:19,690 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=825771.3333333334, ans=0.0 2023-10-11 20:38:27,346 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.65 vs. limit=15.0 2023-10-11 20:38:34,693 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=825818.0, ans=0.0 2023-10-11 20:38:40,935 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.354e+02 1.691e+02 1.888e+02 2.203e+02 3.476e+02, threshold=3.775e+02, percent-clipped=0.0 2023-10-11 20:39:08,889 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=825958.0, ans=0.125 2023-10-11 20:39:19,048 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=826004.6666666666, ans=0.125 2023-10-11 20:39:42,776 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.60 vs. 
limit=6.0 2023-10-11 20:39:52,445 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=826144.6666666666, ans=0.125 2023-10-11 20:39:53,330 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=826144.6666666666, ans=0.0 2023-10-11 20:40:09,811 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=826191.3333333334, ans=0.0 2023-10-11 20:40:11,357 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=826191.3333333334, ans=0.0 2023-10-11 20:40:17,637 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=826238.0, ans=0.125 2023-10-11 20:40:21,219 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=826238.0, ans=0.025 2023-10-11 20:40:29,340 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=826284.6666666666, ans=0.05 2023-10-11 20:40:37,767 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.410e+02 1.735e+02 1.902e+02 2.150e+02 2.822e+02, threshold=3.805e+02, percent-clipped=0.0 2023-10-11 20:40:40,068 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.34 vs. limit=10.0 2023-10-11 20:40:47,711 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=826378.0, ans=0.125 2023-10-11 20:40:59,753 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.91 vs. limit=15.0 2023-10-11 20:41:01,005 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=7.92 vs. limit=22.5 2023-10-11 20:41:29,222 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=826564.6666666666, ans=0.125 2023-10-11 20:41:36,273 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.74 vs. limit=15.0 2023-10-11 20:42:02,983 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.64 vs. 
limit=15.0 2023-10-11 20:42:13,346 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=826704.6666666666, ans=0.2 2023-10-11 20:42:26,639 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=826798.0, ans=0.2 2023-10-11 20:42:28,764 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=826798.0, ans=15.0 2023-10-11 20:42:30,528 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.443e+02 1.695e+02 1.849e+02 2.048e+02 2.577e+02, threshold=3.698e+02, percent-clipped=0.0 2023-10-11 20:42:33,013 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.93 vs. limit=15.0 2023-10-11 20:42:57,295 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=826891.3333333334, ans=0.2 2023-10-11 20:42:58,560 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=826938.0, ans=0.125 2023-10-11 20:43:11,365 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=826984.6666666666, ans=0.0 2023-10-11 20:43:22,838 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=827031.3333333334, ans=0.0 2023-10-11 20:43:22,853 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=827031.3333333334, ans=0.125 2023-10-11 20:43:58,699 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=827171.3333333334, ans=0.125 2023-10-11 20:44:03,083 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=827171.3333333334, ans=0.1 2023-10-11 20:44:22,119 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.447e+02 1.721e+02 1.914e+02 2.155e+02 2.948e+02, threshold=3.827e+02, percent-clipped=0.0 2023-10-11 20:44:23,251 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=827264.6666666666, ans=0.1 2023-10-11 20:44:31,359 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.min_positive, batch_count=827311.3333333334, ans=0.025 2023-10-11 20:44:31,540 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.17 vs. limit=15.0 2023-10-11 20:44:41,904 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=827358.0, ans=0.025 2023-10-11 20:44:54,930 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.84 vs. 
limit=22.5 2023-10-11 20:44:59,632 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=827404.6666666666, ans=0.125 2023-10-11 20:45:01,703 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=827451.3333333334, ans=0.0 2023-10-11 20:45:10,549 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=827451.3333333334, ans=0.5 2023-10-11 20:45:19,373 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=827498.0, ans=0.1 2023-10-11 20:45:19,715 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.43 vs. limit=15.0 2023-10-11 20:45:29,518 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=827544.6666666666, ans=0.1 2023-10-11 20:45:31,284 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=827544.6666666666, ans=0.0 2023-10-11 20:45:38,651 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.60 vs. limit=6.0 2023-10-11 20:45:56,080 INFO [train.py:1031] (3/4) Epoch 13, batch 13500, loss[loss=0.2009, simple_loss=0.289, pruned_loss=0.05644, over 16702.00 frames. ], tot_loss[loss=0.1988, simple_loss=0.2886, pruned_loss=0.05453, over 32772658.86 frames. ], batch size: 202, lr: 2.67e-03, grad_scale: 32.0 2023-10-11 20:46:00,437 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.12 vs. limit=15.0 2023-10-11 20:46:06,506 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=827731.3333333334, ans=0.1 2023-10-11 20:46:06,775 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.48 vs. 
limit=15.0 2023-10-11 20:46:08,408 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=827731.3333333334, ans=0.125 2023-10-11 20:46:09,045 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.358e+02 1.744e+02 1.989e+02 2.497e+02 3.903e+02, threshold=3.978e+02, percent-clipped=1.0 2023-10-11 20:46:11,298 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=827731.3333333334, ans=0.2 2023-10-11 20:46:26,351 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=827778.0, ans=0.0 2023-10-11 20:46:33,984 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=827824.6666666666, ans=0.125 2023-10-11 20:47:01,283 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=827964.6666666666, ans=0.0 2023-10-11 20:47:03,582 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=827964.6666666666, ans=0.0 2023-10-11 20:47:24,875 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=828058.0, ans=0.125 2023-10-11 20:47:38,503 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.99 vs. limit=15.0 2023-10-11 20:47:50,138 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=828151.3333333334, ans=0.125 2023-10-11 20:47:58,787 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=828198.0, ans=0.125 2023-10-11 20:47:59,522 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.419e+02 1.693e+02 1.889e+02 2.273e+02 3.403e+02, threshold=3.778e+02, percent-clipped=0.0 2023-10-11 20:48:10,715 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=828244.6666666666, ans=0.2 2023-10-11 20:49:11,140 INFO [train.py:1031] (3/4) Epoch 14, batch 0, loss[loss=0.1913, simple_loss=0.277, pruned_loss=0.0528, over 15579.00 frames. ], tot_loss[loss=0.1913, simple_loss=0.277, pruned_loss=0.0528, over 15579.00 frames. ], batch size: 35, lr: 2.56e-03, grad_scale: 32.0 2023-10-11 20:49:11,141 INFO [train.py:1054] (3/4) Computing validation loss 2023-10-11 20:49:19,450 INFO [train.py:1063] (3/4) Epoch 14, validation: loss=0.2166, simple_loss=0.3041, pruned_loss=0.06458, over 1020973.00 frames. 2023-10-11 20:49:19,450 INFO [train.py:1064] (3/4) Maximum memory allocated so far is 16891MB 2023-10-11 20:49:40,278 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.84 vs. limit=12.0 2023-10-11 20:49:52,644 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=828501.3333333334, ans=0.0 2023-10-11 20:50:04,673 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.64 vs. 
limit=15.0 2023-10-11 20:50:05,768 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.34 vs. limit=22.5 2023-10-11 20:50:06,727 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.99 vs. limit=15.0 2023-10-11 20:50:24,857 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=828641.3333333334, ans=0.125 2023-10-11 20:50:26,568 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.350e+02 1.672e+02 1.844e+02 2.142e+02 3.501e+02, threshold=3.689e+02, percent-clipped=0.0 2023-10-11 20:50:29,915 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=828688.0, ans=0.0 2023-10-11 20:50:36,787 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 20:50:46,196 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=828734.6666666666, ans=0.2 2023-10-11 20:50:51,570 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=828734.6666666666, ans=0.0 2023-10-11 20:51:07,187 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=828828.0, ans=0.0 2023-10-11 20:51:22,854 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=828874.6666666666, ans=0.125 2023-10-11 20:51:35,449 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=828921.3333333334, ans=0.125 2023-10-11 20:51:51,463 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=829014.6666666666, ans=0.0 2023-10-11 20:51:59,741 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.11 vs. 
limit=12.0 2023-10-11 20:52:10,267 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=829061.3333333334, ans=0.0 2023-10-11 20:52:18,957 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=829108.0, ans=0.125 2023-10-11 20:52:19,793 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=829108.0, ans=0.1 2023-10-11 20:52:22,286 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.358e+02 1.671e+02 1.824e+02 1.992e+02 2.679e+02, threshold=3.647e+02, percent-clipped=0.0 2023-10-11 20:52:34,966 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=829201.3333333334, ans=0.2 2023-10-11 20:52:57,927 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=829294.6666666666, ans=0.05 2023-10-11 20:53:02,298 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=829294.6666666666, ans=0.125 2023-10-11 20:53:03,199 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.33 vs. limit=15.0 2023-10-11 20:53:04,104 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=829294.6666666666, ans=0.125 2023-10-11 20:53:08,220 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=829341.3333333334, ans=0.125 2023-10-11 20:53:25,799 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=829388.0, ans=0.125 2023-10-11 20:53:26,558 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 20:53:26,600 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=829388.0, ans=0.0 2023-10-11 20:53:29,076 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=829388.0, ans=0.1 2023-10-11 20:54:08,127 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=829528.0, ans=0.125 2023-10-11 20:54:16,169 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=829574.6666666666, ans=0.125 2023-10-11 20:54:19,292 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.356e+02 1.662e+02 1.850e+02 2.053e+02 2.754e+02, threshold=3.701e+02, percent-clipped=0.0 2023-10-11 20:54:50,279 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=829714.6666666666, ans=0.125 2023-10-11 20:54:57,585 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=829761.3333333334, ans=0.125 2023-10-11 20:54:57,789 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.80 vs. 
limit=15.0 2023-10-11 20:55:17,294 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=829808.0, ans=0.5 2023-10-11 20:55:51,874 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=829994.6666666666, ans=0.0 2023-10-11 20:56:02,847 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=830041.3333333334, ans=0.125 2023-10-11 20:56:11,322 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.434e+02 1.687e+02 1.850e+02 2.133e+02 2.932e+02, threshold=3.699e+02, percent-clipped=0.0 2023-10-11 20:56:11,886 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.75 vs. limit=22.5 2023-10-11 20:56:15,261 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=830088.0, ans=10.0 2023-10-11 20:56:17,658 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=830088.0, ans=0.1 2023-10-11 20:56:32,990 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=830181.3333333334, ans=0.1 2023-10-11 20:56:36,988 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.43 vs. limit=15.0 2023-10-11 20:56:51,595 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=830228.0, ans=0.05 2023-10-11 20:57:07,863 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.28 vs. limit=15.0 2023-10-11 20:57:17,817 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=830368.0, ans=0.1 2023-10-11 20:57:32,188 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=830414.6666666666, ans=0.025 2023-10-11 20:57:32,204 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=830414.6666666666, ans=0.125 2023-10-11 20:57:32,394 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.50 vs. limit=15.0 2023-10-11 20:57:44,071 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.77 vs. 
limit=6.0 2023-10-11 20:57:49,237 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=830461.3333333334, ans=0.1 2023-10-11 20:57:50,695 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=830508.0, ans=0.09899494936611666 2023-10-11 20:58:01,724 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=830508.0, ans=0.125 2023-10-11 20:58:01,786 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=830508.0, ans=0.125 2023-10-11 20:58:02,018 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.55 vs. limit=15.0 2023-10-11 20:58:02,384 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.401e+02 1.794e+02 2.063e+02 2.429e+02 3.388e+02, threshold=4.125e+02, percent-clipped=0.0 2023-10-11 20:58:22,277 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=830601.3333333334, ans=0.125 2023-10-11 20:58:22,579 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.85 vs. limit=15.0 2023-10-11 20:58:51,643 INFO [train.py:1031] (3/4) Epoch 14, batch 500, loss[loss=0.1823, simple_loss=0.2707, pruned_loss=0.04697, over 16929.00 frames. ], tot_loss[loss=0.1986, simple_loss=0.2883, pruned_loss=0.05445, over 7297193.01 frames. ], batch size: 123, lr: 2.56e-03, grad_scale: 32.0 2023-10-11 20:58:54,093 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=11.28 vs. limit=22.5 2023-10-11 20:58:57,685 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=830741.3333333334, ans=0.125 2023-10-11 20:59:03,773 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=830788.0, ans=0.125 2023-10-11 20:59:05,804 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=830788.0, ans=0.125 2023-10-11 20:59:15,680 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.39 vs. limit=22.5 2023-10-11 20:59:17,316 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.98 vs. 
limit=6.0 2023-10-11 20:59:36,661 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=830928.0, ans=0.125 2023-10-11 20:59:39,233 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=830928.0, ans=0.125 2023-10-11 20:59:50,911 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=830974.6666666666, ans=0.1 2023-10-11 20:59:52,013 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=830974.6666666666, ans=0.125 2023-10-11 20:59:54,587 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.394e+02 1.703e+02 1.898e+02 2.181e+02 3.303e+02, threshold=3.795e+02, percent-clipped=0.0 2023-10-11 21:00:00,480 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=831021.3333333334, ans=0.2 2023-10-11 21:00:07,586 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.78 vs. limit=15.0 2023-10-11 21:00:16,696 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=831068.0, ans=0.125 2023-10-11 21:00:18,911 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.22 vs. limit=15.0 2023-10-11 21:00:44,835 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=831208.0, ans=0.125 2023-10-11 21:00:53,578 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=831208.0, ans=0.0 2023-10-11 21:01:07,814 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=831301.3333333334, ans=0.04949747468305833 2023-10-11 21:01:20,818 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.57 vs. limit=15.0 2023-10-11 21:01:32,540 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=831394.6666666666, ans=0.125 2023-10-11 21:01:36,415 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.30 vs. limit=15.0 2023-10-11 21:01:43,016 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.88 vs. 
limit=22.5 2023-10-11 21:01:47,467 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=831441.3333333334, ans=0.0 2023-10-11 21:01:49,411 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.462e+02 1.712e+02 1.886e+02 2.064e+02 3.139e+02, threshold=3.772e+02, percent-clipped=0.0 2023-10-11 21:02:02,461 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=831534.6666666666, ans=0.125 2023-10-11 21:02:13,160 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=831581.3333333334, ans=0.125 2023-10-11 21:02:13,423 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.82 vs. limit=15.0 2023-10-11 21:02:20,572 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=831581.3333333334, ans=0.125 2023-10-11 21:02:35,640 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.max_positive, batch_count=831674.6666666666, ans=0.95 2023-10-11 21:02:39,501 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=831674.6666666666, ans=0.2 2023-10-11 21:02:44,926 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=831721.3333333334, ans=0.1 2023-10-11 21:03:24,027 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=831861.3333333334, ans=0.125 2023-10-11 21:03:29,125 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.08 vs. limit=15.0 2023-10-11 21:03:37,043 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=831908.0, ans=0.125 2023-10-11 21:03:41,590 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.443e+02 1.735e+02 1.856e+02 2.090e+02 2.929e+02, threshold=3.713e+02, percent-clipped=0.0 2023-10-11 21:04:06,331 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=832048.0, ans=0.1 2023-10-11 21:04:10,226 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=832048.0, ans=0.125 2023-10-11 21:04:21,024 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=832094.6666666666, ans=0.125 2023-10-11 21:04:59,076 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=832234.6666666666, ans=0.125 2023-10-11 21:05:30,457 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.37 vs. 
limit=15.0 2023-10-11 21:05:38,066 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.488e+02 1.717e+02 1.854e+02 2.077e+02 2.627e+02, threshold=3.708e+02, percent-clipped=0.0 2023-10-11 21:05:46,797 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=832421.3333333334, ans=0.2 2023-10-11 21:06:07,503 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=832514.6666666666, ans=0.125 2023-10-11 21:06:36,566 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=832654.6666666666, ans=0.125 2023-10-11 21:06:53,883 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=832701.3333333334, ans=0.0 2023-10-11 21:06:57,663 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=832748.0, ans=0.125 2023-10-11 21:07:34,076 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.344e+02 1.619e+02 1.755e+02 1.906e+02 3.111e+02, threshold=3.510e+02, percent-clipped=0.0 2023-10-11 21:07:38,344 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=832888.0, ans=0.2 2023-10-11 21:07:49,887 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=832934.6666666666, ans=0.0 2023-10-11 21:08:01,179 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=832934.6666666666, ans=0.0 2023-10-11 21:08:07,481 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=832981.3333333334, ans=0.125 2023-10-11 21:08:16,000 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=833028.0, ans=0.1 2023-10-11 21:08:19,809 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=833028.0, ans=0.125 2023-10-11 21:08:25,200 INFO [train.py:1031] (3/4) Epoch 14, batch 1000, loss[loss=0.1795, simple_loss=0.2749, pruned_loss=0.042, over 16806.00 frames. ], tot_loss[loss=0.1995, simple_loss=0.2892, pruned_loss=0.05483, over 12936561.79 frames. 
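
The scaling.py ScheduledFloat entries that dominate this log record schedule values (skip rates, dropout probabilities, scale_min floors, balancer probs) as a function of batch_count; by this point in the run most of them have settled at their final constants (0.0, 0.1, 0.125, 0.2, ...). A minimal stand-in for such a schedule, written as piecewise-linear interpolation between (batch_count, value) breakpoints and clamped at both ends; the class name and the example breakpoints are illustrative assumptions, not the recipe's actual definitions:

    # Hedged sketch of a batch-count-indexed schedule like the ScheduledFloat
    # values above: piecewise-linear between breakpoints, constant outside them.
    class ScheduledValue:
        def __init__(self, *points):
            self.points = sorted(points)  # (batch_count, value) pairs

        def __call__(self, batch_count: float) -> float:
            pts = self.points
            if batch_count <= pts[0][0]:
                return pts[0][1]
            for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
                if batch_count <= x1:
                    return y0 + (batch_count - x0) / (x1 - x0) * (y1 - y0)
            return pts[-1][1]

    # Illustrative: a skip rate ramping from 0.5 down to 0.0 over the first
    # 20k batches would read 0.0 at the batch counts logged in this section.
    conv_skip_rate = ScheduledValue((0.0, 0.5), (20000.0, 0.0))
    assert conv_skip_rate(830_000.0) == 0.0
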
], batch size: 98, lr: 2.55e-03, grad_scale: 32.0 2023-10-11 21:08:27,274 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=833074.6666666666, ans=0.1 2023-10-11 21:08:27,342 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=833074.6666666666, ans=0.125 2023-10-11 21:08:32,946 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=833074.6666666666, ans=0.125 2023-10-11 21:08:34,799 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=833121.3333333334, ans=0.125 2023-10-11 21:08:36,724 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=833121.3333333334, ans=0.2 2023-10-11 21:08:38,696 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=833121.3333333334, ans=0.035 2023-10-11 21:08:47,004 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=833168.0, ans=0.125 2023-10-11 21:08:57,564 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=833214.6666666666, ans=0.125 2023-10-11 21:09:07,755 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=833261.3333333334, ans=0.0 2023-10-11 21:09:12,031 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.75 vs. limit=15.0 2023-10-11 21:09:29,404 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.406e+02 1.660e+02 1.812e+02 2.035e+02 2.588e+02, threshold=3.625e+02, percent-clipped=0.0 2023-10-11 21:09:33,315 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=833354.6666666666, ans=0.125 2023-10-11 21:09:43,596 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=833401.3333333334, ans=0.0 2023-10-11 21:09:44,570 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=833401.3333333334, ans=0.0 2023-10-11 21:09:45,479 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=833401.3333333334, ans=0.035 2023-10-11 21:09:48,537 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.01 vs. limit=15.0 2023-10-11 21:10:07,578 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=833494.6666666666, ans=0.1 2023-10-11 21:11:19,054 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.00 vs. 
limit=22.5 2023-10-11 21:11:24,165 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.334e+02 1.786e+02 2.012e+02 2.341e+02 3.564e+02, threshold=4.023e+02, percent-clipped=0.0 2023-10-11 21:11:51,746 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=833868.0, ans=0.0 2023-10-11 21:11:56,276 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=833914.6666666666, ans=0.0 2023-10-11 21:12:11,445 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=833961.3333333334, ans=0.125 2023-10-11 21:12:16,257 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=833961.3333333334, ans=0.2 2023-10-11 21:12:22,747 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=834008.0, ans=0.125 2023-10-11 21:12:26,756 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=834008.0, ans=0.125 2023-10-11 21:12:48,003 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=834101.3333333334, ans=0.125 2023-10-11 21:12:51,332 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.22 vs. limit=15.0 2023-10-11 21:13:06,639 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.02 vs. limit=15.0 2023-10-11 21:13:11,095 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.10 vs. limit=15.0 2023-10-11 21:13:18,435 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=834241.3333333334, ans=0.125 2023-10-11 21:13:28,364 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.392e+02 1.648e+02 1.837e+02 2.021e+02 2.785e+02, threshold=3.675e+02, percent-clipped=0.0 2023-10-11 21:13:30,118 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 21:13:39,157 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=834334.6666666666, ans=0.1 2023-10-11 21:13:41,045 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=834334.6666666666, ans=0.2 2023-10-11 21:13:45,530 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=834334.6666666666, ans=0.04949747468305833 2023-10-11 21:13:53,371 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.85 vs. limit=22.5 2023-10-11 21:14:32,330 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.99 vs. 
limit=15.0 2023-10-11 21:14:46,708 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=834614.6666666666, ans=0.1 2023-10-11 21:15:06,161 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=834708.0, ans=0.125 2023-10-11 21:15:14,620 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.403e+02 1.786e+02 2.004e+02 2.395e+02 3.657e+02, threshold=4.007e+02, percent-clipped=0.0 2023-10-11 21:15:15,137 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.20 vs. limit=15.0 2023-10-11 21:15:16,201 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.39 vs. limit=15.0 2023-10-11 21:15:17,779 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=834754.6666666666, ans=0.125 2023-10-11 21:15:18,735 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=834754.6666666666, ans=0.1 2023-10-11 21:15:25,331 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.21 vs. limit=22.5 2023-10-11 21:15:47,576 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=834848.0, ans=0.125 2023-10-11 21:15:48,720 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=834848.0, ans=0.2 2023-10-11 21:15:54,596 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_ff2.min_abs, batch_count=834848.0, ans=0.1 2023-10-11 21:16:06,621 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=834894.6666666666, ans=0.0 2023-10-11 21:16:08,847 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=834894.6666666666, ans=0.125 2023-10-11 21:16:15,348 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=834941.3333333334, ans=0.125 2023-10-11 21:16:16,477 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=6.30 vs. limit=15.0 2023-10-11 21:16:38,175 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=835034.6666666666, ans=0.125 2023-10-11 21:16:41,673 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=835034.6666666666, ans=0.1 2023-10-11 21:16:52,159 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.72 vs. 
limit=6.0 2023-10-11 21:17:01,152 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=835128.0, ans=0.125 2023-10-11 21:17:07,563 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=835174.6666666666, ans=0.125 2023-10-11 21:17:22,482 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.397e+02 1.722e+02 1.882e+02 2.150e+02 2.876e+02, threshold=3.764e+02, percent-clipped=0.0 2023-10-11 21:17:37,951 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=835268.0, ans=0.07 2023-10-11 21:17:44,915 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=835314.6666666666, ans=0.0 2023-10-11 21:17:45,221 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.26 vs. limit=10.0 2023-10-11 21:17:58,003 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=835361.3333333334, ans=0.125 2023-10-11 21:18:02,441 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=835361.3333333334, ans=0.2 2023-10-11 21:18:04,234 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.66 vs. limit=12.0 2023-10-11 21:18:08,804 INFO [train.py:1031] (3/4) Epoch 14, batch 1500, loss[loss=0.1955, simple_loss=0.2797, pruned_loss=0.05568, over 16843.00 frames. ], tot_loss[loss=0.1976, simple_loss=0.2875, pruned_loss=0.05381, over 17348779.82 frames. ], batch size: 72, lr: 2.55e-03, grad_scale: 16.0 2023-10-11 21:18:37,465 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=835501.3333333334, ans=0.025 2023-10-11 21:18:47,120 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.05 vs. limit=10.0 2023-10-11 21:18:48,643 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=835548.0, ans=0.125 2023-10-11 21:18:49,647 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=835548.0, ans=0.2 2023-10-11 21:18:50,021 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.15 vs. 
limit=15.0 2023-10-11 21:19:02,958 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=835594.6666666666, ans=0.0 2023-10-11 21:19:08,769 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=835641.3333333334, ans=0.1 2023-10-11 21:19:12,769 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=835641.3333333334, ans=0.95 2023-10-11 21:19:16,542 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.325e+02 1.664e+02 1.873e+02 2.074e+02 2.666e+02, threshold=3.747e+02, percent-clipped=0.0 2023-10-11 21:19:27,300 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=835734.6666666666, ans=0.07 2023-10-11 21:19:30,263 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.79 vs. limit=22.5 2023-10-11 21:19:33,876 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=835734.6666666666, ans=0.0 2023-10-11 21:19:44,860 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=835781.3333333334, ans=0.1 2023-10-11 21:19:50,355 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=835828.0, ans=0.07 2023-10-11 21:20:14,231 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=835921.3333333334, ans=0.2 2023-10-11 21:20:27,634 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=835968.0, ans=0.0 2023-10-11 21:20:36,803 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=836014.6666666666, ans=0.125 2023-10-11 21:20:47,399 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=836014.6666666666, ans=0.125 2023-10-11 21:20:58,464 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=836061.3333333334, ans=0.1 2023-10-11 21:21:15,608 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.476e+02 1.688e+02 1.815e+02 1.975e+02 2.958e+02, threshold=3.630e+02, percent-clipped=0.0 2023-10-11 21:21:30,300 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=836201.3333333334, ans=0.125 2023-10-11 21:22:20,986 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=836434.6666666666, ans=0.125 2023-10-11 21:22:21,917 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=836434.6666666666, ans=0.125 2023-10-11 21:22:35,076 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=836481.3333333334, ans=0.04949747468305833 2023-10-11 21:22:54,480 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, 
num_groups=1, num_channels=512, metric=7.67 vs. limit=15.0 2023-10-11 21:23:05,213 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.449e+02 1.689e+02 1.892e+02 2.084e+02 3.451e+02, threshold=3.785e+02, percent-clipped=0.0 2023-10-11 21:23:14,971 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.32 vs. limit=15.0 2023-10-11 21:23:15,934 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=836668.0, ans=0.125 2023-10-11 21:23:19,354 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=836668.0, ans=0.125 2023-10-11 21:23:27,199 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.82 vs. limit=15.0 2023-10-11 21:23:44,679 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=836761.3333333334, ans=0.125 2023-10-11 21:23:54,482 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.83 vs. limit=15.0 2023-10-11 21:23:55,567 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=836808.0, ans=0.125 2023-10-11 21:23:57,571 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=836808.0, ans=0.125 2023-10-11 21:24:03,152 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=836808.0, ans=0.0 2023-10-11 21:24:29,119 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.15 vs. limit=15.0 2023-10-11 21:24:34,694 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=836948.0, ans=0.125 2023-10-11 21:24:35,998 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.01 vs. limit=15.0 2023-10-11 21:24:40,427 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=836994.6666666666, ans=0.1 2023-10-11 21:24:43,273 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.99 vs. 
limit=15.0 2023-10-11 21:24:47,464 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=836994.6666666666, ans=0.125 2023-10-11 21:25:01,397 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.447e+02 1.709e+02 1.885e+02 2.107e+02 3.233e+02, threshold=3.770e+02, percent-clipped=0.0 2023-10-11 21:25:10,946 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=837134.6666666666, ans=0.125 2023-10-11 21:25:16,565 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=837134.6666666666, ans=0.125 2023-10-11 21:25:23,344 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=4.02 vs. limit=12.0 2023-10-11 21:25:36,277 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=837228.0, ans=0.125 2023-10-11 21:25:49,975 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=837274.6666666666, ans=0.0 2023-10-11 21:25:59,811 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=837321.3333333334, ans=0.1 2023-10-11 21:26:17,833 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=3.91 vs. limit=10.0 2023-10-11 21:26:23,857 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.53 vs. limit=22.5 2023-10-11 21:27:00,750 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.442e+02 1.714e+02 1.865e+02 2.110e+02 3.389e+02, threshold=3.730e+02, percent-clipped=0.0 2023-10-11 21:27:10,261 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=837554.6666666666, ans=0.1 2023-10-11 21:27:19,827 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=837601.3333333334, ans=0.07 2023-10-11 21:27:19,868 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=837601.3333333334, ans=0.1 2023-10-11 21:27:20,878 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=837601.3333333334, ans=0.0 2023-10-11 21:27:46,935 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=837741.3333333334, ans=0.125 2023-10-11 21:27:47,625 INFO [train.py:1031] (3/4) Epoch 14, batch 2000, loss[loss=0.1857, simple_loss=0.2821, pruned_loss=0.04464, over 16812.00 frames. ], tot_loss[loss=0.198, simple_loss=0.2881, pruned_loss=0.05395, over 20783653.08 frames. 
], batch size: 67, lr: 2.55e-03, grad_scale: 32.0 2023-10-11 21:27:54,330 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=837741.3333333334, ans=0.125 2023-10-11 21:28:18,956 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=837834.6666666666, ans=0.125 2023-10-11 21:28:35,968 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=837881.3333333334, ans=0.125 2023-10-11 21:28:50,621 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=837928.0, ans=0.125 2023-10-11 21:28:56,239 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=837974.6666666666, ans=0.125 2023-10-11 21:29:07,242 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.309e+02 1.698e+02 1.844e+02 2.072e+02 3.190e+02, threshold=3.687e+02, percent-clipped=0.0 2023-10-11 21:29:24,921 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=838068.0, ans=0.125 2023-10-11 21:29:51,006 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=838161.3333333334, ans=0.0 2023-10-11 21:30:24,098 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=838254.6666666666, ans=0.125 2023-10-11 21:30:51,218 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=838348.0, ans=0.125 2023-10-11 21:31:03,699 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.13 vs. limit=10.0 2023-10-11 21:31:06,896 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=838394.6666666666, ans=0.0 2023-10-11 21:31:25,815 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.341e+02 1.686e+02 1.873e+02 2.143e+02 2.836e+02, threshold=3.747e+02, percent-clipped=0.0 2023-10-11 21:32:21,996 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=838674.6666666666, ans=0.125 2023-10-11 21:32:27,816 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=838721.3333333334, ans=0.2 2023-10-11 21:32:38,607 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=838768.0, ans=0.125 2023-10-11 21:32:47,018 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=838814.6666666666, ans=0.0 2023-10-11 21:33:01,126 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=838861.3333333334, ans=0.125 2023-10-11 21:33:02,425 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=6.76 vs. 
limit=15.0 2023-10-11 21:33:20,838 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.511e+02 1.842e+02 1.984e+02 2.247e+02 3.185e+02, threshold=3.968e+02, percent-clipped=0.0 2023-10-11 21:33:24,937 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=838954.6666666666, ans=0.2 2023-10-11 21:33:38,412 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=839001.3333333334, ans=10.0 2023-10-11 21:33:42,678 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=839048.0, ans=0.125 2023-10-11 21:33:59,905 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=839094.6666666666, ans=0.125 2023-10-11 21:34:09,022 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=839141.3333333334, ans=0.125 2023-10-11 21:34:10,076 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=839141.3333333334, ans=0.125 2023-10-11 21:34:30,385 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=839234.6666666666, ans=0.0 2023-10-11 21:34:41,012 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=839281.3333333334, ans=0.125 2023-10-11 21:34:46,219 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.17 vs. limit=22.5 2023-10-11 21:34:47,462 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=839328.0, ans=0.1 2023-10-11 21:34:57,949 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 21:35:00,108 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=839374.6666666666, ans=0.125 2023-10-11 21:35:09,884 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.475e+02 1.800e+02 1.980e+02 2.176e+02 3.241e+02, threshold=3.961e+02, percent-clipped=0.0 2023-10-11 21:35:12,783 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=839421.3333333334, ans=0.0 2023-10-11 21:35:22,819 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=839468.0, ans=0.1 2023-10-11 21:35:52,746 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=839608.0, ans=0.0 2023-10-11 21:36:03,400 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=839654.6666666666, ans=0.0 2023-10-11 21:36:13,512 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=839701.3333333334, ans=0.025 2023-10-11 21:36:31,444 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=839748.0, ans=0.125 2023-10-11 21:36:56,487 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.437e+02 1.786e+02 1.939e+02 2.166e+02 2.700e+02, 
threshold=3.879e+02, percent-clipped=0.0 2023-10-11 21:37:21,834 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=839981.3333333334, ans=0.125 2023-10-11 21:37:32,326 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=840028.0, ans=0.0 2023-10-11 21:37:34,889 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=840028.0, ans=0.125 2023-10-11 21:37:35,010 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=840028.0, ans=0.0 2023-10-11 21:37:38,764 INFO [train.py:1031] (3/4) Epoch 14, batch 2500, loss[loss=0.2445, simple_loss=0.3072, pruned_loss=0.09084, over 15597.00 frames. ], tot_loss[loss=0.1981, simple_loss=0.288, pruned_loss=0.05413, over 23435800.21 frames. ], batch size: 350, lr: 2.54e-03, grad_scale: 32.0 2023-10-11 21:37:43,085 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=840074.6666666666, ans=0.125 2023-10-11 21:38:25,895 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=840261.3333333334, ans=0.0 2023-10-11 21:38:29,886 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=840261.3333333334, ans=0.125 2023-10-11 21:38:32,428 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=840308.0, ans=0.1 2023-10-11 21:38:39,556 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.05 vs. limit=15.0 2023-10-11 21:38:44,588 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.498e+02 1.785e+02 1.981e+02 2.360e+02 2.983e+02, threshold=3.962e+02, percent-clipped=0.0 2023-10-11 21:39:28,214 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.02 vs. limit=15.0 2023-10-11 21:39:48,100 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=840588.0, ans=0.125 2023-10-11 21:40:08,109 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=840681.3333333334, ans=0.125 2023-10-11 21:40:22,949 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=840774.6666666666, ans=0.0 2023-10-11 21:40:23,953 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=840774.6666666666, ans=0.2 2023-10-11 21:40:32,239 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.45 vs. 
limit=5.0 2023-10-11 21:40:34,342 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.460e+02 1.734e+02 1.918e+02 2.136e+02 3.051e+02, threshold=3.835e+02, percent-clipped=0.0 2023-10-11 21:40:57,277 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=840914.6666666666, ans=0.125 2023-10-11 21:41:12,630 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=840961.3333333334, ans=0.125 2023-10-11 21:41:13,551 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=840961.3333333334, ans=0.1 2023-10-11 21:41:15,927 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=841008.0, ans=0.125 2023-10-11 21:41:35,236 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.80 vs. limit=15.0 2023-10-11 21:41:36,926 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.63 vs. limit=22.5 2023-10-11 21:41:37,825 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=841054.6666666666, ans=0.05 2023-10-11 21:41:41,491 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.08 vs. limit=22.5 2023-10-11 21:41:45,697 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=841101.3333333334, ans=0.125 2023-10-11 21:41:45,794 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=841101.3333333334, ans=0.125 2023-10-11 21:42:31,927 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=841241.3333333334, ans=0.2 2023-10-11 21:42:33,140 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=841241.3333333334, ans=0.125 2023-10-11 21:42:37,571 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.463e+02 1.707e+02 1.867e+02 2.076e+02 2.903e+02, threshold=3.734e+02, percent-clipped=0.0 2023-10-11 21:42:38,393 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=841288.0, ans=0.125 2023-10-11 21:42:41,813 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=841288.0, ans=0.125 2023-10-11 21:42:42,666 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=841288.0, ans=0.0 2023-10-11 21:42:51,366 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.67 vs. 
limit=22.5 2023-10-11 21:43:00,813 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=841381.3333333334, ans=0.125 2023-10-11 21:43:11,592 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=841381.3333333334, ans=0.125 2023-10-11 21:43:32,839 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=841474.6666666666, ans=0.0 2023-10-11 21:43:46,493 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=841521.3333333334, ans=0.1 2023-10-11 21:43:46,564 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=841521.3333333334, ans=0.1 2023-10-11 21:44:00,887 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=841614.6666666666, ans=0.125 2023-10-11 21:44:07,377 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=841614.6666666666, ans=0.0 2023-10-11 21:44:07,679 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.56 vs. limit=15.0 2023-10-11 21:44:36,148 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=841708.0, ans=0.0 2023-10-11 21:44:39,579 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.326e+02 1.663e+02 1.846e+02 2.146e+02 2.837e+02, threshold=3.692e+02, percent-clipped=0.0 2023-10-11 21:44:48,620 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=841754.6666666666, ans=0.1 2023-10-11 21:45:47,538 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=841988.0, ans=0.125 2023-10-11 21:46:08,918 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=842081.3333333334, ans=0.125 2023-10-11 21:46:33,709 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=842174.6666666666, ans=0.0 2023-10-11 21:46:41,281 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.327e+02 1.644e+02 1.822e+02 2.006e+02 2.815e+02, threshold=3.644e+02, percent-clipped=0.0 2023-10-11 21:46:45,229 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.05 vs. 
limit=15.0 2023-10-11 21:46:55,031 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=842268.0, ans=0.125 2023-10-11 21:47:03,069 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=842314.6666666666, ans=0.1 2023-10-11 21:47:12,153 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=842314.6666666666, ans=0.125 2023-10-11 21:47:18,104 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=842361.3333333334, ans=0.1 2023-10-11 21:47:21,166 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.48 vs. limit=22.5 2023-10-11 21:47:24,917 INFO [train.py:1031] (3/4) Epoch 14, batch 3000, loss[loss=0.1658, simple_loss=0.2626, pruned_loss=0.03455, over 16945.00 frames. ], tot_loss[loss=0.1977, simple_loss=0.2873, pruned_loss=0.05403, over 25525029.36 frames. ], batch size: 93, lr: 2.54e-03, grad_scale: 32.0 2023-10-11 21:47:26,818 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=842408.0, ans=0.125 2023-10-11 21:47:27,830 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=842408.0, ans=0.125 2023-10-11 21:47:38,294 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.50 vs. limit=15.0 2023-10-11 21:47:47,890 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=842501.3333333334, ans=0.015 2023-10-11 21:48:06,799 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=842548.0, ans=0.125 2023-10-11 21:48:07,950 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=842548.0, ans=0.0 2023-10-11 21:48:21,459 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=842641.3333333334, ans=0.0 2023-10-11 21:48:22,887 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=3.98 vs. 
limit=10.0 2023-10-11 21:48:29,971 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=842641.3333333334, ans=0.125 2023-10-11 21:48:35,548 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.421e+02 1.842e+02 2.010e+02 2.218e+02 3.047e+02, threshold=4.019e+02, percent-clipped=0.0 2023-10-11 21:49:12,267 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=842828.0, ans=0.125 2023-10-11 21:49:30,832 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=842874.6666666666, ans=0.0 2023-10-11 21:49:47,198 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=842921.3333333334, ans=0.1 2023-10-11 21:49:47,234 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 21:49:50,152 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=842968.0, ans=0.125 2023-10-11 21:50:20,162 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=843108.0, ans=0.0 2023-10-11 21:50:33,166 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.408e+02 1.699e+02 1.854e+02 2.054e+02 2.717e+02, threshold=3.708e+02, percent-clipped=0.0 2023-10-11 21:50:41,045 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=843154.6666666666, ans=0.2 2023-10-11 21:50:46,493 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=843201.3333333334, ans=0.125 2023-10-11 21:50:55,685 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.02 vs. limit=15.0 2023-10-11 21:50:58,003 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=843248.0, ans=0.125 2023-10-11 21:51:04,131 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.16 vs. limit=15.0 2023-10-11 21:51:09,192 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=843294.6666666666, ans=0.125 2023-10-11 21:51:21,851 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.85 vs. limit=15.0 2023-10-11 21:51:28,875 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.64 vs. 
limit=15.0 2023-10-11 21:52:15,201 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=843528.0, ans=0.0 2023-10-11 21:52:40,078 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.815e+02 2.048e+02 2.381e+02 3.871e+02, threshold=4.096e+02, percent-clipped=1.0 2023-10-11 21:53:04,097 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=843714.6666666666, ans=0.125 2023-10-11 21:53:15,192 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=843761.3333333334, ans=0.125 2023-10-11 21:53:20,508 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=843761.3333333334, ans=10.0 2023-10-11 21:53:38,972 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=843854.6666666666, ans=0.125 2023-10-11 21:54:04,506 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=843948.0, ans=0.125 2023-10-11 21:54:05,571 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.10 vs. limit=15.0 2023-10-11 21:54:14,268 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.91 vs. limit=15.0 2023-10-11 21:54:19,721 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 21:54:23,507 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=844041.3333333334, ans=0.125 2023-10-11 21:54:32,979 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.383e+02 1.721e+02 1.955e+02 2.202e+02 3.303e+02, threshold=3.911e+02, percent-clipped=0.0 2023-10-11 21:54:51,724 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=844134.6666666666, ans=0.0 2023-10-11 21:54:54,247 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=844134.6666666666, ans=0.95 2023-10-11 21:55:09,166 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=844181.3333333334, ans=0.125 2023-10-11 21:55:13,804 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=844228.0, ans=0.125 2023-10-11 21:55:14,596 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=844228.0, ans=0.1 2023-10-11 21:55:31,069 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=844274.6666666666, ans=0.0 2023-10-11 21:55:39,163 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=844321.3333333334, ans=0.0 2023-10-11 21:55:40,422 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=844321.3333333334, ans=0.0 2023-10-11 21:55:42,752 INFO [scaling.py:199] (3/4) 
ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=844321.3333333334, ans=0.125 2023-10-11 21:55:52,718 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=844368.0, ans=0.05 2023-10-11 21:55:58,773 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=844414.6666666666, ans=0.0 2023-10-11 21:56:03,328 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=844414.6666666666, ans=0.0 2023-10-11 21:56:16,885 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=844508.0, ans=0.125 2023-10-11 21:56:19,357 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=844508.0, ans=0.0 2023-10-11 21:56:29,039 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.444e+02 1.713e+02 1.863e+02 2.076e+02 2.541e+02, threshold=3.725e+02, percent-clipped=0.0 2023-10-11 21:56:36,956 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=844601.3333333334, ans=0.125 2023-10-11 21:56:52,677 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=844648.0, ans=0.0 2023-10-11 21:56:57,195 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.32 vs. limit=15.0 2023-10-11 21:57:10,315 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=844694.6666666666, ans=0.1 2023-10-11 21:57:14,287 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.34 vs. limit=22.5 2023-10-11 21:57:14,519 INFO [train.py:1031] (3/4) Epoch 14, batch 3500, loss[loss=0.2291, simple_loss=0.3064, pruned_loss=0.07591, over 16096.00 frames. ], tot_loss[loss=0.1975, simple_loss=0.287, pruned_loss=0.05403, over 27122714.17 frames. ], batch size: 296, lr: 2.54e-03, grad_scale: 16.0 2023-10-11 21:57:21,458 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.61 vs. limit=12.0 2023-10-11 21:57:26,910 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=844788.0, ans=0.0 2023-10-11 21:57:40,470 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=844834.6666666666, ans=22.5 2023-10-11 21:57:47,696 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=844881.3333333334, ans=0.2 2023-10-11 21:58:06,816 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=844974.6666666666, ans=0.0 2023-10-11 21:58:11,058 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.89 vs. 
limit=15.0 2023-10-11 21:58:19,335 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=845021.3333333334, ans=0.1 2023-10-11 21:58:21,501 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.461e+02 1.770e+02 1.936e+02 2.082e+02 2.703e+02, threshold=3.872e+02, percent-clipped=0.0 2023-10-11 21:58:34,145 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=845068.0, ans=0.125 2023-10-11 21:58:35,719 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.99 vs. limit=10.0 2023-10-11 21:58:44,814 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=845068.0, ans=0.1 2023-10-11 21:59:00,684 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=845161.3333333334, ans=0.125 2023-10-11 21:59:15,249 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.62 vs. limit=15.0 2023-10-11 21:59:57,436 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=845394.6666666666, ans=0.125 2023-10-11 22:00:01,723 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=845394.6666666666, ans=0.0 2023-10-11 22:00:09,748 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=845441.3333333334, ans=0.125 2023-10-11 22:00:12,029 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.85 vs. limit=15.0 2023-10-11 22:00:17,986 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=3.91 vs. limit=15.0 2023-10-11 22:00:20,239 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.338e+02 1.754e+02 1.892e+02 2.190e+02 3.191e+02, threshold=3.785e+02, percent-clipped=0.0 2023-10-11 22:00:20,614 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=845488.0, ans=0.1 2023-10-11 22:00:22,726 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten.whitening_limit, batch_count=845488.0, ans=15.0 2023-10-11 22:00:38,302 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.56 vs. 
limit=22.5 2023-10-11 22:01:00,678 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=845628.0, ans=0.125 2023-10-11 22:01:04,968 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff3.min_abs, batch_count=845674.6666666666, ans=0.2 2023-10-11 22:01:06,890 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=845674.6666666666, ans=0.125 2023-10-11 22:01:14,905 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=845721.3333333334, ans=0.2 2023-10-11 22:01:17,000 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=845721.3333333334, ans=0.125 2023-10-11 22:01:42,805 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=845814.6666666666, ans=0.2 2023-10-11 22:01:50,819 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=7.55 vs. limit=22.5 2023-10-11 22:01:52,657 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.80 vs. limit=12.0 2023-10-11 22:01:58,403 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=845861.3333333334, ans=0.125 2023-10-11 22:02:04,047 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=845861.3333333334, ans=0.04949747468305833 2023-10-11 22:02:04,297 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.07 vs. 
limit=22.5 2023-10-11 22:02:19,569 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.408e+02 1.777e+02 1.932e+02 2.184e+02 3.257e+02, threshold=3.864e+02, percent-clipped=0.0 2023-10-11 22:02:34,166 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 22:02:34,175 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=846001.3333333334, ans=0.125 2023-10-11 22:02:38,993 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=846048.0, ans=0.125 2023-10-11 22:02:39,089 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=846048.0, ans=0.125 2023-10-11 22:02:47,696 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=846048.0, ans=0.125 2023-10-11 22:02:56,927 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=846094.6666666666, ans=0.07 2023-10-11 22:03:08,330 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=846141.3333333334, ans=0.0 2023-10-11 22:03:24,942 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=846188.0, ans=0.0 2023-10-11 22:03:43,998 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=846281.3333333334, ans=0.1 2023-10-11 22:04:08,839 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.71 vs. limit=15.0 2023-10-11 22:04:13,994 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.367e+02 1.673e+02 1.827e+02 2.027e+02 3.473e+02, threshold=3.654e+02, percent-clipped=0.0 2023-10-11 22:04:20,241 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=846421.3333333334, ans=0.125 2023-10-11 22:04:32,289 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=846514.6666666666, ans=10.0 2023-10-11 22:04:52,167 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=846561.3333333334, ans=0.0 2023-10-11 22:04:55,103 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=846608.0, ans=0.125 2023-10-11 22:05:00,775 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.55 vs. limit=10.0 2023-10-11 22:05:01,508 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=846608.0, ans=0.1 2023-10-11 22:05:17,774 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.89 vs. 
limit=15.0 2023-10-11 22:05:21,197 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=846701.3333333334, ans=0.125 2023-10-11 22:05:38,726 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=846794.6666666666, ans=0.2 2023-10-11 22:05:49,004 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=846841.3333333334, ans=0.125 2023-10-11 22:06:02,005 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.449e+02 1.683e+02 1.826e+02 2.205e+02 2.978e+02, threshold=3.653e+02, percent-clipped=0.0 2023-10-11 22:06:02,490 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=846888.0, ans=0.1 2023-10-11 22:06:14,977 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=846934.6666666666, ans=0.0 2023-10-11 22:06:21,006 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=846981.3333333334, ans=0.125 2023-10-11 22:06:22,658 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.06 vs. limit=6.0 2023-10-11 22:06:44,762 INFO [train.py:1031] (3/4) Epoch 14, batch 4000, loss[loss=0.2151, simple_loss=0.3087, pruned_loss=0.06078, over 16633.00 frames. ], tot_loss[loss=0.1976, simple_loss=0.2868, pruned_loss=0.05415, over 28360603.71 frames. ], batch size: 241, lr: 2.53e-03, grad_scale: 32.0 2023-10-11 22:06:55,176 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=847074.6666666666, ans=0.125 2023-10-11 22:07:01,547 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=847121.3333333334, ans=0.125 2023-10-11 22:07:02,797 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=847121.3333333334, ans=0.1 2023-10-11 22:07:14,606 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=847168.0, ans=0.2 2023-10-11 22:07:24,089 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.53 vs. 
limit=15.0 2023-10-11 22:07:26,833 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=847214.6666666666, ans=0.125 2023-10-11 22:07:26,868 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=847214.6666666666, ans=0.125 2023-10-11 22:07:30,447 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=847214.6666666666, ans=0.2 2023-10-11 22:07:58,750 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.412e+02 1.798e+02 1.954e+02 2.171e+02 2.909e+02, threshold=3.908e+02, percent-clipped=0.0 2023-10-11 22:08:12,355 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=847401.3333333334, ans=0.0 2023-10-11 22:08:35,128 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.10 vs. limit=15.0 2023-10-11 22:08:36,521 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.69 vs. limit=22.5 2023-10-11 22:09:01,215 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=847634.6666666666, ans=0.125 2023-10-11 22:09:25,898 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=847728.0, ans=0.0 2023-10-11 22:09:37,653 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.51 vs. limit=15.0 2023-10-11 22:09:48,235 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=847821.3333333334, ans=0.125 2023-10-11 22:09:50,313 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.470e+02 1.705e+02 1.890e+02 2.091e+02 3.406e+02, threshold=3.779e+02, percent-clipped=0.0 2023-10-11 22:09:57,376 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=847821.3333333334, ans=0.0 2023-10-11 22:10:30,209 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=847914.6666666666, ans=0.1 2023-10-11 22:10:33,433 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=847961.3333333334, ans=0.04949747468305833 2023-10-11 22:10:50,294 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=848008.0, ans=0.125 2023-10-11 22:10:59,647 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=848054.6666666666, ans=0.125 2023-10-11 22:10:59,885 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=3.91 vs. 
limit=12.0 2023-10-11 22:11:01,478 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=848054.6666666666, ans=0.125 2023-10-11 22:11:04,314 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=848054.6666666666, ans=0.0 2023-10-11 22:11:08,986 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=848101.3333333334, ans=0.0 2023-10-11 22:11:21,852 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=848148.0, ans=0.125 2023-10-11 22:11:50,098 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=848241.3333333334, ans=0.2 2023-10-11 22:11:58,021 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.308e+02 1.606e+02 1.787e+02 2.023e+02 2.818e+02, threshold=3.574e+02, percent-clipped=0.0 2023-10-11 22:11:58,595 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.14 vs. limit=15.0 2023-10-11 22:12:12,325 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=848334.6666666666, ans=0.09899494936611666 2023-10-11 22:12:17,875 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=848381.3333333334, ans=0.0 2023-10-11 22:12:28,225 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=848428.0, ans=0.125 2023-10-11 22:12:39,373 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=848474.6666666666, ans=0.05 2023-10-11 22:12:47,294 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=848521.3333333334, ans=0.125 2023-10-11 22:12:49,261 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=848521.3333333334, ans=0.125 2023-10-11 22:12:55,117 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=848521.3333333334, ans=0.0 2023-10-11 22:13:09,270 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=848568.0, ans=0.1 2023-10-11 22:13:24,795 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=848661.3333333334, ans=0.0 2023-10-11 22:13:29,943 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=848661.3333333334, ans=0.0 2023-10-11 22:13:48,513 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.444e+02 1.773e+02 1.935e+02 2.158e+02 3.067e+02, threshold=3.870e+02, percent-clipped=0.0 2023-10-11 22:14:08,165 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.74 vs. 
limit=22.5 2023-10-11 22:14:25,454 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=848894.6666666666, ans=0.125 2023-10-11 22:14:28,204 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=848894.6666666666, ans=0.05 2023-10-11 22:14:36,884 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=848941.3333333334, ans=0.2 2023-10-11 22:14:47,967 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=848988.0, ans=0.125 2023-10-11 22:14:49,863 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 22:15:02,534 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=849034.6666666666, ans=0.125 2023-10-11 22:15:29,322 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=849128.0, ans=0.125 2023-10-11 22:15:42,141 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=849174.6666666666, ans=0.125 2023-10-11 22:15:42,487 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.70 vs. limit=15.0 2023-10-11 22:15:49,162 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=849174.6666666666, ans=0.0 2023-10-11 22:15:49,997 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=849221.3333333334, ans=0.05 2023-10-11 22:15:54,222 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.503e+02 1.750e+02 1.895e+02 2.116e+02 3.117e+02, threshold=3.791e+02, percent-clipped=0.0 2023-10-11 22:16:35,044 INFO [train.py:1031] (3/4) Epoch 14, batch 4500, loss[loss=0.1783, simple_loss=0.2501, pruned_loss=0.05325, over 12726.00 frames. ], tot_loss[loss=0.1976, simple_loss=0.2872, pruned_loss=0.05399, over 29360802.60 frames. ], batch size: 440, lr: 2.53e-03, grad_scale: 32.0 2023-10-11 22:16:56,303 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=849501.3333333334, ans=0.125 2023-10-11 22:17:02,118 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.51 vs. limit=15.0 2023-10-11 22:17:18,577 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.05 vs. limit=15.0 2023-10-11 22:17:19,346 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=849594.6666666666, ans=0.0 2023-10-11 22:17:34,550 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.54 vs. 
limit=12.0 2023-10-11 22:17:36,677 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 22:17:41,905 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.417e+02 1.694e+02 1.877e+02 2.131e+02 3.021e+02, threshold=3.754e+02, percent-clipped=0.0 2023-10-11 22:17:58,205 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=849734.6666666666, ans=0.125 2023-10-11 22:18:21,211 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.18 vs. limit=15.0 2023-10-11 22:18:48,858 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=849968.0, ans=0.125 2023-10-11 22:19:01,641 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 22:19:26,475 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.68 vs. limit=15.0 2023-10-11 22:19:28,012 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.442e+02 1.756e+02 1.950e+02 2.181e+02 2.976e+02, threshold=3.900e+02, percent-clipped=0.0 2023-10-11 22:19:29,238 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=850154.6666666666, ans=0.0 2023-10-11 22:19:38,986 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.26 vs. limit=10.0 2023-10-11 22:19:39,934 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=850201.3333333334, ans=0.0 2023-10-11 22:19:45,899 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=850201.3333333334, ans=0.125 2023-10-11 22:19:58,198 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=850248.0, ans=0.125 2023-10-11 22:20:02,729 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.09 vs. limit=15.0 2023-10-11 22:20:12,014 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.81 vs. limit=10.0 2023-10-11 22:20:18,996 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=850341.3333333334, ans=0.125 2023-10-11 22:20:20,546 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.34 vs. limit=15.0 2023-10-11 22:20:26,735 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=850388.0, ans=0.07 2023-10-11 22:20:44,758 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=16.14 vs. 
limit=22.5 2023-10-11 22:20:45,347 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=850481.3333333334, ans=0.125 2023-10-11 22:20:58,545 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=850528.0, ans=0.125 2023-10-11 22:21:00,362 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=850528.0, ans=0.125 2023-10-11 22:21:04,631 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=850574.6666666666, ans=0.0 2023-10-11 22:21:05,615 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=850574.6666666666, ans=0.09899494936611666 2023-10-11 22:21:13,487 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.99 vs. limit=15.0 2023-10-11 22:21:15,118 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=850621.3333333334, ans=0.0 2023-10-11 22:21:18,579 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.349e+02 1.738e+02 1.942e+02 2.238e+02 3.168e+02, threshold=3.885e+02, percent-clipped=0.0 2023-10-11 22:21:38,089 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=850714.6666666666, ans=0.1 2023-10-11 22:21:38,959 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 22:21:42,651 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=850714.6666666666, ans=0.125 2023-10-11 22:21:45,593 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=6.50 vs. 
limit=15.0 2023-10-11 22:21:53,271 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=850761.3333333334, ans=0.1 2023-10-11 22:22:20,071 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=850901.3333333334, ans=0.125 2023-10-11 22:22:23,053 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=850901.3333333334, ans=0.125 2023-10-11 22:22:25,400 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=850901.3333333334, ans=0.125 2023-10-11 22:22:28,905 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=850901.3333333334, ans=0.0 2023-10-11 22:22:33,774 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=850948.0, ans=0.0 2023-10-11 22:22:45,327 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=850994.6666666666, ans=0.125 2023-10-11 22:22:47,930 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=850994.6666666666, ans=0.125 2023-10-11 22:23:12,219 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.419e+02 1.774e+02 1.995e+02 2.118e+02 3.240e+02, threshold=3.989e+02, percent-clipped=0.0 2023-10-11 22:23:16,107 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=851088.0, ans=0.0 2023-10-11 22:23:19,556 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=851134.6666666666, ans=0.0 2023-10-11 22:23:28,562 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.27 vs. limit=12.0 2023-10-11 22:24:02,640 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=851274.6666666666, ans=0.1 2023-10-11 22:24:19,767 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=851368.0, ans=0.0 2023-10-11 22:24:42,344 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=851461.3333333334, ans=0.2 2023-10-11 22:24:45,604 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=851461.3333333334, ans=0.1 2023-10-11 22:24:47,619 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=851461.3333333334, ans=0.125 2023-10-11 22:24:47,708 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=851461.3333333334, ans=0.0 2023-10-11 22:24:52,883 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=851508.0, ans=0.125 2023-10-11 22:25:01,637 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.89 vs. 
limit=22.5 2023-10-11 22:25:07,244 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.352e+02 1.698e+02 1.886e+02 2.130e+02 2.962e+02, threshold=3.771e+02, percent-clipped=0.0 2023-10-11 22:25:07,441 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 22:25:07,555 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=851554.6666666666, ans=0.125 2023-10-11 22:25:21,757 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=851601.3333333334, ans=0.0 2023-10-11 22:25:48,273 INFO [train.py:1031] (3/4) Epoch 14, batch 5000, loss[loss=0.2049, simple_loss=0.2607, pruned_loss=0.07451, over 12441.00 frames. ], tot_loss[loss=0.1979, simple_loss=0.2871, pruned_loss=0.05432, over 30094340.54 frames. ], batch size: 440, lr: 2.53e-03, grad_scale: 32.0 2023-10-11 22:25:51,540 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=851741.3333333334, ans=0.2 2023-10-11 22:25:51,782 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.77 vs. limit=15.0 2023-10-11 22:25:58,281 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=851741.3333333334, ans=0.2 2023-10-11 22:26:00,576 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.66 vs. limit=6.0 2023-10-11 22:26:03,437 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=851788.0, ans=0.125 2023-10-11 22:26:09,253 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.95 vs. limit=15.0 2023-10-11 22:26:31,077 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=851881.3333333334, ans=0.125 2023-10-11 22:26:39,546 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=851928.0, ans=10.0 2023-10-11 22:26:39,575 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=851928.0, ans=0.125 2023-10-11 22:26:41,349 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=851928.0, ans=0.125 2023-10-11 22:26:49,368 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=851974.6666666666, ans=0.2 2023-10-11 22:26:54,524 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=852021.3333333334, ans=0.125 2023-10-11 22:26:54,904 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.60 vs. 
limit=15.0 2023-10-11 22:26:56,466 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=852021.3333333334, ans=0.125 2023-10-11 22:26:57,089 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.418e+02 1.782e+02 2.029e+02 2.371e+02 3.765e+02, threshold=4.058e+02, percent-clipped=0.0 2023-10-11 22:27:00,443 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=852021.3333333334, ans=0.1 2023-10-11 22:27:22,473 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.79 vs. limit=10.0 2023-10-11 22:27:27,584 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=852161.3333333334, ans=0.125 2023-10-11 22:27:30,287 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=852161.3333333334, ans=0.125 2023-10-11 22:27:52,661 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=852208.0, ans=0.025 2023-10-11 22:27:55,103 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.85 vs. limit=15.0 2023-10-11 22:27:59,101 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=852254.6666666666, ans=0.125 2023-10-11 22:28:15,751 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=852301.3333333334, ans=0.1 2023-10-11 22:28:18,466 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.48 vs. limit=15.0 2023-10-11 22:28:23,542 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=852348.0, ans=0.125 2023-10-11 22:28:25,667 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=852348.0, ans=0.125 2023-10-11 22:28:34,314 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=852394.6666666666, ans=0.125 2023-10-11 22:28:55,820 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.487e+02 1.749e+02 1.932e+02 2.163e+02 2.902e+02, threshold=3.865e+02, percent-clipped=0.0 2023-10-11 22:28:58,587 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=852488.0, ans=0.1 2023-10-11 22:29:09,018 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.37 vs. limit=12.0 2023-10-11 22:29:10,762 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.14 vs. 
limit=15.0 2023-10-11 22:29:31,586 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=852628.0, ans=0.0 2023-10-11 22:29:33,153 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=852628.0, ans=0.125 2023-10-11 22:29:36,269 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.46 vs. limit=15.0 2023-10-11 22:29:39,794 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=852674.6666666666, ans=0.2 2023-10-11 22:30:05,273 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=852768.0, ans=0.0 2023-10-11 22:30:07,965 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=852814.6666666666, ans=0.1 2023-10-11 22:30:09,808 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=852814.6666666666, ans=0.0 2023-10-11 22:30:13,941 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=852814.6666666666, ans=0.2 2023-10-11 22:30:15,153 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=4.88 vs. limit=15.0 2023-10-11 22:30:17,299 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=5.13 vs. limit=12.0 2023-10-11 22:30:33,548 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=852908.0, ans=0.125 2023-10-11 22:30:35,308 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.17 vs. 
limit=6.0 2023-10-11 22:30:49,104 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.352e+02 1.718e+02 1.892e+02 2.094e+02 3.758e+02, threshold=3.784e+02, percent-clipped=0.0 2023-10-11 22:31:04,848 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=853001.3333333334, ans=0.2 2023-10-11 22:31:06,746 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=853001.3333333334, ans=0.125 2023-10-11 22:31:15,955 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=853048.0, ans=0.125 2023-10-11 22:31:17,086 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=853048.0, ans=0.2 2023-10-11 22:31:26,982 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=853094.6666666666, ans=0.125 2023-10-11 22:31:28,747 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=853094.6666666666, ans=0.125 2023-10-11 22:31:30,214 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=853094.6666666666, ans=0.0 2023-10-11 22:31:43,034 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.49 vs. limit=22.5 2023-10-11 22:31:45,648 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=853188.0, ans=0.0 2023-10-11 22:31:45,914 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.84 vs. limit=15.0 2023-10-11 22:31:53,252 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=853234.6666666666, ans=0.125 2023-10-11 22:31:56,947 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=853234.6666666666, ans=0.1 2023-10-11 22:32:06,300 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.25 vs. limit=22.5 2023-10-11 22:32:11,931 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.30 vs. limit=15.0 2023-10-11 22:32:15,555 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.11 vs. 
limit=15.0 2023-10-11 22:32:20,867 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=853328.0, ans=0.2 2023-10-11 22:32:31,544 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=853374.6666666666, ans=0.07 2023-10-11 22:32:44,226 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.361e+02 1.688e+02 1.879e+02 2.172e+02 3.902e+02, threshold=3.758e+02, percent-clipped=1.0 2023-10-11 22:32:46,323 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=853421.3333333334, ans=0.125 2023-10-11 22:33:00,172 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=853468.0, ans=0.125 2023-10-11 22:33:01,050 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=853468.0, ans=0.125 2023-10-11 22:33:02,615 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=853514.6666666666, ans=0.5 2023-10-11 22:33:07,779 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=853514.6666666666, ans=0.2 2023-10-11 22:33:29,582 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.99 vs. limit=15.0 2023-10-11 22:33:51,742 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=853701.3333333334, ans=0.0 2023-10-11 22:33:57,099 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=853748.0, ans=0.2 2023-10-11 22:34:00,426 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.66 vs. limit=15.0 2023-10-11 22:34:17,312 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=853841.3333333334, ans=0.1 2023-10-11 22:34:31,470 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.366e+02 1.678e+02 1.838e+02 2.112e+02 2.920e+02, threshold=3.676e+02, percent-clipped=0.0 2023-10-11 22:34:39,105 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 22:34:49,082 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=853934.6666666666, ans=0.125 2023-10-11 22:35:11,691 INFO [train.py:1031] (3/4) Epoch 14, batch 5500, loss[loss=0.2496, simple_loss=0.3123, pruned_loss=0.09342, over 15673.00 frames. ], tot_loss[loss=0.1975, simple_loss=0.2869, pruned_loss=0.05402, over 30725152.61 frames. 
], batch size: 350, lr: 2.52e-03, grad_scale: 32.0 2023-10-11 22:35:31,575 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=854121.3333333334, ans=0.125 2023-10-11 22:35:34,419 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=854168.0, ans=0.0 2023-10-11 22:35:34,531 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=854168.0, ans=0.125 2023-10-11 22:35:38,818 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=854168.0, ans=0.1 2023-10-11 22:35:41,524 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 22:35:51,427 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=854214.6666666666, ans=0.2 2023-10-11 22:35:56,564 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=854261.3333333334, ans=0.125 2023-10-11 22:36:01,792 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=854261.3333333334, ans=0.05 2023-10-11 22:36:03,738 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.61 vs. limit=22.5 2023-10-11 22:36:12,408 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=854308.0, ans=0.035 2023-10-11 22:36:19,736 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.357e+02 1.657e+02 1.796e+02 1.994e+02 2.634e+02, threshold=3.592e+02, percent-clipped=0.0 2023-10-11 22:36:35,285 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=854401.3333333334, ans=0.2 2023-10-11 22:36:40,122 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=854448.0, ans=0.2 2023-10-11 22:36:45,448 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=854448.0, ans=0.125 2023-10-11 22:37:25,698 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=854634.6666666666, ans=0.1 2023-10-11 22:37:34,030 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.49 vs. 
limit=15.0 2023-10-11 22:37:48,800 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 22:37:48,839 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=854728.0, ans=0.125 2023-10-11 22:37:56,058 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=854774.6666666666, ans=0.125 2023-10-11 22:37:57,574 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=854774.6666666666, ans=0.125 2023-10-11 22:38:09,274 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=854821.3333333334, ans=0.2 2023-10-11 22:38:11,013 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.430e+02 1.747e+02 1.902e+02 2.265e+02 3.147e+02, threshold=3.804e+02, percent-clipped=0.0 2023-10-11 22:38:25,761 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=854868.0, ans=0.125 2023-10-11 22:38:27,149 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.89 vs. limit=22.5 2023-10-11 22:38:28,600 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=854868.0, ans=0.07 2023-10-11 22:38:28,602 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=854868.0, ans=0.0 2023-10-11 22:38:44,890 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=854961.3333333334, ans=0.125 2023-10-11 22:39:14,769 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=855054.6666666666, ans=0.2 2023-10-11 22:39:35,829 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=855148.0, ans=0.125 2023-10-11 22:39:50,051 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=855241.3333333334, ans=0.0 2023-10-11 22:39:59,465 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=855288.0, ans=0.0 2023-10-11 22:40:04,378 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.318e+02 1.689e+02 1.843e+02 2.037e+02 2.671e+02, threshold=3.686e+02, percent-clipped=0.0 2023-10-11 22:40:08,731 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=855288.0, ans=0.0 2023-10-11 22:40:12,724 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=855334.6666666666, ans=0.125 2023-10-11 22:40:20,188 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=855334.6666666666, ans=0.04949747468305833 2023-10-11 22:40:48,449 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=855474.6666666666, ans=0.125 2023-10-11 22:40:48,457 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, 
batch_count=855474.6666666666, ans=0.0 2023-10-11 22:40:50,733 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=855474.6666666666, ans=0.2 2023-10-11 22:41:21,176 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=855614.6666666666, ans=0.1 2023-10-11 22:41:33,501 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=855661.3333333334, ans=0.04949747468305833 2023-10-11 22:41:38,236 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=855661.3333333334, ans=0.125 2023-10-11 22:41:38,480 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.69 vs. limit=6.0 2023-10-11 22:41:55,805 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.441e+02 1.697e+02 1.864e+02 2.077e+02 2.805e+02, threshold=3.727e+02, percent-clipped=0.0 2023-10-11 22:42:03,705 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff3.min_abs, batch_count=855801.3333333334, ans=0.2 2023-10-11 22:42:09,734 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.47 vs. limit=15.0 2023-10-11 22:42:13,697 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=855848.0, ans=0.125 2023-10-11 22:42:36,906 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=855941.3333333334, ans=0.1 2023-10-11 22:42:39,658 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.03 vs. limit=15.0 2023-10-11 22:42:40,599 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.95 vs. limit=22.5 2023-10-11 22:42:45,614 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=855941.3333333334, ans=0.125 2023-10-11 22:42:48,545 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=855988.0, ans=0.0 2023-10-11 22:43:37,433 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.81 vs. limit=10.0 2023-10-11 22:43:50,720 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.450e+02 1.719e+02 1.856e+02 2.017e+02 2.807e+02, threshold=3.712e+02, percent-clipped=0.0 2023-10-11 22:44:27,129 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.77 vs. limit=15.0 2023-10-11 22:44:28,842 INFO [train.py:1031] (3/4) Epoch 14, batch 6000, loss[loss=0.198, simple_loss=0.286, pruned_loss=0.05497, over 16934.00 frames. ], tot_loss[loss=0.198, simple_loss=0.2873, pruned_loss=0.05434, over 31178185.90 frames. 
], batch size: 77, lr: 2.52e-03, grad_scale: 32.0 2023-10-11 22:44:32,630 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_ff3.min_abs, batch_count=856408.0, ans=0.2 2023-10-11 22:44:55,339 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=856501.3333333334, ans=0.125 2023-10-11 22:45:00,469 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=856501.3333333334, ans=0.125 2023-10-11 22:45:00,517 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=856501.3333333334, ans=0.1 2023-10-11 22:45:08,539 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=856548.0, ans=0.2 2023-10-11 22:45:41,183 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.435e+02 1.770e+02 1.974e+02 2.183e+02 2.812e+02, threshold=3.948e+02, percent-clipped=0.0 2023-10-11 22:45:42,594 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.68 vs. limit=22.5 2023-10-11 22:45:46,593 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=856688.0, ans=0.125 2023-10-11 22:46:01,594 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.24 vs. limit=22.5 2023-10-11 22:46:07,829 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 22:46:09,683 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=856828.0, ans=0.0 2023-10-11 22:46:10,610 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=856828.0, ans=0.125 2023-10-11 22:46:10,657 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=856828.0, ans=0.125 2023-10-11 22:46:10,947 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.77 vs. 
limit=22.5 2023-10-11 22:46:16,721 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=856828.0, ans=0.125 2023-10-11 22:46:25,072 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=856874.6666666666, ans=0.125 2023-10-11 22:46:28,974 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=856874.6666666666, ans=0.0 2023-10-11 22:46:37,319 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=856921.3333333334, ans=0.2 2023-10-11 22:46:41,245 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=856921.3333333334, ans=0.025 2023-10-11 22:47:15,073 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=857061.3333333334, ans=0.125 2023-10-11 22:47:23,176 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.40 vs. limit=6.0 2023-10-11 22:47:35,875 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.434e+02 1.775e+02 1.930e+02 2.246e+02 3.746e+02, threshold=3.859e+02, percent-clipped=0.0 2023-10-11 22:47:43,224 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=857201.3333333334, ans=0.125 2023-10-11 22:47:56,507 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=857248.0, ans=0.125 2023-10-11 22:48:03,713 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=857248.0, ans=0.09899494936611666 2023-10-11 22:48:23,104 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.38 vs. limit=15.0 2023-10-11 22:48:34,277 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=857388.0, ans=0.125 2023-10-11 22:48:40,442 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=857434.6666666666, ans=0.125 2023-10-11 22:49:15,944 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=857574.6666666666, ans=0.2 2023-10-11 22:49:17,355 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.19 vs. limit=22.5 2023-10-11 22:49:24,648 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=857621.3333333334, ans=0.125 2023-10-11 22:49:25,188 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.442e+02 1.674e+02 1.965e+02 2.226e+02 3.549e+02, threshold=3.930e+02, percent-clipped=0.0 2023-10-11 22:49:43,686 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=857714.6666666666, ans=0.1 2023-10-11 22:49:49,172 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.95 vs. 
limit=6.0 2023-10-11 22:49:51,662 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.21 vs. limit=15.0 2023-10-11 22:49:53,636 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=857714.6666666666, ans=0.0 2023-10-11 22:50:01,307 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=857761.3333333334, ans=0.0 2023-10-11 22:50:05,281 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=857808.0, ans=0.125 2023-10-11 22:50:12,342 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=857808.0, ans=0.125 2023-10-11 22:50:12,749 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.02 vs. limit=15.0 2023-10-11 22:50:21,588 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=857854.6666666666, ans=0.125 2023-10-11 22:50:26,238 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 22:50:28,168 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 22:50:40,882 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=857901.3333333334, ans=0.1 2023-10-11 22:50:57,948 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=857994.6666666666, ans=0.125 2023-10-11 22:51:16,856 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=858041.3333333334, ans=0.125 2023-10-11 22:51:19,835 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=858088.0, ans=0.125 2023-10-11 22:51:20,765 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=858088.0, ans=0.0 2023-10-11 22:51:25,169 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.441e+02 1.687e+02 1.887e+02 2.192e+02 4.312e+02, threshold=3.773e+02, percent-clipped=1.0 2023-10-11 22:51:47,151 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.29 vs. limit=15.0 2023-10-11 22:51:59,578 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.86 vs. 
limit=12.0 2023-10-11 22:52:39,588 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=858368.0, ans=0.2 2023-10-11 22:53:18,301 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=858554.6666666666, ans=0.1 2023-10-11 22:53:20,210 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=858554.6666666666, ans=0.0 2023-10-11 22:53:21,092 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=858554.6666666666, ans=0.2 2023-10-11 22:53:21,139 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=858554.6666666666, ans=0.1 2023-10-11 22:53:21,843 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.351e+02 1.654e+02 1.823e+02 2.129e+02 3.051e+02, threshold=3.646e+02, percent-clipped=0.0 2023-10-11 22:53:25,664 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=858554.6666666666, ans=0.125 2023-10-11 22:53:41,016 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.63 vs. limit=10.0 2023-10-11 22:53:54,021 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=858694.6666666666, ans=0.125 2023-10-11 22:54:03,199 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=858694.6666666666, ans=0.2 2023-10-11 22:54:04,755 INFO [train.py:1031] (3/4) Epoch 14, batch 6500, loss[loss=0.1889, simple_loss=0.2851, pruned_loss=0.04635, over 16887.00 frames. ], tot_loss[loss=0.1985, simple_loss=0.2878, pruned_loss=0.05458, over 31519231.46 frames. ], batch size: 87, lr: 2.51e-03, grad_scale: 32.0 2023-10-11 22:54:22,067 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=858788.0, ans=0.125 2023-10-11 22:54:50,449 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=858881.3333333334, ans=0.2 2023-10-11 22:54:52,625 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=858881.3333333334, ans=0.1 2023-10-11 22:55:06,106 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.32 vs. limit=15.0 2023-10-11 22:55:07,279 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.76 vs. 
limit=22.5 2023-10-11 22:55:21,495 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=859021.3333333334, ans=0.0 2023-10-11 22:55:27,402 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.480e+02 1.773e+02 1.948e+02 2.113e+02 3.177e+02, threshold=3.896e+02, percent-clipped=0.0 2023-10-11 22:55:27,672 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=859021.3333333334, ans=0.0 2023-10-11 22:55:34,455 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=859068.0, ans=0.125 2023-10-11 22:56:09,517 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.57 vs. limit=15.0 2023-10-11 22:56:18,264 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=859254.6666666666, ans=0.125 2023-10-11 22:56:22,021 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.79 vs. limit=15.0 2023-10-11 22:56:45,270 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 22:56:45,453 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=859348.0, ans=6.0 2023-10-11 22:56:50,000 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=859394.6666666666, ans=0.125 2023-10-11 22:56:52,128 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=859394.6666666666, ans=0.0 2023-10-11 22:57:05,120 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 22:57:08,949 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=859488.0, ans=0.125 2023-10-11 22:57:10,796 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=859488.0, ans=0.125 2023-10-11 22:57:14,786 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.436e+02 1.748e+02 1.938e+02 2.197e+02 3.048e+02, threshold=3.875e+02, percent-clipped=0.0 2023-10-11 22:57:35,770 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=859581.3333333334, ans=0.0 2023-10-11 22:57:37,495 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=859581.3333333334, ans=0.07 2023-10-11 22:57:52,220 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=859674.6666666666, ans=0.125 2023-10-11 22:58:01,108 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=859721.3333333334, ans=0.09899494936611666 2023-10-11 22:58:23,590 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=859814.6666666666, ans=0.125 2023-10-11 22:58:38,134 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=859861.3333333334, ans=0.125 2023-10-11 22:58:43,009 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=859861.3333333334, ans=0.125 2023-10-11 22:59:06,000 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.49 vs. limit=6.0 2023-10-11 22:59:06,427 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.444e+02 1.739e+02 1.893e+02 2.190e+02 2.880e+02, threshold=3.785e+02, percent-clipped=0.0 2023-10-11 22:59:06,769 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=859954.6666666666, ans=0.1 2023-10-11 22:59:09,776 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=859954.6666666666, ans=0.1 2023-10-11 22:59:27,517 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.59 vs. limit=12.0 2023-10-11 22:59:36,684 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=860094.6666666666, ans=0.0 2023-10-11 22:59:36,981 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=860094.6666666666, ans=15.0 2023-10-11 22:59:41,040 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=860094.6666666666, ans=0.125 2023-10-11 22:59:50,767 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=860141.3333333334, ans=0.125 2023-10-11 22:59:52,202 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=5.77 vs. limit=12.0 2023-10-11 22:59:55,558 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=860141.3333333334, ans=0.1 2023-10-11 23:00:27,706 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 23:00:32,888 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=860234.6666666666, ans=0.5 2023-10-11 23:00:57,743 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.30 vs. 
limit=22.5 2023-10-11 23:01:11,555 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=860421.3333333334, ans=0.125 2023-10-11 23:01:18,797 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.318e+02 1.650e+02 1.822e+02 2.121e+02 2.809e+02, threshold=3.644e+02, percent-clipped=0.0 2023-10-11 23:01:24,631 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=860468.0, ans=0.0 2023-10-11 23:01:37,307 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=860514.6666666666, ans=0.125 2023-10-11 23:01:52,076 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=860561.3333333334, ans=0.0 2023-10-11 23:02:08,245 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=860608.0, ans=0.0 2023-10-11 23:02:15,107 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=860654.6666666666, ans=0.125 2023-10-11 23:02:59,865 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=860841.3333333334, ans=0.125 2023-10-11 23:03:12,173 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.370e+02 1.732e+02 1.858e+02 2.109e+02 3.102e+02, threshold=3.716e+02, percent-clipped=0.0 2023-10-11 23:03:35,633 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=860981.3333333334, ans=0.5 2023-10-11 23:03:48,402 INFO [train.py:1031] (3/4) Epoch 14, batch 7000, loss[loss=0.1947, simple_loss=0.2792, pruned_loss=0.05508, over 15481.00 frames. ], tot_loss[loss=0.1983, simple_loss=0.2881, pruned_loss=0.05424, over 31821047.73 frames. ], batch size: 35, lr: 2.51e-03, grad_scale: 16.0 2023-10-11 23:03:49,154 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.62 vs. limit=15.0 2023-10-11 23:04:16,683 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=861168.0, ans=0.0 2023-10-11 23:04:22,391 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=861168.0, ans=0.125 2023-10-11 23:04:42,803 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.67 vs. limit=8.0 2023-10-11 23:04:46,619 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.21 vs. limit=15.0 2023-10-11 23:04:47,348 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=861308.0, ans=0.1 2023-10-11 23:04:50,182 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=861308.0, ans=0.125 2023-10-11 23:04:56,437 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.49 vs. 
limit=15.0 2023-10-11 23:05:02,449 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=861354.6666666666, ans=0.125 2023-10-11 23:05:05,179 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.515e+02 1.762e+02 1.897e+02 2.141e+02 3.248e+02, threshold=3.795e+02, percent-clipped=0.0 2023-10-11 23:05:05,521 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=861354.6666666666, ans=0.1 2023-10-11 23:05:09,906 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=861401.3333333334, ans=0.125 2023-10-11 23:05:11,051 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=861401.3333333334, ans=0.0 2023-10-11 23:05:25,429 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=861448.0, ans=0.125 2023-10-11 23:05:42,255 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.67 vs. limit=22.5 2023-10-11 23:05:53,686 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=861588.0, ans=0.2 2023-10-11 23:05:54,795 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=861588.0, ans=0.2 2023-10-11 23:06:13,123 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=861634.6666666666, ans=0.125 2023-10-11 23:06:22,531 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=861681.3333333334, ans=0.0 2023-10-11 23:06:34,816 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 23:06:58,456 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.446e+02 1.724e+02 1.849e+02 2.098e+02 3.047e+02, threshold=3.698e+02, percent-clipped=0.0 2023-10-11 23:07:07,413 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=861868.0, ans=0.2 2023-10-11 23:07:11,091 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=861868.0, ans=0.125 2023-10-11 23:07:13,960 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.87 vs. limit=15.0 2023-10-11 23:07:14,635 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=861914.6666666666, ans=0.125 2023-10-11 23:07:23,256 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=861961.3333333334, ans=0.125 2023-10-11 23:07:54,007 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=862054.6666666666, ans=0.125 2023-10-11 23:07:55,105 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.23 vs. 
limit=15.0 2023-10-11 23:08:13,123 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=862101.3333333334, ans=0.0 2023-10-11 23:08:13,143 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=862101.3333333334, ans=0.125 2023-10-11 23:08:13,363 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.46 vs. limit=22.5 2023-10-11 23:08:39,951 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 23:08:47,036 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=862241.3333333334, ans=0.0 2023-10-11 23:08:53,291 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=862241.3333333334, ans=0.2 2023-10-11 23:08:54,143 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=862241.3333333334, ans=0.125 2023-10-11 23:09:06,531 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.387e+02 1.684e+02 1.859e+02 2.216e+02 3.139e+02, threshold=3.718e+02, percent-clipped=0.0 2023-10-11 23:09:19,190 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=862334.6666666666, ans=0.025 2023-10-11 23:09:41,965 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.18 vs. limit=15.0 2023-10-11 23:09:46,917 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=862474.6666666666, ans=0.1 2023-10-11 23:10:01,019 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=862521.3333333334, ans=0.125 2023-10-11 23:10:02,835 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=862521.3333333334, ans=0.0 2023-10-11 23:10:03,719 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=862521.3333333334, ans=0.1 2023-10-11 23:10:17,652 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=862568.0, ans=0.1 2023-10-11 23:10:22,186 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=862614.6666666666, ans=0.1 2023-10-11 23:10:27,824 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=862614.6666666666, ans=0.2 2023-10-11 23:10:33,376 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.27 vs. 
limit=15.0 2023-10-11 23:10:34,304 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=862661.3333333334, ans=0.125 2023-10-11 23:10:37,711 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=862661.3333333334, ans=0.125 2023-10-11 23:10:39,783 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.39 vs. limit=10.0 2023-10-11 23:10:48,878 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=862708.0, ans=0.0 2023-10-11 23:10:58,459 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.09 vs. limit=12.0 2023-10-11 23:11:06,195 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.400e+02 1.737e+02 1.914e+02 2.173e+02 3.091e+02, threshold=3.829e+02, percent-clipped=0.0 2023-10-11 23:11:19,901 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=862801.3333333334, ans=0.0 2023-10-11 23:11:30,389 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.10 vs. limit=15.0 2023-10-11 23:11:44,570 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-11 23:11:46,607 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=862941.3333333334, ans=0.0 2023-10-11 23:11:54,162 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=862988.0, ans=0.125 2023-10-11 23:12:01,444 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=862988.0, ans=15.0 2023-10-11 23:12:02,896 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=862988.0, ans=0.0 2023-10-11 23:12:05,299 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=863034.6666666666, ans=0.125 2023-10-11 23:12:25,243 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.27 vs. 
limit=15.0 2023-10-11 23:12:26,604 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=863128.0, ans=0.0 2023-10-11 23:12:34,536 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=863128.0, ans=10.0 2023-10-11 23:12:35,231 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=863128.0, ans=0.125 2023-10-11 23:12:41,818 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=863174.6666666666, ans=0.1 2023-10-11 23:12:53,304 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.330e+02 1.835e+02 2.087e+02 2.441e+02 3.560e+02, threshold=4.173e+02, percent-clipped=0.0 2023-10-11 23:12:54,568 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=863221.3333333334, ans=0.1 2023-10-11 23:13:14,911 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=863314.6666666666, ans=0.125 2023-10-11 23:13:19,901 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=863361.3333333334, ans=0.1 2023-10-11 23:13:29,997 INFO [train.py:1031] (3/4) Epoch 14, batch 7500, loss[loss=0.218, simple_loss=0.3051, pruned_loss=0.06544, over 16893.00 frames. ], tot_loss[loss=0.1984, simple_loss=0.288, pruned_loss=0.05443, over 32011579.98 frames. ], batch size: 110, lr: 2.51e-03, grad_scale: 32.0 2023-10-11 23:13:32,543 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.11 vs. limit=6.0 2023-10-11 23:13:51,709 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=863501.3333333334, ans=0.1 2023-10-11 23:14:10,185 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.54 vs. limit=15.0 2023-10-11 23:14:46,409 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.457e+02 1.716e+02 1.898e+02 2.028e+02 2.979e+02, threshold=3.796e+02, percent-clipped=0.0 2023-10-11 23:14:46,750 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=863688.0, ans=0.1 2023-10-11 23:15:04,469 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.81 vs. limit=15.0 2023-10-11 23:15:09,571 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.15 vs. 
limit=12.0 2023-10-11 23:15:15,441 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=863828.0, ans=0.04949747468305833 2023-10-11 23:15:34,726 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=863921.3333333334, ans=0.0 2023-10-11 23:16:00,886 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=864014.6666666666, ans=0.125 2023-10-11 23:16:45,637 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=864154.6666666666, ans=0.0 2023-10-11 23:16:51,033 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.397e+02 1.672e+02 1.872e+02 2.094e+02 2.810e+02, threshold=3.744e+02, percent-clipped=0.0 2023-10-11 23:17:37,174 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.96 vs. limit=10.0 2023-10-11 23:17:51,219 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=864434.6666666666, ans=0.1 2023-10-11 23:18:24,883 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=864574.6666666666, ans=0.125 2023-10-11 23:18:37,322 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.81 vs. limit=15.0 2023-10-11 23:18:40,473 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=864621.3333333334, ans=0.2 2023-10-11 23:18:42,957 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.401e+02 1.681e+02 1.808e+02 1.986e+02 2.443e+02, threshold=3.616e+02, percent-clipped=0.0 2023-10-11 23:18:43,297 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=864621.3333333334, ans=0.125 2023-10-11 23:19:11,451 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=864761.3333333334, ans=0.0 2023-10-11 23:19:20,852 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=864808.0, ans=0.0 2023-10-11 23:19:43,543 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=864901.3333333334, ans=0.0 2023-10-11 23:19:46,586 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=864901.3333333334, ans=0.125 2023-10-11 23:19:48,166 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.17 vs. limit=10.0 2023-10-11 23:20:02,972 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=864948.0, ans=0.2 2023-10-11 23:20:03,196 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.12 vs. 
limit=6.0 2023-10-11 23:20:07,648 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=864994.6666666666, ans=0.0 2023-10-11 23:20:09,837 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 23:20:36,498 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=865088.0, ans=0.125 2023-10-11 23:20:37,197 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.419e+02 1.702e+02 1.899e+02 2.123e+02 2.680e+02, threshold=3.797e+02, percent-clipped=0.0 2023-10-11 23:20:50,005 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=865134.6666666666, ans=0.125 2023-10-11 23:20:58,175 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.78 vs. limit=5.0 2023-10-11 23:21:12,773 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=865274.6666666666, ans=0.2 2023-10-11 23:21:35,470 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.33 vs. limit=15.0 2023-10-11 23:21:40,509 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 23:22:02,166 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=865461.3333333334, ans=15.0 2023-10-11 23:22:31,237 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.07 vs. limit=15.0 2023-10-11 23:22:32,611 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.344e+02 1.647e+02 1.783e+02 2.028e+02 3.925e+02, threshold=3.565e+02, percent-clipped=1.0 2023-10-11 23:22:46,135 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=865601.3333333334, ans=0.1 2023-10-11 23:22:49,151 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=865648.0, ans=0.125 2023-10-11 23:23:10,387 INFO [train.py:1031] (3/4) Epoch 14, batch 8000, loss[loss=0.1752, simple_loss=0.2684, pruned_loss=0.04097, over 16405.00 frames. ], tot_loss[loss=0.1976, simple_loss=0.2873, pruned_loss=0.05389, over 32173212.32 frames. 
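The train.py progress entries above report per-frame values for loss, simple_loss, and pruned_loss. These numbers are consistent with the pruned-transducer convention that, once past warm-up, loss = simple_loss_scale * simple_loss + pruned_loss, with a simple-loss scale of 0.5 and a pruned-loss scale that has ramped to 1.0; the 0.5 weight is inferred from the logged numbers themselves, not stated in these lines. A quick check against the batch 8000 entry:

```python
# Sanity check of how the logged `loss` decomposes, assuming the usual
# pruned-transducer weighting (0.5 * simple_loss + 1.0 * pruned_loss after
# warm-up); the numbers are the batch 8000 values from the entry above.
simple_loss, pruned_loss = 0.2684, 0.04097
loss = 0.5 * simple_loss + 1.0 * pruned_loss
print(f"{loss:.4f}")  # -> 0.1752, matching loss[loss=0.1752, ...]
```

tot_loss appears to be the corresponding running average, weighted by the frame counts printed after each value, which is why it moves slowly relative to the per-batch loss.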
], batch size: 50, lr: 2.50e-03, grad_scale: 32.0 2023-10-11 23:23:54,014 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=865928.0, ans=0.0 2023-10-11 23:24:12,561 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=866021.3333333334, ans=0.2 2023-10-11 23:24:13,780 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=866021.3333333334, ans=0.1 2023-10-11 23:24:21,482 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.363e+02 1.599e+02 1.762e+02 1.959e+02 2.488e+02, threshold=3.525e+02, percent-clipped=0.0 2023-10-11 23:24:42,543 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten.whitening_limit, batch_count=866114.6666666666, ans=15.0 2023-10-11 23:24:44,129 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=866161.3333333334, ans=10.0 2023-10-11 23:24:44,138 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=866161.3333333334, ans=0.2 2023-10-11 23:24:56,127 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=866208.0, ans=0.125 2023-10-11 23:25:27,176 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.42 vs. limit=15.0 2023-10-11 23:25:28,993 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=866348.0, ans=0.1 2023-10-11 23:25:38,439 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=866394.6666666666, ans=0.125 2023-10-11 23:25:58,979 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=866441.3333333334, ans=0.125 2023-10-11 23:26:07,487 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=866441.3333333334, ans=0.05 2023-10-11 23:26:14,080 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=866488.0, ans=0.1 2023-10-11 23:26:16,010 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=10.42 vs. limit=22.5 2023-10-11 23:26:23,523 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.409e+02 1.684e+02 1.869e+02 2.126e+02 2.927e+02, threshold=3.738e+02, percent-clipped=0.0 2023-10-11 23:26:48,737 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=866581.3333333334, ans=10.0 2023-10-11 23:26:49,614 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=866581.3333333334, ans=0.1 2023-10-11 23:27:01,665 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.20 vs. 
limit=10.0 2023-10-11 23:27:12,194 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.08 vs. limit=15.0 2023-10-11 23:27:57,681 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=866861.3333333334, ans=0.1 2023-10-11 23:28:06,082 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=866908.0, ans=0.0 2023-10-11 23:28:11,553 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=866908.0, ans=0.1 2023-10-11 23:28:24,116 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.464e+02 1.711e+02 1.920e+02 2.196e+02 3.511e+02, threshold=3.839e+02, percent-clipped=0.0 2023-10-11 23:28:27,575 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=867001.3333333334, ans=0.0 2023-10-11 23:28:29,228 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=867001.3333333334, ans=0.0 2023-10-11 23:28:45,335 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=867048.0, ans=0.0 2023-10-11 23:28:50,402 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=867094.6666666666, ans=0.0 2023-10-11 23:29:03,160 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=867141.3333333334, ans=0.125 2023-10-11 23:29:14,807 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=867188.0, ans=0.1 2023-10-11 23:29:26,471 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=867234.6666666666, ans=0.0 2023-10-11 23:29:42,879 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.96 vs. 
limit=10.0 2023-10-11 23:29:47,112 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=867281.3333333334, ans=0.125 2023-10-11 23:30:09,538 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=867374.6666666666, ans=0.0 2023-10-11 23:30:16,356 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=867421.3333333334, ans=0.2 2023-10-11 23:30:21,514 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.443e+02 1.770e+02 1.865e+02 2.101e+02 2.625e+02, threshold=3.730e+02, percent-clipped=0.0 2023-10-11 23:30:55,653 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=867608.0, ans=0.125 2023-10-11 23:31:21,990 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=867654.6666666666, ans=0.125 2023-10-11 23:31:21,992 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=867654.6666666666, ans=0.0 2023-10-11 23:31:41,590 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=867748.0, ans=0.07 2023-10-11 23:32:17,819 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.295e+02 1.731e+02 1.895e+02 2.142e+02 2.955e+02, threshold=3.791e+02, percent-clipped=0.0 2023-10-11 23:32:31,398 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=867981.3333333334, ans=0.125 2023-10-11 23:32:33,229 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=867981.3333333334, ans=0.125 2023-10-11 23:32:39,622 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 23:32:51,166 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=868028.0, ans=0.0 2023-10-11 23:32:52,653 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=868028.0, ans=0.0 2023-10-11 23:33:00,271 INFO [train.py:1031] (3/4) Epoch 14, batch 8500, loss[loss=0.1997, simple_loss=0.2951, pruned_loss=0.0521, over 16900.00 frames. ], tot_loss[loss=0.1978, simple_loss=0.2877, pruned_loss=0.05389, over 32308255.55 frames. 
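Each optim.py line gives a five-number summary (min, 25th, 50th, 75th percentile, max) of recent gradient norms, together with the clipping threshold and the percentage of updates clipped. Throughout this section the threshold tracks twice the median — e.g. 2.0 × 1.895e+02 ≈ 3.791e+02 in the entry just above — i.e. threshold = Clipping_scale × median. A minimal sketch of that bookkeeping (the function name is ours, not icefall's):

```python
import torch

def clipping_stats(grad_norms: torch.Tensor, clipping_scale: float = 2.0):
    """Reconstruct the optim.py summary from a window of recent grad norms:
    five-number quartiles, a threshold of clipping_scale * median, and the
    percentage of norms that exceeded the threshold. Sketch only, assuming
    the threshold rule observed in the log."""
    q = torch.quantile(grad_norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    threshold = clipping_scale * q[2]                      # 2.0 * median
    percent_clipped = 100.0 * (grad_norms > threshold).float().mean()
    return q, threshold, percent_clipped
```

Basing the threshold on a rolling median rather than a fixed constant lets the clip level adapt as gradient magnitudes drift over training, which matches how the logged threshold moves with the quartiles here.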
], batch size: 87, lr: 2.50e-03, grad_scale: 32.0 2023-10-11 23:33:09,688 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=868121.3333333334, ans=0.125 2023-10-11 23:33:09,770 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=868121.3333333334, ans=0.1 2023-10-11 23:33:12,206 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=868121.3333333334, ans=0.0 2023-10-11 23:33:24,031 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=868168.0, ans=0.2 2023-10-11 23:33:41,821 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=868214.6666666666, ans=0.0 2023-10-11 23:33:49,149 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=868261.3333333334, ans=0.1 2023-10-11 23:34:01,283 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 23:34:14,545 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=868354.6666666666, ans=0.125 2023-10-11 23:34:17,131 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=868354.6666666666, ans=0.125 2023-10-11 23:34:18,290 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.452e+02 1.875e+02 2.114e+02 2.375e+02 3.607e+02, threshold=4.227e+02, percent-clipped=0.0 2023-10-11 23:34:23,631 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=868401.3333333334, ans=0.0 2023-10-11 23:34:26,746 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=868401.3333333334, ans=0.125 2023-10-11 23:34:31,898 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=868448.0, ans=0.0 2023-10-11 23:34:38,197 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=868448.0, ans=0.125 2023-10-11 23:34:42,626 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=868494.6666666666, ans=0.125 2023-10-11 23:34:52,283 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=868494.6666666666, ans=0.0 2023-10-11 23:35:15,567 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=868588.0, ans=0.0 2023-10-11 23:35:20,943 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=868588.0, ans=0.125 2023-10-11 23:35:57,107 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=868728.0, ans=0.125 2023-10-11 23:35:59,613 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=868774.6666666666, ans=0.125 2023-10-11 23:36:21,439 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.269e+02 1.688e+02 1.873e+02 
2.106e+02 2.921e+02, threshold=3.745e+02, percent-clipped=0.0 2023-10-11 23:37:10,745 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=869008.0, ans=0.2 2023-10-11 23:37:12,857 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=869008.0, ans=0.125 2023-10-11 23:37:15,911 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=869054.6666666666, ans=0.125 2023-10-11 23:37:18,419 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=869054.6666666666, ans=0.2 2023-10-11 23:37:39,330 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=869148.0, ans=0.0 2023-10-11 23:37:48,403 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.73 vs. limit=22.5 2023-10-11 23:37:50,093 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=869194.6666666666, ans=0.125 2023-10-11 23:37:58,971 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=869194.6666666666, ans=0.0 2023-10-11 23:38:07,058 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=869241.3333333334, ans=0.0 2023-10-11 23:38:11,514 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=869241.3333333334, ans=0.09899494936611666 2023-10-11 23:38:26,285 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=869288.0, ans=0.1 2023-10-11 23:38:30,631 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.396e+02 1.652e+02 1.746e+02 1.972e+02 3.700e+02, threshold=3.493e+02, percent-clipped=0.0 2023-10-11 23:38:33,944 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.23 vs. limit=6.0 2023-10-11 23:38:45,682 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=869381.3333333334, ans=0.5 2023-10-11 23:38:55,088 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=869428.0, ans=0.125 2023-10-11 23:38:58,925 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=869428.0, ans=0.125 2023-10-11 23:38:59,029 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=869428.0, ans=0.1 2023-10-11 23:39:08,126 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=869474.6666666666, ans=0.125 2023-10-11 23:39:09,119 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.36 vs. limit=22.5 2023-10-11 23:39:10,178 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.79 vs. 
limit=15.0 2023-10-11 23:39:42,740 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=869614.6666666666, ans=0.0 2023-10-11 23:39:43,087 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.64 vs. limit=15.0 2023-10-11 23:40:02,772 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.43 vs. limit=22.5 2023-10-11 23:40:03,921 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.79 vs. limit=10.0 2023-10-11 23:40:20,480 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.355e+02 1.683e+02 1.845e+02 2.150e+02 3.471e+02, threshold=3.690e+02, percent-clipped=0.0 2023-10-11 23:40:32,422 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=869848.0, ans=0.125 2023-10-11 23:40:48,506 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=869894.6666666666, ans=0.125 2023-10-11 23:41:11,186 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=869988.0, ans=0.1 2023-10-11 23:41:18,615 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=870034.6666666666, ans=0.2 2023-10-11 23:41:24,552 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.79 vs. limit=15.0 2023-10-11 23:41:26,272 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=870034.6666666666, ans=0.125 2023-10-11 23:41:30,698 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=870081.3333333334, ans=0.1 2023-10-11 23:41:39,998 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=870128.0, ans=0.125 2023-10-11 23:41:44,791 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=870128.0, ans=0.2 2023-10-11 23:41:46,181 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=870128.0, ans=0.2 2023-10-11 23:42:12,397 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.429e+02 1.734e+02 1.885e+02 2.189e+02 3.141e+02, threshold=3.770e+02, percent-clipped=0.0 2023-10-11 23:42:33,533 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=870314.6666666666, ans=0.0 2023-10-11 23:42:48,019 INFO [train.py:1031] (3/4) Epoch 14, batch 9000, loss[loss=0.2173, simple_loss=0.3122, pruned_loss=0.06123, over 16673.00 frames. ], tot_loss[loss=0.1969, simple_loss=0.2869, pruned_loss=0.05342, over 32433412.13 frames. ], batch size: 241, lr: 2.50e-03, grad_scale: 32.0 2023-10-11 23:42:52,087 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=870408.0, ans=0.125 2023-10-11 23:42:56,602 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.59 vs. 
limit=12.0 2023-10-11 23:42:58,789 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer_ff2.min_abs, batch_count=870454.6666666666, ans=0.1 2023-10-11 23:43:08,070 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.00 vs. limit=22.5 2023-10-11 23:43:15,247 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=870501.3333333334, ans=0.04949747468305833 2023-10-11 23:43:42,297 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=870641.3333333334, ans=0.125 2023-10-11 23:43:49,473 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=3.79 vs. limit=12.0 2023-10-11 23:43:59,811 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.312e+02 1.687e+02 1.863e+02 2.077e+02 3.074e+02, threshold=3.726e+02, percent-clipped=0.0 2023-10-11 23:44:21,374 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=870781.3333333334, ans=0.1 2023-10-11 23:44:24,849 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=870828.0, ans=0.125 2023-10-11 23:44:29,711 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.18 vs. limit=10.0 2023-10-11 23:44:45,058 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=870921.3333333334, ans=0.2 2023-10-11 23:44:46,953 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=870921.3333333334, ans=0.125 2023-10-11 23:44:49,369 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=870921.3333333334, ans=0.0 2023-10-11 23:44:50,630 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=870921.3333333334, ans=0.2 2023-10-11 23:45:11,766 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=871014.6666666666, ans=0.025 2023-10-11 23:45:24,771 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=871108.0, ans=0.1 2023-10-11 23:45:31,593 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=871108.0, ans=0.5 2023-10-11 23:45:34,840 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.98 vs. 
limit=22.5 2023-10-11 23:45:44,983 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.417e+02 1.756e+02 2.016e+02 2.238e+02 3.481e+02, threshold=4.032e+02, percent-clipped=0.0 2023-10-11 23:46:13,249 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=871294.6666666666, ans=0.1 2023-10-11 23:46:38,495 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=871388.0, ans=0.0 2023-10-11 23:46:41,109 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=871434.6666666666, ans=0.2 2023-10-11 23:46:41,129 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=871434.6666666666, ans=0.125 2023-10-11 23:47:05,904 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.06 vs. limit=22.5 2023-10-11 23:47:15,899 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.61 vs. limit=15.0 2023-10-11 23:47:19,029 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=871574.6666666666, ans=0.125 2023-10-11 23:47:19,906 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=871574.6666666666, ans=0.1 2023-10-11 23:47:23,530 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=871621.3333333334, ans=0.1 2023-10-11 23:47:31,374 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.369e+02 1.722e+02 1.893e+02 2.157e+02 3.696e+02, threshold=3.785e+02, percent-clipped=0.0 2023-10-11 23:47:43,057 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=871668.0, ans=0.125 2023-10-11 23:47:49,948 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=871714.6666666666, ans=0.0 2023-10-11 23:47:51,652 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-11 23:47:57,039 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=871761.3333333334, ans=0.125 2023-10-11 23:48:30,164 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=871901.3333333334, ans=0.125 2023-10-11 23:49:04,486 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=871994.6666666666, ans=0.125 2023-10-11 23:49:11,607 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=872041.3333333334, ans=0.125 2023-10-11 23:49:11,673 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=872041.3333333334, ans=0.1 2023-10-11 23:49:32,345 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.476e+02 1.768e+02 1.901e+02 2.294e+02 3.356e+02, threshold=3.802e+02, percent-clipped=0.0 2023-10-11 23:49:39,854 INFO 
[scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=872134.6666666666, ans=0.1 2023-10-11 23:49:48,871 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=872181.3333333334, ans=0.125 2023-10-11 23:49:53,627 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=872181.3333333334, ans=0.0 2023-10-11 23:49:57,544 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=872181.3333333334, ans=0.1 2023-10-11 23:50:36,326 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-11 23:50:38,178 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.87 vs. limit=15.0 2023-10-11 23:50:43,404 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=872368.0, ans=0.125 2023-10-11 23:50:59,248 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=872461.3333333334, ans=0.1 2023-10-11 23:51:11,700 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=872508.0, ans=0.125 2023-10-11 23:51:26,104 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.01 vs. limit=6.0 2023-10-11 23:51:26,679 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=872554.6666666666, ans=0.125 2023-10-11 23:51:30,840 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.574e+02 1.779e+02 1.994e+02 2.183e+02 2.813e+02, threshold=3.988e+02, percent-clipped=0.0 2023-10-11 23:51:47,597 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 23:51:56,118 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.75 vs. limit=15.0 2023-10-11 23:51:59,945 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=872694.6666666666, ans=10.0 2023-10-11 23:52:03,692 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=872694.6666666666, ans=0.125 2023-10-11 23:52:10,287 INFO [train.py:1031] (3/4) Epoch 14, batch 9500, loss[loss=0.2059, simple_loss=0.3016, pruned_loss=0.05514, over 16560.00 frames. ], tot_loss[loss=0.1982, simple_loss=0.2881, pruned_loss=0.05413, over 32505500.74 frames. 
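The ScheduledFloat lines that dominate this log each print the current value (`ans`) of a scalar hyper-parameter — a skip rate, a dropout probability, a balancer probability, and so on — scheduled as a piecewise-linear function of the training batch count. By this point (batch_count ≈ 8.7e5) every schedule has long since reached its final constant, which is why the same `ans` values (0.125, 0.1, 0.2, 0.0, …) recur unchanged. A minimal sketch of such a schedule, with illustrative breakpoints rather than the recipe's actual ones:

```python
# Piecewise-linear schedule in the spirit of the ScheduledFloat values
# logged above (the real icefall class carries more machinery); the
# (batch_count, value) breakpoints here are illustrative.
def scheduled_float(batch_count: float, points=((0.0, 0.3), (20000.0, 0.1))):
    x0, y0 = points[0]
    if batch_count <= x0:
        return y0
    for x1, y1 in points[1:]:
        if batch_count <= x1:
            return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)
        x0, y0 = x1, y1
    return y0  # constant after the last breakpoint

# Far into training (batch_count ~ 8.7e5) the schedule is flat:
print(scheduled_float(872_000.0))  # -> 0.1
```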
], batch size: 267, lr: 2.49e-03, grad_scale: 32.0 2023-10-11 23:52:28,607 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=872788.0, ans=0.0 2023-10-11 23:52:33,441 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=872834.6666666666, ans=0.025 2023-10-11 23:52:34,321 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=872834.6666666666, ans=0.1 2023-10-11 23:52:45,595 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.23 vs. limit=15.0 2023-10-11 23:52:46,311 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.36 vs. limit=15.0 2023-10-11 23:52:52,297 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.03 vs. limit=15.0 2023-10-11 23:53:01,237 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=872928.0, ans=0.0 2023-10-11 23:53:15,790 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=872974.6666666666, ans=0.125 2023-10-11 23:53:18,447 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.68 vs. limit=6.0 2023-10-11 23:53:24,079 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=873021.3333333334, ans=0.1 2023-10-11 23:53:27,686 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.435e+02 1.737e+02 1.883e+02 2.252e+02 3.994e+02, threshold=3.766e+02, percent-clipped=1.0 2023-10-11 23:53:37,803 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=873068.0, ans=0.05 2023-10-11 23:54:14,085 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=873254.6666666666, ans=0.125 2023-10-11 23:54:27,499 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.07 vs. 
limit=8.0 2023-10-11 23:54:30,683 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=873301.3333333334, ans=0.125 2023-10-11 23:54:50,934 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=873394.6666666666, ans=0.0 2023-10-11 23:55:07,036 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=873441.3333333334, ans=0.125 2023-10-11 23:55:09,031 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=873488.0, ans=0.125 2023-10-11 23:55:17,416 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=873488.0, ans=0.125 2023-10-11 23:55:20,098 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.384e+02 1.754e+02 1.923e+02 2.256e+02 2.725e+02, threshold=3.845e+02, percent-clipped=0.0 2023-10-11 23:55:40,792 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=873581.3333333334, ans=0.2 2023-10-11 23:55:55,665 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=873628.0, ans=0.125 2023-10-11 23:56:00,948 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=873674.6666666666, ans=0.0 2023-10-11 23:56:17,131 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=873721.3333333334, ans=0.125 2023-10-11 23:56:18,502 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=873721.3333333334, ans=0.125 2023-10-11 23:56:24,400 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.98 vs. limit=22.5 2023-10-11 23:56:29,446 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.98 vs. limit=12.0 2023-10-11 23:56:33,020 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_positive, batch_count=873814.6666666666, ans=0.05 2023-10-11 23:56:42,907 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=873861.3333333334, ans=0.2 2023-10-11 23:57:02,702 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=873908.0, ans=0.5 2023-10-11 23:57:02,812 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.10 vs. 
limit=22.5 2023-10-11 23:57:07,096 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=873954.6666666666, ans=0.125 2023-10-11 23:57:08,029 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=873954.6666666666, ans=0.2 2023-10-11 23:57:12,942 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=873954.6666666666, ans=15.0 2023-10-11 23:57:13,270 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.397e+02 1.642e+02 1.759e+02 1.905e+02 2.849e+02, threshold=3.517e+02, percent-clipped=0.0 2023-10-11 23:57:13,717 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=874001.3333333334, ans=0.0 2023-10-11 23:57:24,881 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.36 vs. limit=15.0 2023-10-11 23:57:38,068 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=874048.0, ans=0.2 2023-10-11 23:57:52,818 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=874094.6666666666, ans=0.0 2023-10-11 23:57:52,864 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=874094.6666666666, ans=0.2 2023-10-11 23:58:06,435 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=874188.0, ans=0.125 2023-10-11 23:58:10,840 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=874188.0, ans=0.07 2023-10-11 23:58:21,768 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.64 vs. limit=15.0 2023-10-11 23:58:31,114 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=874281.3333333334, ans=0.125 2023-10-11 23:58:37,031 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=874328.0, ans=0.07 2023-10-11 23:58:39,462 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=874328.0, ans=0.0 2023-10-11 23:59:09,436 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.375e+02 1.630e+02 1.756e+02 1.959e+02 2.658e+02, threshold=3.511e+02, percent-clipped=0.0 2023-10-11 23:59:10,182 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.55 vs. 
limit=10.0 2023-10-11 23:59:11,056 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=874468.0, ans=0.125 2023-10-11 23:59:15,809 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=874468.0, ans=0.2 2023-10-11 23:59:34,303 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=874561.3333333334, ans=0.125 2023-10-11 23:59:54,792 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=874654.6666666666, ans=0.125 2023-10-11 23:59:57,780 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.49 vs. limit=6.0 2023-10-12 00:00:00,389 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 00:00:03,597 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=874654.6666666666, ans=0.1 2023-10-12 00:00:21,551 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_positive, batch_count=874748.0, ans=0.05 2023-10-12 00:00:33,443 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=874794.6666666666, ans=0.125 2023-10-12 00:00:42,748 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.50 vs. limit=22.5 2023-10-12 00:00:45,582 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=874841.3333333334, ans=0.125 2023-10-12 00:01:00,819 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.441e+02 1.658e+02 1.799e+02 1.944e+02 3.303e+02, threshold=3.597e+02, percent-clipped=0.0 2023-10-12 00:01:01,266 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 00:01:23,786 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=875028.0, ans=0.0 2023-10-12 00:01:31,653 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=875074.6666666666, ans=0.0 2023-10-12 00:01:32,354 INFO [train.py:1031] (3/4) Epoch 14, batch 10000, loss[loss=0.2039, simple_loss=0.2898, pruned_loss=0.05895, over 16558.00 frames. ], tot_loss[loss=0.1972, simple_loss=0.287, pruned_loss=0.05369, over 32556956.97 frames. ], batch size: 56, lr: 2.49e-03, grad_scale: 32.0 2023-10-12 00:01:34,400 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=875074.6666666666, ans=0.1 2023-10-12 00:01:40,111 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.68 vs. 
limit=15.0 2023-10-12 00:01:54,885 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=875168.0, ans=0.0 2023-10-12 00:02:00,419 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=875168.0, ans=0.05 2023-10-12 00:02:21,458 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=875261.3333333334, ans=0.1 2023-10-12 00:02:45,228 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.438e+02 1.714e+02 1.868e+02 2.127e+02 2.853e+02, threshold=3.736e+02, percent-clipped=0.0 2023-10-12 00:02:47,375 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=875401.3333333334, ans=0.125 2023-10-12 00:02:49,633 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=875401.3333333334, ans=0.125 2023-10-12 00:02:49,655 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=875401.3333333334, ans=0.0 2023-10-12 00:03:02,233 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.98 vs. limit=15.0 2023-10-12 00:03:40,076 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.57 vs. limit=22.5 2023-10-12 00:03:41,937 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.11 vs. limit=22.5 2023-10-12 00:03:43,466 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=875634.6666666666, ans=0.125 2023-10-12 00:03:58,902 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.87 vs. limit=15.0 2023-10-12 00:04:00,111 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.93 vs. limit=15.0 2023-10-12 00:04:06,901 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=875728.0, ans=0.125 2023-10-12 00:04:07,920 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=875728.0, ans=0.0 2023-10-12 00:04:12,884 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=9.02 vs. 
limit=22.5 2023-10-12 00:04:38,147 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.312e+02 1.739e+02 1.911e+02 2.085e+02 2.980e+02, threshold=3.822e+02, percent-clipped=0.0 2023-10-12 00:04:39,374 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=875868.0, ans=0.0 2023-10-12 00:04:46,607 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=875914.6666666666, ans=0.1 2023-10-12 00:04:54,817 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=875914.6666666666, ans=0.0 2023-10-12 00:05:02,706 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=875961.3333333334, ans=0.0 2023-10-12 00:05:06,536 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=875961.3333333334, ans=0.1 2023-10-12 00:05:28,484 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.02 vs. limit=15.0 2023-10-12 00:05:40,833 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=876101.3333333334, ans=0.04949747468305833 2023-10-12 00:05:41,769 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=876101.3333333334, ans=0.1 2023-10-12 00:05:47,739 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.85 vs. limit=15.0 2023-10-12 00:06:06,977 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.34 vs. limit=6.0 2023-10-12 00:06:07,842 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=876194.6666666666, ans=0.125 2023-10-12 00:06:10,096 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.64 vs. limit=10.0 2023-10-12 00:06:12,587 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=876241.3333333334, ans=0.0 2023-10-12 00:06:15,418 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=876241.3333333334, ans=0.125 2023-10-12 00:06:33,408 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.343e+02 1.735e+02 1.874e+02 1.979e+02 3.129e+02, threshold=3.749e+02, percent-clipped=0.0 2023-10-12 00:06:35,540 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=13.24 vs. 
limit=15.0 2023-10-12 00:06:36,620 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=876334.6666666666, ans=0.125 2023-10-12 00:06:46,872 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=876381.3333333334, ans=0.0 2023-10-12 00:06:50,924 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=876381.3333333334, ans=0.2 2023-10-12 00:07:01,575 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=876428.0, ans=0.125 2023-10-12 00:07:12,375 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=876474.6666666666, ans=0.0 2023-10-12 00:07:23,911 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=876521.3333333334, ans=0.125 2023-10-12 00:07:23,915 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=876521.3333333334, ans=0.0 2023-10-12 00:07:25,283 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.94 vs. limit=22.5 2023-10-12 00:07:28,180 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=876521.3333333334, ans=0.02 2023-10-12 00:07:43,218 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.90 vs. limit=6.0 2023-10-12 00:07:46,894 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=876614.6666666666, ans=0.125 2023-10-12 00:07:50,871 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=876614.6666666666, ans=0.125 2023-10-12 00:07:54,613 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.86 vs. limit=15.0 2023-10-12 00:07:58,051 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=876661.3333333334, ans=0.125 2023-10-12 00:08:06,423 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.73 vs. 
limit=15.0 2023-10-12 00:08:14,831 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=876708.0, ans=0.125 2023-10-12 00:08:17,758 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=876708.0, ans=0.2 2023-10-12 00:08:33,684 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.448e+02 1.702e+02 1.860e+02 2.093e+02 2.658e+02, threshold=3.720e+02, percent-clipped=0.0 2023-10-12 00:08:35,116 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=876801.3333333334, ans=0.05 2023-10-12 00:08:35,279 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=876801.3333333334, ans=15.0 2023-10-12 00:09:25,271 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=876988.0, ans=0.1 2023-10-12 00:09:25,465 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.68 vs. limit=15.0 2023-10-12 00:09:26,487 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.51 vs. limit=15.0 2023-10-12 00:09:31,948 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=877034.6666666666, ans=0.125 2023-10-12 00:09:48,013 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=877081.3333333334, ans=0.125 2023-10-12 00:10:08,189 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=877128.0, ans=10.0 2023-10-12 00:10:11,029 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=877128.0, ans=0.0 2023-10-12 00:10:27,770 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.06 vs. limit=6.0 2023-10-12 00:10:36,236 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.375e+02 1.688e+02 1.868e+02 2.213e+02 3.138e+02, threshold=3.736e+02, percent-clipped=0.0 2023-10-12 00:10:39,956 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=877268.0, ans=0.0 2023-10-12 00:10:53,488 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=877314.6666666666, ans=0.2 2023-10-12 00:11:00,926 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=877361.3333333334, ans=0.1 2023-10-12 00:11:06,885 INFO [train.py:1031] (3/4) Epoch 14, batch 10500, loss[loss=0.1947, simple_loss=0.2875, pruned_loss=0.0509, over 16261.00 frames. ], tot_loss[loss=0.1977, simple_loss=0.2876, pruned_loss=0.05384, over 32624887.44 frames. 
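Each [optim.py:471] entry logs Clipping_scale, the quartiles of recently observed gradient norms, the active clipping threshold, and the fraction of updates clipped. In every entry in this stretch the threshold is exactly Clipping_scale times the reported median, e.g. 2.0 x 1.860e+02 = 3.720e+02 in the entry above. A sketch of that rule, assuming a simple sliding window of norms (the window and helper name are illustrative; in icefall this logic lives inside the optimizer):

    import torch
    from collections import deque

    def clip_with_median_threshold(params: list, norm_history: deque,
                                   clipping_scale: float = 2.0) -> float:
        # Scale gradients down when their total norm exceeds
        # clipping_scale * median(recent total norms).
        grads = [p.grad for p in params if p.grad is not None]
        total_norm = torch.norm(torch.stack([g.norm() for g in grads])).item()
        norm_history.append(total_norm)  # deque(maxlen=...) keeps it recent
        median = sorted(norm_history)[len(norm_history) // 2]
        threshold = clipping_scale * median
        if total_norm > threshold:
            for g in grads:
                g.mul_(threshold / total_norm)
        return threshold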
], batch size: 50, lr: 2.49e-03, grad_scale: 16.0 2023-10-12 00:11:09,260 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=877408.0, ans=0.0 2023-10-12 00:11:13,585 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=877408.0, ans=0.125 2023-10-12 00:11:17,744 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=877454.6666666666, ans=0.125 2023-10-12 00:11:33,621 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=877501.3333333334, ans=0.2 2023-10-12 00:11:35,838 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=877501.3333333334, ans=0.0 2023-10-12 00:11:42,181 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=877548.0, ans=0.125 2023-10-12 00:11:46,979 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=877548.0, ans=0.125 2023-10-12 00:11:47,915 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=877548.0, ans=0.1 2023-10-12 00:11:48,002 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=877548.0, ans=0.0 2023-10-12 00:11:57,010 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.56 vs. limit=15.0 2023-10-12 00:12:00,562 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=877641.3333333334, ans=0.125 2023-10-12 00:12:10,363 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=877641.3333333334, ans=0.1 2023-10-12 00:12:18,648 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.39 vs. limit=15.0 2023-10-12 00:12:20,225 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=877688.0, ans=0.0 2023-10-12 00:12:28,932 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=877734.6666666666, ans=0.125 2023-10-12 00:12:29,626 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.449e+02 1.673e+02 1.835e+02 2.012e+02 2.804e+02, threshold=3.671e+02, percent-clipped=0.0 2023-10-12 00:12:50,537 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.58 vs. 
limit=22.5 2023-10-12 00:13:07,833 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=877828.0, ans=0.125 2023-10-12 00:13:14,801 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=877874.6666666666, ans=0.0 2023-10-12 00:13:38,451 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=877968.0, ans=0.125 2023-10-12 00:14:29,201 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.07 vs. limit=12.0 2023-10-12 00:14:35,817 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.413e+02 1.727e+02 1.906e+02 2.221e+02 3.186e+02, threshold=3.813e+02, percent-clipped=0.0 2023-10-12 00:15:16,693 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=878341.3333333334, ans=0.04949747468305833 2023-10-12 00:15:16,891 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=878341.3333333334, ans=15.0 2023-10-12 00:15:24,579 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=878388.0, ans=0.0 2023-10-12 00:15:28,682 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=878388.0, ans=0.125 2023-10-12 00:15:29,892 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.31 vs. limit=22.5 2023-10-12 00:15:32,983 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=878434.6666666666, ans=10.0 2023-10-12 00:15:46,304 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=878481.3333333334, ans=0.1 2023-10-12 00:15:56,252 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=878481.3333333334, ans=0.125 2023-10-12 00:16:00,194 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-12 00:16:02,198 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=878528.0, ans=0.2 2023-10-12 00:16:02,638 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.33 vs. 
limit=15.0 2023-10-12 00:16:08,571 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=878528.0, ans=0.0 2023-10-12 00:16:11,456 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=878574.6666666666, ans=0.125 2023-10-12 00:16:21,802 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=878574.6666666666, ans=0.125 2023-10-12 00:16:38,066 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.450e+02 1.709e+02 1.892e+02 2.177e+02 3.803e+02, threshold=3.785e+02, percent-clipped=0.0 2023-10-12 00:16:51,716 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 00:16:59,110 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=878761.3333333334, ans=0.1 2023-10-12 00:17:13,705 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=878808.0, ans=0.125 2023-10-12 00:17:36,281 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=878901.3333333334, ans=0.2 2023-10-12 00:17:40,432 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=2.98 vs. limit=12.0 2023-10-12 00:17:48,489 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=878948.0, ans=0.07 2023-10-12 00:17:55,169 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.37 vs. limit=15.0 2023-10-12 00:18:08,369 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=879041.3333333334, ans=0.1 2023-10-12 00:18:25,173 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 00:18:33,285 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.446e+02 1.755e+02 1.929e+02 2.238e+02 3.251e+02, threshold=3.858e+02, percent-clipped=0.0 2023-10-12 00:19:02,203 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=879228.0, ans=0.1 2023-10-12 00:19:22,414 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=879321.3333333334, ans=0.0 2023-10-12 00:19:24,738 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=879321.3333333334, ans=0.2 2023-10-12 00:19:37,738 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=879414.6666666666, ans=0.125 2023-10-12 00:20:01,140 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.41 vs. limit=6.0 2023-10-12 00:20:24,731 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.433e+02 1.693e+02 1.822e+02 2.046e+02 2.928e+02, threshold=3.644e+02, percent-clipped=0.0 2023-10-12 00:20:57,688 INFO [train.py:1031] (3/4) Epoch 14, batch 11000, loss[loss=0.1941, simple_loss=0.2826, pruned_loss=0.0528, over 16824.00 frames. 
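Most of the [scaling.py:199] traffic consists of ScheduledFloat entries: regularization constants (dropout_p, skip rates, balancer probabilities, min/max bounds) that are functions of batch_count rather than fixed hyperparameters, with `ans` giving the value currently in effect. A sketch of a piecewise-linear schedule of that general shape (the breakpoints are illustrative, not those of any module logged here):

    # Piecewise-linear schedule over batch_count, clamped at both ends.
    def scheduled_float(batch_count: float,
                        points: list[tuple[float, float]]) -> float:
        points = sorted(points)
        if batch_count <= points[0][0]:
            return points[0][1]
        for (x0, y0), (x1, y1) in zip(points, points[1:]):
            if batch_count <= x1:
                t = (batch_count - x0) / (x1 - x0)
                return y0 + t * (y1 - y0)
        return points[-1][1]

    # e.g. a skip rate annealed from 0.5 to 0.0 over the first 20k batches
    # is pinned at 0.0 by batch_count=878528.0, matching ans=0.0 above:
    print(scheduled_float(878528.0, [(0.0, 0.5), (20000.0, 0.0)]))  # -> 0.0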
], tot_loss[loss=0.1977, simple_loss=0.2876, pruned_loss=0.05393, over 32646116.41 frames. ], batch size: 72, lr: 2.48e-03, grad_scale: 32.0 2023-10-12 00:21:30,120 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=879881.3333333334, ans=0.07 2023-10-12 00:21:54,159 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=879974.6666666666, ans=0.125 2023-10-12 00:21:55,387 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=879974.6666666666, ans=0.125 2023-10-12 00:21:55,611 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.80 vs. limit=15.0 2023-10-12 00:22:06,717 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=879974.6666666666, ans=0.0 2023-10-12 00:22:23,765 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.418e+02 1.872e+02 2.084e+02 2.378e+02 3.452e+02, threshold=4.168e+02, percent-clipped=0.0 2023-10-12 00:22:26,782 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=880068.0, ans=0.0 2023-10-12 00:22:34,887 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.11 vs. limit=15.0 2023-10-12 00:23:03,658 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 00:23:12,928 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=880208.0, ans=0.125 2023-10-12 00:23:19,080 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=880254.6666666666, ans=0.0 2023-10-12 00:23:20,053 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 00:23:32,769 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=880301.3333333334, ans=0.125 2023-10-12 00:23:36,110 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.89 vs. limit=15.0 2023-10-12 00:23:54,736 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=880394.6666666666, ans=0.0 2023-10-12 00:24:04,585 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=880441.3333333334, ans=0.125 2023-10-12 00:24:14,440 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=880488.0, ans=0.0 2023-10-12 00:24:28,996 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.341e+02 1.680e+02 1.847e+02 2.090e+02 3.430e+02, threshold=3.695e+02, percent-clipped=0.0 2023-10-12 00:24:30,684 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.95 vs. 
limit=22.5 2023-10-12 00:24:34,814 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 00:25:03,361 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=880674.6666666666, ans=0.125 2023-10-12 00:25:10,915 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.73 vs. limit=15.0 2023-10-12 00:25:11,738 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=880674.6666666666, ans=0.0 2023-10-12 00:25:15,714 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=880721.3333333334, ans=0.1 2023-10-12 00:25:16,624 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=880721.3333333334, ans=0.125 2023-10-12 00:25:17,809 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=880721.3333333334, ans=0.125 2023-10-12 00:25:18,763 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=880721.3333333334, ans=0.1 2023-10-12 00:25:19,096 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=6.28 vs. limit=15.0 2023-10-12 00:25:32,596 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=880768.0, ans=0.2 2023-10-12 00:25:37,527 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=880814.6666666666, ans=0.2 2023-10-12 00:26:09,646 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=880908.0, ans=0.0 2023-10-12 00:26:16,563 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.72 vs. 
limit=10.0 2023-10-12 00:26:30,749 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.451e+02 1.724e+02 1.831e+02 2.113e+02 3.338e+02, threshold=3.662e+02, percent-clipped=0.0 2023-10-12 00:26:32,832 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 00:26:48,063 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=881048.0, ans=0.125 2023-10-12 00:26:57,907 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=881094.6666666666, ans=0.0 2023-10-12 00:27:17,551 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=881188.0, ans=0.07 2023-10-12 00:27:19,018 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=881188.0, ans=0.1 2023-10-12 00:27:21,130 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=881188.0, ans=0.1 2023-10-12 00:27:45,306 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=881281.3333333334, ans=0.125 2023-10-12 00:28:04,536 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=881374.6666666666, ans=0.2 2023-10-12 00:28:10,650 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=881374.6666666666, ans=0.0 2023-10-12 00:28:24,477 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=881421.3333333334, ans=0.0 2023-10-12 00:28:31,868 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.423e+02 1.677e+02 1.867e+02 2.089e+02 3.084e+02, threshold=3.734e+02, percent-clipped=0.0 2023-10-12 00:28:32,154 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=881468.0, ans=0.125 2023-10-12 00:28:49,325 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=881514.6666666666, ans=0.1 2023-10-12 00:29:35,573 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=881701.3333333334, ans=0.125 2023-10-12 00:29:50,415 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=881794.6666666666, ans=0.1 2023-10-12 00:30:21,581 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.71 vs. 
limit=15.0 2023-10-12 00:30:22,245 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=881888.0, ans=0.1 2023-10-12 00:30:26,373 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=881934.6666666666, ans=0.0 2023-10-12 00:30:27,155 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.485e+02 1.797e+02 1.997e+02 2.220e+02 3.073e+02, threshold=3.994e+02, percent-clipped=0.0 2023-10-12 00:30:29,376 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=881934.6666666666, ans=0.5 2023-10-12 00:30:35,197 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=881981.3333333334, ans=0.125 2023-10-12 00:30:45,831 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=882028.0, ans=0.0 2023-10-12 00:30:48,150 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.66 vs. limit=15.0 2023-10-12 00:30:50,948 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=882028.0, ans=0.0 2023-10-12 00:30:52,080 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=882028.0, ans=0.1 2023-10-12 00:30:54,365 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.03 vs. limit=12.0 2023-10-12 00:30:56,691 INFO [train.py:1031] (3/4) Epoch 14, batch 11500, loss[loss=0.2059, simple_loss=0.3028, pruned_loss=0.05444, over 16911.00 frames. ], tot_loss[loss=0.1975, simple_loss=0.2873, pruned_loss=0.05385, over 32658301.87 frames. ], batch size: 123, lr: 2.48e-03, grad_scale: 32.0 2023-10-12 00:30:58,847 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=882074.6666666666, ans=0.1 2023-10-12 00:31:35,122 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.45 vs. limit=15.0 2023-10-12 00:31:49,327 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.64 vs. limit=15.0 2023-10-12 00:31:54,207 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=882308.0, ans=0.1 2023-10-12 00:31:55,389 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=882308.0, ans=0.035 2023-10-12 00:32:02,293 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.09 vs. 
limit=15.0 2023-10-12 00:32:06,963 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=882308.0, ans=0.125 2023-10-12 00:32:15,489 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=882354.6666666666, ans=0.2 2023-10-12 00:32:16,690 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=882354.6666666666, ans=0.0 2023-10-12 00:32:23,992 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.432e+02 1.746e+02 1.940e+02 2.216e+02 3.510e+02, threshold=3.881e+02, percent-clipped=0.0 2023-10-12 00:32:26,286 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=882401.3333333334, ans=0.125 2023-10-12 00:33:12,892 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=882588.0, ans=0.125 2023-10-12 00:33:21,036 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.29 vs. limit=15.0 2023-10-12 00:33:21,795 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=882588.0, ans=0.0 2023-10-12 00:33:26,131 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.31 vs. limit=10.0 2023-10-12 00:33:41,658 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.26 vs. limit=15.0 2023-10-12 00:33:46,326 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=882681.3333333334, ans=0.0 2023-10-12 00:34:02,465 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=882774.6666666666, ans=0.125 2023-10-12 00:34:05,345 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=882774.6666666666, ans=0.125 2023-10-12 00:34:18,795 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=882821.3333333334, ans=0.125 2023-10-12 00:34:22,696 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=882868.0, ans=0.0 2023-10-12 00:34:24,337 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.363e+02 1.612e+02 1.779e+02 1.978e+02 2.703e+02, threshold=3.557e+02, percent-clipped=0.0 2023-10-12 00:34:31,687 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=882914.6666666666, ans=0.2 2023-10-12 00:34:31,728 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=882914.6666666666, ans=0.125 2023-10-12 00:34:43,464 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=882961.3333333334, ans=0.0 2023-10-12 00:34:51,067 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=882961.3333333334, ans=0.09899494936611666 2023-10-12 00:35:24,540 INFO 
[scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=883101.3333333334, ans=0.1 2023-10-12 00:35:26,590 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=883148.0, ans=0.2 2023-10-12 00:35:28,470 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=883148.0, ans=0.125 2023-10-12 00:35:30,653 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=883148.0, ans=0.125 2023-10-12 00:35:30,704 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=883148.0, ans=0.125 2023-10-12 00:35:41,388 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=883194.6666666666, ans=0.035 2023-10-12 00:35:41,531 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=883194.6666666666, ans=0.04949747468305833 2023-10-12 00:36:16,700 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=883288.0, ans=0.125 2023-10-12 00:36:19,434 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=883288.0, ans=0.125 2023-10-12 00:36:27,316 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=883334.6666666666, ans=0.0 2023-10-12 00:36:28,787 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.412e+02 1.684e+02 1.864e+02 2.084e+02 2.967e+02, threshold=3.727e+02, percent-clipped=0.0 2023-10-12 00:36:31,597 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=883334.6666666666, ans=0.1 2023-10-12 00:37:05,428 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=883474.6666666666, ans=0.125 2023-10-12 00:37:12,557 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.47 vs. limit=22.5 2023-10-12 00:37:19,368 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=883521.3333333334, ans=0.0 2023-10-12 00:37:47,361 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=883614.6666666666, ans=0.125 2023-10-12 00:38:03,657 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.69 vs. 
limit=12.0 2023-10-12 00:38:18,049 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=883754.6666666666, ans=0.125 2023-10-12 00:38:28,344 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.511e+02 1.726e+02 1.869e+02 2.062e+02 2.981e+02, threshold=3.738e+02, percent-clipped=0.0 2023-10-12 00:38:40,261 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=883848.0, ans=0.1 2023-10-12 00:38:44,575 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=883848.0, ans=0.2 2023-10-12 00:38:55,591 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=883894.6666666666, ans=0.0 2023-10-12 00:39:34,434 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=884034.6666666666, ans=0.1 2023-10-12 00:39:41,163 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=884081.3333333334, ans=0.2 2023-10-12 00:39:50,173 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=884128.0, ans=0.125 2023-10-12 00:39:50,245 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=884128.0, ans=0.2 2023-10-12 00:40:18,802 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=884221.3333333334, ans=0.1 2023-10-12 00:40:22,327 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.61 vs. limit=6.0 2023-10-12 00:40:26,504 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=884268.0, ans=0.1 2023-10-12 00:40:29,349 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.236e+02 1.657e+02 1.815e+02 2.009e+02 2.623e+02, threshold=3.629e+02, percent-clipped=0.0 2023-10-12 00:40:29,872 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.54 vs. limit=10.0 2023-10-12 00:40:46,874 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=884361.3333333334, ans=0.125 2023-10-12 00:40:46,913 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=884361.3333333334, ans=0.125 2023-10-12 00:40:55,356 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=884361.3333333334, ans=0.0 2023-10-12 00:40:56,899 INFO [train.py:1031] (3/4) Epoch 14, batch 12000, loss[loss=0.2017, simple_loss=0.2924, pruned_loss=0.05554, over 16938.00 frames. ], tot_loss[loss=0.1972, simple_loss=0.2873, pruned_loss=0.05354, over 32669018.60 frames. 
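The [scaling.py:979] Whitening entries report metric vs. limit for individual submodules (e.g. metric=3.61 vs. limit=6.0 for self_attn_weights.whiten_keys above). The metric measures how far the channel covariance of the activations is from white: it equals 1.0 when the covariance is a multiple of the identity and grows with the eigenvalue spread, and each printout compares the current value against the module's whitening limit. One way to compute such a statistic, mean(eig^2) / mean(eig)^2 of the per-group covariance via traces; the exact normalization in scaling.py may differ:

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int) -> float:
        # x: (num_frames, num_channels). Split channels into groups and
        # measure the eigenvalue spread of each group's covariance:
        # mean(eig^2) / mean(eig)^2 == 1.0 iff covariance is isotropic.
        num_frames, num_channels = x.shape
        x = x.reshape(num_frames, num_groups, num_channels // num_groups)
        x = x.transpose(0, 1)                       # (groups, frames, chans)
        cov = x.transpose(1, 2) @ x / num_frames    # per-group covariance
        mean_eig = cov.diagonal(dim1=1, dim2=2).mean(dim=1)
        mean_eig_sq = (cov @ cov).diagonal(dim1=1, dim2=2).mean(dim=1)
        return (mean_eig_sq / (mean_eig ** 2 + 1e-20)).mean().item()

    x = torch.randn(1000, 128)                      # near-white input
    print(whitening_metric(x, num_groups=4))        # close to 1.0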
], batch size: 77, lr: 2.48e-03, grad_scale: 16.0 2023-10-12 00:41:06,663 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 00:41:25,540 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=884501.3333333334, ans=0.2 2023-10-12 00:41:55,013 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.78 vs. limit=15.0 2023-10-12 00:42:16,218 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=884688.0, ans=0.0 2023-10-12 00:42:21,911 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=884734.6666666666, ans=0.125 2023-10-12 00:42:25,990 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.07 vs. limit=12.0 2023-10-12 00:42:28,504 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.322e+02 1.634e+02 1.824e+02 2.015e+02 2.957e+02, threshold=3.649e+02, percent-clipped=0.0 2023-10-12 00:42:34,263 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=884781.3333333334, ans=0.125 2023-10-12 00:42:46,252 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=884828.0, ans=0.09899494936611666 2023-10-12 00:43:06,993 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 00:43:21,937 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=884968.0, ans=0.025 2023-10-12 00:43:27,883 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=884968.0, ans=0.2 2023-10-12 00:43:38,967 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.76 vs. 
limit=15.0 2023-10-12 00:43:39,591 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=885014.6666666666, ans=0.125 2023-10-12 00:43:59,165 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=885108.0, ans=0.5 2023-10-12 00:44:11,616 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=885154.6666666666, ans=0.1 2023-10-12 00:44:13,950 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=885154.6666666666, ans=0.0 2023-10-12 00:44:20,504 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.344e+02 1.634e+02 1.794e+02 2.044e+02 3.142e+02, threshold=3.587e+02, percent-clipped=0.0 2023-10-12 00:44:22,340 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=885201.3333333334, ans=0.0 2023-10-12 00:44:43,064 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=885294.6666666666, ans=0.125 2023-10-12 00:44:52,316 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=885341.3333333334, ans=0.125 2023-10-12 00:44:54,634 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.64 vs. limit=6.0 2023-10-12 00:45:03,975 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=885388.0, ans=0.2 2023-10-12 00:45:16,519 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.00 vs. limit=10.0 2023-10-12 00:45:24,105 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=885481.3333333334, ans=0.125 2023-10-12 00:45:52,077 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=885574.6666666666, ans=0.0 2023-10-12 00:45:55,841 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=885621.3333333334, ans=0.125 2023-10-12 00:46:10,646 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.443e+02 1.833e+02 2.055e+02 2.426e+02 4.382e+02, threshold=4.110e+02, percent-clipped=3.0 2023-10-12 00:46:20,516 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=885714.6666666666, ans=0.125 2023-10-12 00:46:38,347 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=885761.3333333334, ans=0.2 2023-10-12 00:46:54,122 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=885854.6666666666, ans=0.1 2023-10-12 00:46:56,109 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.38 vs. limit=10.0 2023-10-12 00:47:01,694 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.91 vs. 
limit=15.0 2023-10-12 00:47:04,561 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=885901.3333333334, ans=0.125 2023-10-12 00:47:05,826 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.78 vs. limit=12.0 2023-10-12 00:47:16,677 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=885948.0, ans=0.1 2023-10-12 00:47:26,336 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=885994.6666666666, ans=0.0 2023-10-12 00:47:34,516 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=885994.6666666666, ans=0.04949747468305833 2023-10-12 00:47:52,667 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=886088.0, ans=0.125 2023-10-12 00:48:07,460 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.390e+02 1.772e+02 1.986e+02 2.211e+02 3.059e+02, threshold=3.972e+02, percent-clipped=0.0 2023-10-12 00:48:18,454 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=886181.3333333334, ans=0.1 2023-10-12 00:48:34,422 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=886274.6666666666, ans=0.125 2023-10-12 00:48:44,689 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=886321.3333333334, ans=0.125 2023-10-12 00:48:50,390 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=886321.3333333334, ans=0.1 2023-10-12 00:49:20,133 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.07 vs. limit=6.0 2023-10-12 00:49:35,973 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.12 vs. limit=22.5 2023-10-12 00:49:52,446 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=886554.6666666666, ans=0.0 2023-10-12 00:50:02,437 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.432e+02 1.792e+02 1.959e+02 2.367e+02 3.427e+02, threshold=3.918e+02, percent-clipped=0.0 2023-10-12 00:50:11,361 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=886648.0, ans=0.125 2023-10-12 00:50:25,912 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=886694.6666666666, ans=0.125 2023-10-12 00:50:25,940 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=886694.6666666666, ans=0.2 2023-10-12 00:50:26,004 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=886694.6666666666, ans=0.125 2023-10-12 00:50:32,066 INFO [train.py:1031] (3/4) Epoch 14, batch 12500, loss[loss=0.2095, simple_loss=0.3033, pruned_loss=0.05788, over 16973.00 frames. 
], tot_loss[loss=0.197, simple_loss=0.287, pruned_loss=0.05346, over 32704586.83 frames. ], batch size: 117, lr: 2.47e-03, grad_scale: 32.0 2023-10-12 00:50:40,505 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=886741.3333333334, ans=0.1 2023-10-12 00:50:42,498 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 00:50:42,562 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=886788.0, ans=0.125 2023-10-12 00:50:43,540 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=886788.0, ans=0.1 2023-10-12 00:50:44,919 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=886788.0, ans=0.125 2023-10-12 00:50:45,845 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=886788.0, ans=0.0 2023-10-12 00:51:33,613 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=886974.6666666666, ans=0.125 2023-10-12 00:51:34,583 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=886974.6666666666, ans=0.125 2023-10-12 00:51:38,385 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=887021.3333333334, ans=0.1 2023-10-12 00:51:43,838 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=887021.3333333334, ans=0.125 2023-10-12 00:51:47,560 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=887021.3333333334, ans=0.125 2023-10-12 00:51:47,831 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.29 vs. 
limit=12.0 2023-10-12 00:51:53,933 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.408e+02 1.700e+02 1.887e+02 2.071e+02 3.228e+02, threshold=3.775e+02, percent-clipped=0.0 2023-10-12 00:52:03,831 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=887114.6666666666, ans=0.125 2023-10-12 00:52:04,769 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=887114.6666666666, ans=0.1 2023-10-12 00:52:41,969 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=887254.6666666666, ans=0.05 2023-10-12 00:52:54,564 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=887301.3333333334, ans=0.125 2023-10-12 00:53:17,037 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=887394.6666666666, ans=0.125 2023-10-12 00:53:49,227 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=887534.6666666666, ans=10.0 2023-10-12 00:53:50,836 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.415e+02 1.677e+02 1.838e+02 2.019e+02 2.930e+02, threshold=3.675e+02, percent-clipped=0.0 2023-10-12 00:53:56,129 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=887581.3333333334, ans=0.125 2023-10-12 00:53:57,043 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 00:54:00,529 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=887581.3333333334, ans=0.125 2023-10-12 00:54:13,402 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=887628.0, ans=0.0 2023-10-12 00:54:23,127 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=887674.6666666666, ans=0.125 2023-10-12 00:54:27,337 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=887721.3333333334, ans=0.1 2023-10-12 00:54:38,955 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=887768.0, ans=0.2 2023-10-12 00:54:39,211 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.84 vs. limit=15.0 2023-10-12 00:55:10,438 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.61 vs. limit=15.0 2023-10-12 00:55:33,511 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=887954.6666666666, ans=0.125 2023-10-12 00:55:35,959 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.21 vs. 
limit=15.0 2023-10-12 00:55:41,207 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.449e+02 1.756e+02 1.960e+02 2.272e+02 3.286e+02, threshold=3.919e+02, percent-clipped=0.0 2023-10-12 00:55:49,912 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=888048.0, ans=0.2 2023-10-12 00:55:50,000 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=888048.0, ans=0.1 2023-10-12 00:56:02,069 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=888094.6666666666, ans=0.125 2023-10-12 00:56:13,381 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.84 vs. limit=15.0 2023-10-12 00:56:15,664 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=888141.3333333334, ans=0.125 2023-10-12 00:56:18,454 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=888188.0, ans=0.125 2023-10-12 00:56:30,686 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.88 vs. limit=15.0 2023-10-12 00:56:38,356 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=888234.6666666666, ans=0.125 2023-10-12 00:57:03,294 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.01 vs. 
limit=6.0 2023-10-12 00:57:06,529 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=888374.6666666666, ans=0.0 2023-10-12 00:57:28,937 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=888468.0, ans=0.125 2023-10-12 00:57:33,240 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.419e+02 1.719e+02 1.917e+02 2.094e+02 3.080e+02, threshold=3.835e+02, percent-clipped=0.0 2023-10-12 00:58:01,037 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=888608.0, ans=0.125 2023-10-12 00:58:01,974 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=888608.0, ans=0.04949747468305833 2023-10-12 00:58:06,381 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=888608.0, ans=0.0 2023-10-12 00:58:42,302 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=888794.6666666666, ans=0.125 2023-10-12 00:58:48,922 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=888794.6666666666, ans=0.1 2023-10-12 00:58:53,324 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_abs, batch_count=888794.6666666666, ans=0.5 2023-10-12 00:59:03,959 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=888841.3333333334, ans=0.125 2023-10-12 00:59:18,735 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=888934.6666666666, ans=0.125 2023-10-12 00:59:22,224 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.413e+02 1.661e+02 1.820e+02 2.036e+02 2.890e+02, threshold=3.639e+02, percent-clipped=0.0 2023-10-12 00:59:25,007 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=888934.6666666666, ans=0.0 2023-10-12 00:59:33,205 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=888981.3333333334, ans=0.125 2023-10-12 00:59:39,977 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=889028.0, ans=0.125 2023-10-12 00:59:47,255 INFO [train.py:1031] (3/4) Epoch 14, batch 13000, loss[loss=0.1969, simple_loss=0.2554, pruned_loss=0.06918, over 12233.00 frames. ], tot_loss[loss=0.1974, simple_loss=0.2877, pruned_loss=0.05359, over 32728290.44 frames. 
], batch size: 440, lr: 2.47e-03, grad_scale: 32.0 2023-10-12 00:59:57,147 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=889074.6666666666, ans=0.2 2023-10-12 00:59:59,005 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=889121.3333333334, ans=0.125 2023-10-12 01:00:19,466 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=889168.0, ans=0.1 2023-10-12 01:00:26,617 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=889214.6666666666, ans=0.125 2023-10-12 01:00:29,319 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=889214.6666666666, ans=0.02 2023-10-12 01:00:32,150 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=889214.6666666666, ans=0.04949747468305833 2023-10-12 01:00:39,035 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=889261.3333333334, ans=0.125 2023-10-12 01:00:54,072 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=889308.0, ans=0.125 2023-10-12 01:01:05,162 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.18 vs. limit=15.0 2023-10-12 01:01:11,969 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 01:01:13,829 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=889401.3333333334, ans=0.1 2023-10-12 01:01:18,773 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=889401.3333333334, ans=0.0 2023-10-12 01:01:21,883 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.368e+02 1.683e+02 1.846e+02 2.093e+02 2.698e+02, threshold=3.693e+02, percent-clipped=0.0 2023-10-12 01:01:24,576 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=889401.3333333334, ans=0.04949747468305833 2023-10-12 01:01:33,578 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=889448.0, ans=0.125 2023-10-12 01:01:41,576 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=889494.6666666666, ans=0.125 2023-10-12 01:01:46,164 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.max_positive, batch_count=889494.6666666666, ans=0.95 2023-10-12 01:02:09,508 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=889588.0, ans=0.0 2023-10-12 01:02:18,155 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=889634.6666666666, ans=0.125 2023-10-12 01:02:18,195 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=889634.6666666666, ans=0.2 2023-10-12 
01:02:26,460 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=889681.3333333334, ans=0.5 2023-10-12 01:02:42,440 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=889728.0, ans=0.125 2023-10-12 01:02:56,410 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=889821.3333333334, ans=0.125 2023-10-12 01:03:01,291 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=889821.3333333334, ans=0.1 2023-10-12 01:03:02,474 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.23 vs. limit=6.0 2023-10-12 01:03:07,447 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.70 vs. limit=15.0 2023-10-12 01:03:14,531 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.472e+02 1.688e+02 1.906e+02 2.081e+02 2.924e+02, threshold=3.811e+02, percent-clipped=0.0 2023-10-12 01:03:31,190 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=889914.6666666666, ans=0.2 2023-10-12 01:03:36,104 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=889961.3333333334, ans=0.125 2023-10-12 01:03:36,463 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.73 vs. limit=10.0 2023-10-12 01:03:40,965 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=889961.3333333334, ans=0.0 2023-10-12 01:03:43,496 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=889961.3333333334, ans=0.1 2023-10-12 01:03:50,170 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.44 vs. limit=15.0 2023-10-12 01:03:54,614 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=890008.0, ans=0.125 2023-10-12 01:04:11,774 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.93 vs. limit=15.0 2023-10-12 01:04:19,892 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.84 vs. limit=22.5 2023-10-12 01:04:30,443 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.16 vs. 
limit=15.0 2023-10-12 01:04:32,699 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=890194.6666666666, ans=0.125 2023-10-12 01:04:37,435 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=890194.6666666666, ans=0.95 2023-10-12 01:05:01,622 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=890288.0, ans=0.0 2023-10-12 01:05:13,169 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.346e+02 1.749e+02 1.973e+02 2.249e+02 3.003e+02, threshold=3.947e+02, percent-clipped=0.0 2023-10-12 01:05:25,020 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=9.60 vs. limit=12.0 2023-10-12 01:05:42,436 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=890474.6666666666, ans=0.1 2023-10-12 01:05:48,392 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=890474.6666666666, ans=0.1 2023-10-12 01:06:12,094 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.73 vs. limit=15.0 2023-10-12 01:06:24,207 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=890661.3333333334, ans=0.125 2023-10-12 01:06:32,037 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=890661.3333333334, ans=0.125 2023-10-12 01:06:34,026 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=890708.0, ans=0.125 2023-10-12 01:06:36,767 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.65 vs. limit=15.0 2023-10-12 01:06:54,847 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=890754.6666666666, ans=0.125 2023-10-12 01:07:00,468 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=890801.3333333334, ans=0.125 2023-10-12 01:07:03,544 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.08 vs. 
limit=15.0 2023-10-12 01:07:05,422 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.391e+02 1.725e+02 1.866e+02 2.068e+02 2.711e+02, threshold=3.733e+02, percent-clipped=0.0 2023-10-12 01:07:08,952 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=890848.0, ans=0.125 2023-10-12 01:07:13,006 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=890848.0, ans=0.0 2023-10-12 01:07:17,356 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=890848.0, ans=0.0 2023-10-12 01:07:26,565 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=890894.6666666666, ans=0.0 2023-10-12 01:07:26,938 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.48 vs. limit=10.0 2023-10-12 01:07:52,408 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=890988.0, ans=0.125 2023-10-12 01:08:09,750 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=891081.3333333334, ans=0.0 2023-10-12 01:08:32,435 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=891174.6666666666, ans=0.0 2023-10-12 01:08:56,989 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.404e+02 1.678e+02 1.829e+02 2.107e+02 2.945e+02, threshold=3.659e+02, percent-clipped=0.0 2023-10-12 01:08:58,014 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=891268.0, ans=0.125 2023-10-12 01:09:00,005 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 01:09:02,289 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=891314.6666666666, ans=10.0 2023-10-12 01:09:05,739 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=15.83 vs. limit=22.5 2023-10-12 01:09:08,281 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=891314.6666666666, ans=0.0 2023-10-12 01:09:23,032 INFO [train.py:1031] (3/4) Epoch 14, batch 13500, loss[loss=0.1921, simple_loss=0.2839, pruned_loss=0.05019, over 16962.00 frames. ], tot_loss[loss=0.1971, simple_loss=0.287, pruned_loss=0.05354, over 32727100.39 frames. 
], batch size: 110, lr: 2.47e-03, grad_scale: 16.0 2023-10-12 01:09:23,253 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=891408.0, ans=0.0 2023-10-12 01:09:28,868 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=891408.0, ans=0.0 2023-10-12 01:09:44,451 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=891501.3333333334, ans=0.125 2023-10-12 01:09:46,175 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=891501.3333333334, ans=0.125 2023-10-12 01:09:49,144 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=891501.3333333334, ans=0.07 2023-10-12 01:09:53,473 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=891501.3333333334, ans=0.125 2023-10-12 01:09:55,139 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.83 vs. limit=22.5 2023-10-12 01:10:09,295 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=891548.0, ans=0.0 2023-10-12 01:10:17,154 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=891594.6666666666, ans=0.1 2023-10-12 01:10:22,620 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.27 vs. limit=15.0 2023-10-12 01:10:30,803 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=891641.3333333334, ans=0.125 2023-10-12 01:10:52,116 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.396e+02 1.702e+02 1.952e+02 2.238e+02 3.226e+02, threshold=3.905e+02, percent-clipped=0.0 2023-10-12 01:11:20,555 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=25.10 vs. limit=22.5 2023-10-12 01:11:25,215 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=891874.6666666666, ans=0.125 2023-10-12 01:11:34,282 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=891921.3333333334, ans=0.0 2023-10-12 01:11:46,610 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=891968.0, ans=0.125 2023-10-12 01:11:46,776 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.87 vs. limit=15.0 2023-10-12 01:11:50,787 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 01:11:51,678 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=892014.6666666666, ans=0.2 2023-10-12 01:12:41,995 INFO [train.py:1031] (3/4) Epoch 15, batch 0, loss[loss=0.1774, simple_loss=0.2671, pruned_loss=0.04389, over 16534.00 frames. 
], tot_loss[loss=0.1774, simple_loss=0.2671, pruned_loss=0.04389, over 16534.00 frames. ], batch size: 66, lr: 2.38e-03, grad_scale: 32.0 2023-10-12 01:12:41,996 INFO [train.py:1054] (3/4) Computing validation loss 2023-10-12 01:12:50,459 INFO [train.py:1063] (3/4) Epoch 15, validation: loss=0.2176, simple_loss=0.3045, pruned_loss=0.06534, over 1020973.00 frames. 2023-10-12 01:12:50,459 INFO [train.py:1064] (3/4) Maximum memory allocated so far is 16891MB 2023-10-12 01:12:51,243 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.32 vs. limit=12.0 2023-10-12 01:12:58,031 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=892131.3333333334, ans=0.125 2023-10-12 01:13:14,717 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.539e+02 1.767e+02 2.003e+02 2.262e+02 3.139e+02, threshold=4.007e+02, percent-clipped=0.0 2023-10-12 01:13:15,894 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=892224.6666666666, ans=0.2 2023-10-12 01:13:36,709 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=892318.0, ans=0.125 2023-10-12 01:13:50,469 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=892364.6666666666, ans=0.0 2023-10-12 01:14:30,515 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=892504.6666666666, ans=0.2 2023-10-12 01:14:42,824 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=892598.0, ans=0.2 2023-10-12 01:14:52,086 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=892598.0, ans=0.125 2023-10-12 01:14:59,708 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=892644.6666666666, ans=0.0 2023-10-12 01:15:06,842 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.407e+02 1.659e+02 1.896e+02 2.141e+02 3.016e+02, threshold=3.793e+02, percent-clipped=0.0 2023-10-12 01:15:07,053 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=892691.3333333334, ans=0.125 2023-10-12 01:15:12,805 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=6.95 vs. 
limit=15.0 2023-10-12 01:16:04,521 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=892924.6666666666, ans=0.0 2023-10-12 01:16:13,604 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=892971.3333333334, ans=0.2 2023-10-12 01:16:21,435 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=892971.3333333334, ans=0.125 2023-10-12 01:16:39,917 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_na.min_abs, batch_count=893064.6666666666, ans=0.02 2023-10-12 01:16:47,978 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=893111.3333333334, ans=0.0 2023-10-12 01:16:59,927 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.64 vs. limit=6.0 2023-10-12 01:17:00,064 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.439e+02 1.761e+02 1.935e+02 2.147e+02 4.385e+02, threshold=3.870e+02, percent-clipped=1.0 2023-10-12 01:17:04,160 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=2.518e-03 2023-10-12 01:17:04,922 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=893158.0, ans=0.125 2023-10-12 01:17:04,938 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=893158.0, ans=0.125 2023-10-12 01:17:20,555 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=893251.3333333334, ans=0.0 2023-10-12 01:18:20,783 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=893438.0, ans=0.125 2023-10-12 01:18:21,802 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=893484.6666666666, ans=0.125 2023-10-12 01:18:46,171 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=893578.0, ans=0.125 2023-10-12 01:18:51,735 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=893578.0, ans=0.125 2023-10-12 01:18:52,791 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=893578.0, ans=0.0 2023-10-12 01:18:52,801 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=893578.0, ans=0.125 2023-10-12 01:18:58,066 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.31 vs. 
limit=22.5 2023-10-12 01:18:59,173 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.406e+02 1.692e+02 1.850e+02 2.090e+02 3.456e+02, threshold=3.699e+02, percent-clipped=0.0 2023-10-12 01:19:18,458 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=893718.0, ans=0.125 2023-10-12 01:19:38,346 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.46 vs. limit=12.0 2023-10-12 01:20:05,882 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=893904.6666666666, ans=0.015 2023-10-12 01:20:13,716 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=893951.3333333334, ans=0.125 2023-10-12 01:20:16,395 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=893951.3333333334, ans=0.125 2023-10-12 01:20:34,693 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=894044.6666666666, ans=0.125 2023-10-12 01:20:42,001 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=894044.6666666666, ans=0.125 2023-10-12 01:20:46,481 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.436e+02 1.727e+02 1.933e+02 2.131e+02 3.001e+02, threshold=3.867e+02, percent-clipped=0.0 2023-10-12 01:21:09,441 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=894184.6666666666, ans=0.2 2023-10-12 01:21:35,349 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=894278.0, ans=0.0 2023-10-12 01:21:36,605 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=894278.0, ans=0.2 2023-10-12 01:21:45,885 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=894324.6666666666, ans=0.0 2023-10-12 01:21:55,396 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.70 vs. limit=15.0 2023-10-12 01:22:04,949 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.77 vs. limit=15.0 2023-10-12 01:22:14,580 INFO [train.py:1031] (3/4) Epoch 15, batch 500, loss[loss=0.2012, simple_loss=0.286, pruned_loss=0.05825, over 16944.00 frames. ], tot_loss[loss=0.1975, simple_loss=0.2877, pruned_loss=0.05371, over 7292650.30 frames. 
], batch size: 110, lr: 2.38e-03, grad_scale: 32.0 2023-10-12 01:22:32,226 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=894511.3333333334, ans=0.1 2023-10-12 01:22:40,697 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.515e+02 1.749e+02 1.950e+02 2.198e+02 3.048e+02, threshold=3.900e+02, percent-clipped=0.0 2023-10-12 01:22:42,939 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 01:22:54,074 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=894604.6666666666, ans=22.5 2023-10-12 01:23:02,535 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.94 vs. limit=22.5 2023-10-12 01:23:03,088 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=894651.3333333334, ans=0.04949747468305833 2023-10-12 01:23:04,222 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 01:23:19,075 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=894744.6666666666, ans=0.05 2023-10-12 01:23:24,166 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=894744.6666666666, ans=0.125 2023-10-12 01:23:25,522 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=894744.6666666666, ans=0.1 2023-10-12 01:23:28,337 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=894744.6666666666, ans=0.125 2023-10-12 01:23:33,055 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=894791.3333333334, ans=0.0 2023-10-12 01:23:37,739 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.71 vs. 
limit=12.0 2023-10-12 01:23:40,915 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=894838.0, ans=0.0 2023-10-12 01:24:01,086 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=894884.6666666666, ans=0.125 2023-10-12 01:24:11,926 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=894931.3333333334, ans=0.0 2023-10-12 01:24:27,622 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.451e+02 1.771e+02 1.956e+02 2.138e+02 2.981e+02, threshold=3.912e+02, percent-clipped=0.0 2023-10-12 01:24:28,032 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=895024.6666666666, ans=0.125 2023-10-12 01:24:39,276 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=895071.3333333334, ans=0.1 2023-10-12 01:24:50,215 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=895118.0, ans=0.125 2023-10-12 01:24:57,766 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.24 vs. limit=15.0 2023-10-12 01:25:00,379 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=895164.6666666666, ans=0.05 2023-10-12 01:25:01,266 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=895164.6666666666, ans=0.125 2023-10-12 01:25:15,481 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.32 vs. limit=6.0 2023-10-12 01:25:32,943 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=895304.6666666666, ans=0.07 2023-10-12 01:25:37,525 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=7.03 vs. 
limit=10.0 2023-10-12 01:25:38,202 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=895304.6666666666, ans=0.1 2023-10-12 01:25:42,452 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=895304.6666666666, ans=0.1 2023-10-12 01:25:42,504 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=895304.6666666666, ans=0.0 2023-10-12 01:25:46,610 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=895351.3333333334, ans=0.125 2023-10-12 01:25:54,564 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=895398.0, ans=0.125 2023-10-12 01:25:58,764 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=895398.0, ans=0.5 2023-10-12 01:26:00,573 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=895398.0, ans=0.0 2023-10-12 01:26:00,669 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=895398.0, ans=0.5 2023-10-12 01:26:10,702 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.71 vs. limit=15.0 2023-10-12 01:26:11,962 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=895444.6666666666, ans=0.0 2023-10-12 01:26:18,289 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.528e+02 1.850e+02 2.033e+02 2.192e+02 3.175e+02, threshold=4.065e+02, percent-clipped=0.0 2023-10-12 01:26:21,002 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=895491.3333333334, ans=0.2 2023-10-12 01:26:42,160 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=895584.6666666666, ans=0.0 2023-10-12 01:26:42,200 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=895584.6666666666, ans=0.2 2023-10-12 01:26:49,295 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=895584.6666666666, ans=0.125 2023-10-12 01:26:54,522 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=895631.3333333334, ans=0.0 2023-10-12 01:26:56,656 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.18 vs. 
limit=22.5 2023-10-12 01:26:58,191 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=895631.3333333334, ans=0.5 2023-10-12 01:26:58,263 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=895631.3333333334, ans=0.5 2023-10-12 01:26:59,175 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=895631.3333333334, ans=0.0 2023-10-12 01:27:10,243 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.82 vs. limit=15.0 2023-10-12 01:27:25,361 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=895771.3333333334, ans=0.0 2023-10-12 01:27:27,877 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=895771.3333333334, ans=0.125 2023-10-12 01:27:28,237 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=6.44 vs. limit=15.0 2023-10-12 01:27:40,352 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=895818.0, ans=0.125 2023-10-12 01:27:41,243 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=895818.0, ans=0.5 2023-10-12 01:27:56,493 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=895911.3333333334, ans=0.0 2023-10-12 01:27:57,739 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=895911.3333333334, ans=0.125 2023-10-12 01:28:09,954 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.407e+02 1.690e+02 1.842e+02 2.084e+02 2.627e+02, threshold=3.685e+02, percent-clipped=0.0 2023-10-12 01:28:29,867 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=896004.6666666666, ans=0.125 2023-10-12 01:28:38,847 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.48 vs. limit=15.0 2023-10-12 01:29:18,885 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=896191.3333333334, ans=0.125 2023-10-12 01:29:28,492 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=896238.0, ans=0.125 2023-10-12 01:29:28,886 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.01 vs. limit=15.0 2023-10-12 01:29:43,876 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=896331.3333333334, ans=0.0 2023-10-12 01:29:53,931 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.49 vs. 
limit=15.0 2023-10-12 01:29:56,405 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=896378.0, ans=0.1 2023-10-12 01:30:10,985 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.492e+02 1.809e+02 2.060e+02 2.485e+02 3.626e+02, threshold=4.120e+02, percent-clipped=0.0 2023-10-12 01:30:11,928 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=896424.6666666666, ans=0.2 2023-10-12 01:30:50,592 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.29 vs. limit=15.0 2023-10-12 01:30:53,758 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=896611.3333333334, ans=0.125 2023-10-12 01:31:01,537 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=896611.3333333334, ans=0.0 2023-10-12 01:31:33,808 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=896751.3333333334, ans=0.125 2023-10-12 01:31:37,363 INFO [train.py:1031] (3/4) Epoch 15, batch 1000, loss[loss=0.194, simple_loss=0.2861, pruned_loss=0.051, over 16827.00 frames. ], tot_loss[loss=0.198, simple_loss=0.2878, pruned_loss=0.05411, over 12907716.31 frames. ], batch size: 98, lr: 2.37e-03, grad_scale: 32.0 2023-10-12 01:31:48,667 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=896844.6666666666, ans=0.125 2023-10-12 01:31:56,617 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=896844.6666666666, ans=0.1 2023-10-12 01:32:01,782 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.287e+02 1.658e+02 1.830e+02 2.117e+02 2.891e+02, threshold=3.659e+02, percent-clipped=0.0 2023-10-12 01:32:29,592 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=896984.6666666666, ans=0.125 2023-10-12 01:32:48,136 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=897078.0, ans=0.1 2023-10-12 01:32:57,283 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 01:33:01,896 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.08 vs. limit=15.0 2023-10-12 01:33:09,997 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.64 vs. limit=22.5 2023-10-12 01:33:56,631 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.350e+02 1.722e+02 1.853e+02 2.064e+02 3.079e+02, threshold=3.707e+02, percent-clipped=0.0 2023-10-12 01:34:08,564 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=897404.6666666666, ans=0.0 2023-10-12 01:34:09,984 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.77 vs. 
limit=15.0 2023-10-12 01:34:19,812 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=897451.3333333334, ans=0.2 2023-10-12 01:34:21,680 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.44 vs. limit=22.5 2023-10-12 01:34:23,164 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=897451.3333333334, ans=0.125 2023-10-12 01:34:29,835 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.99 vs. limit=6.0 2023-10-12 01:34:31,809 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.67 vs. limit=6.0 2023-10-12 01:35:20,671 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=897684.6666666666, ans=0.1 2023-10-12 01:35:24,057 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.13 vs. limit=15.0 2023-10-12 01:35:29,598 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=897684.6666666666, ans=0.125 2023-10-12 01:35:29,627 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=897684.6666666666, ans=0.125 2023-10-12 01:35:36,370 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=897731.3333333334, ans=0.0 2023-10-12 01:35:40,056 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=897731.3333333334, ans=0.2 2023-10-12 01:35:40,951 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=897731.3333333334, ans=0.1 2023-10-12 01:35:43,522 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=897778.0, ans=0.125 2023-10-12 01:35:49,261 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=897778.0, ans=0.95 2023-10-12 01:35:58,502 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.404e+02 1.678e+02 1.876e+02 2.275e+02 3.331e+02, threshold=3.751e+02, percent-clipped=0.0 2023-10-12 01:36:20,767 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=10.00 vs. limit=22.5 2023-10-12 01:36:50,560 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=898058.0, ans=0.125 2023-10-12 01:37:03,122 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.26 vs. 
limit=10.0 2023-10-12 01:37:19,226 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=898198.0, ans=0.2 2023-10-12 01:37:27,470 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=898198.0, ans=0.2 2023-10-12 01:37:28,314 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=898198.0, ans=0.125 2023-10-12 01:37:46,011 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.427e+02 1.738e+02 2.035e+02 2.186e+02 3.207e+02, threshold=4.070e+02, percent-clipped=0.0 2023-10-12 01:37:50,658 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=898291.3333333334, ans=0.2 2023-10-12 01:37:52,770 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=898338.0, ans=0.1 2023-10-12 01:37:58,874 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=898338.0, ans=0.1 2023-10-12 01:38:21,887 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 01:38:24,647 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=898478.0, ans=0.0 2023-10-12 01:38:30,951 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=898478.0, ans=0.0 2023-10-12 01:38:32,725 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.min_positive, batch_count=898478.0, ans=0.05 2023-10-12 01:38:45,473 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.45 vs. limit=10.0 2023-10-12 01:39:25,210 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.40 vs. limit=15.0 2023-10-12 01:39:26,728 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=898711.3333333334, ans=0.1 2023-10-12 01:39:32,451 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=898758.0, ans=0.0 2023-10-12 01:39:33,905 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=898758.0, ans=0.1 2023-10-12 01:39:37,016 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.437e+02 1.707e+02 1.935e+02 2.195e+02 3.279e+02, threshold=3.869e+02, percent-clipped=0.0 2023-10-12 01:39:46,353 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.27 vs. 
limit=22.5 2023-10-12 01:39:58,376 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=898851.3333333334, ans=0.0 2023-10-12 01:39:59,942 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=898851.3333333334, ans=0.0 2023-10-12 01:40:15,948 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=898898.0, ans=0.125 2023-10-12 01:40:33,353 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=898991.3333333334, ans=0.125 2023-10-12 01:40:37,238 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=898991.3333333334, ans=0.0 2023-10-12 01:40:53,956 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=899084.6666666666, ans=0.1 2023-10-12 01:41:03,845 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=899131.3333333334, ans=0.125 2023-10-12 01:41:04,623 INFO [train.py:1031] (3/4) Epoch 15, batch 1500, loss[loss=0.1987, simple_loss=0.2924, pruned_loss=0.05256, over 16942.00 frames. ], tot_loss[loss=0.1964, simple_loss=0.2863, pruned_loss=0.05323, over 17310381.10 frames. ], batch size: 123, lr: 2.37e-03, grad_scale: 32.0 2023-10-12 01:41:04,890 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=899131.3333333334, ans=0.125 2023-10-12 01:41:19,465 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=899178.0, ans=0.0 2023-10-12 01:41:30,151 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.326e+02 1.757e+02 1.959e+02 2.168e+02 2.857e+02, threshold=3.918e+02, percent-clipped=0.0 2023-10-12 01:41:33,636 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=899224.6666666666, ans=0.2 2023-10-12 01:42:03,792 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=899364.6666666666, ans=0.1 2023-10-12 01:42:06,905 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.62 vs. limit=15.0 2023-10-12 01:42:10,591 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=899364.6666666666, ans=0.2 2023-10-12 01:42:14,829 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.61 vs. 
limit=6.0 2023-10-12 01:42:20,709 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=899411.3333333334, ans=0.1 2023-10-12 01:42:21,733 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=899411.3333333334, ans=0.2 2023-10-12 01:42:38,597 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=899504.6666666666, ans=0.2 2023-10-12 01:42:43,145 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 01:42:46,753 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=899504.6666666666, ans=0.5 2023-10-12 01:43:02,474 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 01:43:06,424 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=899598.0, ans=0.125 2023-10-12 01:43:07,715 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.92 vs. limit=15.0 2023-10-12 01:43:26,912 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.360e+02 1.720e+02 1.876e+02 2.093e+02 3.022e+02, threshold=3.751e+02, percent-clipped=0.0 2023-10-12 01:43:40,646 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=899738.0, ans=0.0 2023-10-12 01:43:54,700 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.10 vs. limit=6.0 2023-10-12 01:44:06,919 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=899831.3333333334, ans=0.025 2023-10-12 01:44:07,629 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=899831.3333333334, ans=0.5 2023-10-12 01:44:09,277 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=899831.3333333334, ans=0.1 2023-10-12 01:44:10,056 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=899831.3333333334, ans=0.05 2023-10-12 01:44:47,931 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_positive, batch_count=899971.3333333334, ans=0.05 2023-10-12 01:44:56,960 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=900018.0, ans=0.0 2023-10-12 01:45:11,105 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=900111.3333333334, ans=0.125 2023-10-12 01:45:13,115 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.30 vs. 
limit=15.0 2023-10-12 01:45:18,663 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=900111.3333333334, ans=0.0 2023-10-12 01:45:25,578 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.342e+02 1.691e+02 1.868e+02 2.093e+02 3.176e+02, threshold=3.736e+02, percent-clipped=0.0 2023-10-12 01:45:37,587 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=900204.6666666666, ans=0.125 2023-10-12 01:45:44,330 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=900251.3333333334, ans=0.125 2023-10-12 01:45:48,847 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=900251.3333333334, ans=0.0 2023-10-12 01:45:49,692 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=900251.3333333334, ans=0.0 2023-10-12 01:45:56,654 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=6.24 vs. limit=15.0 2023-10-12 01:46:38,704 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.64 vs. limit=15.0 2023-10-12 01:46:45,324 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=900484.6666666666, ans=0.125 2023-10-12 01:46:48,713 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=900484.6666666666, ans=0.125 2023-10-12 01:46:51,929 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.52 vs. 
limit=15.0 2023-10-12 01:46:54,130 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=900531.3333333334, ans=0.0 2023-10-12 01:47:07,076 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=900578.0, ans=0.0 2023-10-12 01:47:08,950 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=900578.0, ans=0.0 2023-10-12 01:47:19,086 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=900624.6666666666, ans=0.125 2023-10-12 01:47:22,296 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.436e+02 1.742e+02 1.876e+02 1.985e+02 2.725e+02, threshold=3.753e+02, percent-clipped=0.0 2023-10-12 01:47:30,031 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=900671.3333333334, ans=0.125 2023-10-12 01:47:30,898 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=900671.3333333334, ans=0.0 2023-10-12 01:47:54,883 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=900764.6666666666, ans=0.125 2023-10-12 01:48:06,146 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=900811.3333333334, ans=0.09899494936611666 2023-10-12 01:48:20,150 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=900858.0, ans=0.2 2023-10-12 01:48:22,477 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=900858.0, ans=0.125 2023-10-12 01:48:29,155 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.54 vs. 
limit=12.0 2023-10-12 01:48:33,183 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=900904.6666666666, ans=0.125 2023-10-12 01:48:46,060 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=900951.3333333334, ans=0.125 2023-10-12 01:48:54,979 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=900998.0, ans=0.2 2023-10-12 01:48:57,647 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=900998.0, ans=0.125 2023-10-12 01:49:11,572 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=901091.3333333334, ans=0.0 2023-10-12 01:49:12,367 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=901091.3333333334, ans=0.0 2023-10-12 01:49:14,526 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.466e+02 1.864e+02 2.103e+02 2.403e+02 3.132e+02, threshold=4.206e+02, percent-clipped=0.0 2023-10-12 01:49:58,795 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=901231.3333333334, ans=0.125 2023-10-12 01:50:32,339 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.39 vs. limit=6.0 2023-10-12 01:50:42,251 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.61 vs. limit=6.0 2023-10-12 01:50:52,215 INFO [train.py:1031] (3/4) Epoch 15, batch 2000, loss[loss=0.1582, simple_loss=0.2606, pruned_loss=0.02785, over 16889.00 frames. ], tot_loss[loss=0.1968, simple_loss=0.2869, pruned_loss=0.05332, over 20736623.64 frames. ], batch size: 104, lr: 2.37e-03, grad_scale: 32.0 2023-10-12 01:51:17,608 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=901558.0, ans=0.125 2023-10-12 01:51:21,761 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.339e+02 1.714e+02 1.842e+02 2.115e+02 2.780e+02, threshold=3.684e+02, percent-clipped=0.0 2023-10-12 01:51:52,772 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=901651.3333333334, ans=0.0 2023-10-12 01:52:08,827 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=901698.0, ans=0.2 2023-10-12 01:52:10,738 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=901744.6666666666, ans=0.0 2023-10-12 01:52:26,234 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.53 vs. 
limit=15.0 2023-10-12 01:52:31,201 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=901791.3333333334, ans=0.09899494936611666 2023-10-12 01:52:32,866 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=901791.3333333334, ans=0.125 2023-10-12 01:52:48,872 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=901884.6666666666, ans=0.2 2023-10-12 01:52:57,408 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=901884.6666666666, ans=0.07 2023-10-12 01:53:08,801 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=901931.3333333334, ans=0.0 2023-10-12 01:53:44,755 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.339e+02 1.644e+02 1.803e+02 2.040e+02 3.306e+02, threshold=3.606e+02, percent-clipped=0.0 2023-10-12 01:54:02,267 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=902071.3333333334, ans=0.125 2023-10-12 01:54:06,967 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=7.13 vs. limit=15.0 2023-10-12 01:54:10,541 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.53 vs. limit=10.0 2023-10-12 01:54:19,090 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=902164.6666666666, ans=0.1 2023-10-12 01:54:39,935 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=902211.3333333334, ans=0.1 2023-10-12 01:54:46,646 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.32 vs. 
limit=22.5 2023-10-12 01:54:54,411 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=902304.6666666666, ans=0.0 2023-10-12 01:55:11,891 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=902351.3333333334, ans=0.125 2023-10-12 01:55:25,491 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=902444.6666666666, ans=0.1 2023-10-12 01:55:26,996 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=902444.6666666666, ans=0.1 2023-10-12 01:55:33,225 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=902444.6666666666, ans=0.125 2023-10-12 01:55:40,994 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.502e+02 1.780e+02 1.980e+02 2.282e+02 3.090e+02, threshold=3.960e+02, percent-clipped=0.0 2023-10-12 01:55:48,784 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=902538.0, ans=0.125 2023-10-12 01:55:49,946 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=902538.0, ans=0.0 2023-10-12 01:55:53,605 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.94 vs. limit=12.0 2023-10-12 01:56:22,498 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=902678.0, ans=0.125 2023-10-12 01:56:25,207 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=902678.0, ans=0.1 2023-10-12 01:56:26,169 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=902678.0, ans=0.125 2023-10-12 01:56:27,036 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=902678.0, ans=0.0 2023-10-12 01:56:34,239 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=902724.6666666666, ans=0.125 2023-10-12 01:56:46,006 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=902771.3333333334, ans=0.95 2023-10-12 01:56:48,346 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.65 vs. limit=22.5 2023-10-12 01:57:18,923 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=902911.3333333334, ans=0.0 2023-10-12 01:57:30,340 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.510e+02 1.751e+02 1.962e+02 2.180e+02 3.937e+02, threshold=3.925e+02, percent-clipped=0.0 2023-10-12 01:57:34,050 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=902958.0, ans=0.125 2023-10-12 01:57:36,301 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.22 vs. 
limit=15.0 2023-10-12 01:57:38,322 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=903004.6666666666, ans=0.125 2023-10-12 01:58:15,022 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=903144.6666666666, ans=0.1 2023-10-12 01:58:37,921 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=903238.0, ans=0.125 2023-10-12 01:58:43,248 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=903238.0, ans=0.125 2023-10-12 01:58:46,162 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.14 vs. limit=15.0 2023-10-12 01:58:47,440 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=903284.6666666666, ans=0.0 2023-10-12 01:58:58,685 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=903331.3333333334, ans=0.125 2023-10-12 01:58:59,789 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=903331.3333333334, ans=0.0 2023-10-12 01:59:04,894 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.86 vs. limit=15.0 2023-10-12 01:59:11,591 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=903378.0, ans=0.1 2023-10-12 01:59:15,225 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=903378.0, ans=0.0 2023-10-12 01:59:18,226 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=11.92 vs. limit=22.5 2023-10-12 01:59:20,958 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=10.59 vs. 
limit=15.0 2023-10-12 01:59:22,287 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 01:59:22,358 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=903424.6666666666, ans=0.125 2023-10-12 01:59:23,827 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.430e+02 1.747e+02 1.925e+02 2.113e+02 2.789e+02, threshold=3.850e+02, percent-clipped=0.0 2023-10-12 01:59:51,686 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=903564.6666666666, ans=0.035 2023-10-12 02:00:00,630 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=903611.3333333334, ans=0.09899494936611666 2023-10-12 02:00:08,155 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=903611.3333333334, ans=0.125 2023-10-12 02:00:12,825 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=903658.0, ans=0.0 2023-10-12 02:00:16,400 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=903658.0, ans=0.0 2023-10-12 02:00:23,432 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=903704.6666666666, ans=0.125 2023-10-12 02:00:25,396 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=903704.6666666666, ans=0.125 2023-10-12 02:00:29,470 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=903704.6666666666, ans=0.125 2023-10-12 02:00:49,010 INFO [train.py:1031] (3/4) Epoch 15, batch 2500, loss[loss=0.187, simple_loss=0.2826, pruned_loss=0.04564, over 16473.00 frames. ], tot_loss[loss=0.197, simple_loss=0.2871, pruned_loss=0.05344, over 23416620.01 frames. ], batch size: 50, lr: 2.36e-03, grad_scale: 32.0 2023-10-12 02:00:59,729 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=903844.6666666666, ans=0.07 2023-10-12 02:01:10,318 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=903891.3333333334, ans=0.125 2023-10-12 02:01:14,024 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.487e+02 1.720e+02 1.887e+02 2.148e+02 2.728e+02, threshold=3.774e+02, percent-clipped=0.0 2023-10-12 02:01:19,287 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=903938.0, ans=0.125 2023-10-12 02:01:24,444 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.48 vs. limit=6.0 2023-10-12 02:01:25,098 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=903938.0, ans=0.125 2023-10-12 02:01:28,712 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=903938.0, ans=0.0 2023-10-12 02:01:42,817 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.97 vs. 
limit=15.0 2023-10-12 02:02:32,718 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=904218.0, ans=0.125 2023-10-12 02:02:34,762 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=904218.0, ans=0.0 2023-10-12 02:02:52,524 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=904311.3333333334, ans=0.0 2023-10-12 02:03:06,591 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.344e+02 1.753e+02 1.966e+02 2.340e+02 3.159e+02, threshold=3.932e+02, percent-clipped=0.0 2023-10-12 02:03:09,376 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=904358.0, ans=0.125 2023-10-12 02:03:10,798 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=904404.6666666666, ans=0.0 2023-10-12 02:03:12,681 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 02:03:13,579 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=904404.6666666666, ans=0.125 2023-10-12 02:03:16,837 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.24 vs. limit=15.0 2023-10-12 02:03:23,171 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=904451.3333333334, ans=0.2 2023-10-12 02:03:25,654 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=904451.3333333334, ans=0.0 2023-10-12 02:03:31,406 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=904498.0, ans=0.0 2023-10-12 02:03:45,702 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=904544.6666666666, ans=0.0 2023-10-12 02:03:59,882 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=904591.3333333334, ans=0.1 2023-10-12 02:04:03,938 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=904638.0, ans=0.125 2023-10-12 02:04:04,739 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=904638.0, ans=0.125 2023-10-12 02:04:27,479 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=904731.3333333334, ans=0.0 2023-10-12 02:04:33,359 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=904731.3333333334, ans=0.125 2023-10-12 02:04:42,831 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.97 vs. limit=22.5 2023-10-12 02:04:49,301 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.34 vs. 
limit=22.5 2023-10-12 02:04:55,351 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.41 vs. limit=22.5 2023-10-12 02:04:59,652 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.440e+02 1.675e+02 1.855e+02 2.029e+02 2.893e+02, threshold=3.710e+02, percent-clipped=0.0 2023-10-12 02:05:05,578 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=904871.3333333334, ans=0.0 2023-10-12 02:05:08,089 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=904871.3333333334, ans=0.0 2023-10-12 02:05:50,743 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=905011.3333333334, ans=0.1 2023-10-12 02:06:00,319 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=905058.0, ans=0.2 2023-10-12 02:06:16,958 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=905151.3333333334, ans=0.125 2023-10-12 02:06:25,946 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=905151.3333333334, ans=0.125 2023-10-12 02:06:35,371 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=905198.0, ans=0.125 2023-10-12 02:06:48,424 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=905244.6666666666, ans=0.09899494936611666 2023-10-12 02:07:00,264 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.349e+02 1.628e+02 1.792e+02 1.996e+02 3.622e+02, threshold=3.584e+02, percent-clipped=0.0 2023-10-12 02:07:10,393 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=7.37 vs. limit=12.0 2023-10-12 02:07:21,599 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.29 vs. limit=15.0 2023-10-12 02:07:22,738 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=905384.6666666666, ans=0.09899494936611666 2023-10-12 02:07:26,937 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.86 vs. limit=15.0 2023-10-12 02:07:43,402 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=905478.0, ans=0.2 2023-10-12 02:07:44,857 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.02 vs. 
limit=15.0 2023-10-12 02:07:46,649 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=905478.0, ans=0.0 2023-10-12 02:08:05,944 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=905524.6666666666, ans=0.2 2023-10-12 02:08:35,455 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=905664.6666666666, ans=0.125 2023-10-12 02:08:43,383 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=905664.6666666666, ans=0.125 2023-10-12 02:09:03,436 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=905758.0, ans=0.125 2023-10-12 02:09:05,942 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.406e+02 1.748e+02 1.913e+02 2.184e+02 3.105e+02, threshold=3.826e+02, percent-clipped=0.0 2023-10-12 02:09:17,519 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.60 vs. limit=15.0 2023-10-12 02:09:32,511 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=905898.0, ans=0.125 2023-10-12 02:09:49,328 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.92 vs. limit=15.0 2023-10-12 02:09:51,009 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=905944.6666666666, ans=0.0 2023-10-12 02:10:16,497 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.78 vs. limit=10.0 2023-10-12 02:10:24,463 INFO [train.py:1031] (3/4) Epoch 15, batch 3000, loss[loss=0.1951, simple_loss=0.2887, pruned_loss=0.05075, over 16941.00 frames. ], tot_loss[loss=0.1967, simple_loss=0.2864, pruned_loss=0.05351, over 25467773.30 frames. ], batch size: 138, lr: 2.36e-03, grad_scale: 16.0 2023-10-12 02:10:39,054 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=906178.0, ans=0.07 2023-10-12 02:10:53,202 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=906224.6666666666, ans=0.1 2023-10-12 02:10:53,867 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.429e+02 1.740e+02 1.984e+02 2.216e+02 3.623e+02, threshold=3.968e+02, percent-clipped=0.0 2023-10-12 02:10:59,840 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=906271.3333333334, ans=0.04949747468305833 2023-10-12 02:11:07,365 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=906318.0, ans=0.125 2023-10-12 02:11:29,829 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.17 vs. 
limit=15.0 2023-10-12 02:11:31,020 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=906364.6666666666, ans=0.2 2023-10-12 02:11:42,147 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=906411.3333333334, ans=0.125 2023-10-12 02:12:07,521 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=906551.3333333334, ans=0.0 2023-10-12 02:12:10,916 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=906551.3333333334, ans=0.125 2023-10-12 02:12:13,618 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=906551.3333333334, ans=0.125 2023-10-12 02:12:23,592 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=906598.0, ans=0.1 2023-10-12 02:12:47,205 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=906691.3333333334, ans=0.0 2023-10-12 02:12:50,038 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.30 vs. limit=15.0 2023-10-12 02:12:54,496 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.564e+02 1.816e+02 1.994e+02 2.299e+02 3.375e+02, threshold=3.988e+02, percent-clipped=0.0 2023-10-12 02:13:07,967 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.57 vs. limit=12.0 2023-10-12 02:13:14,179 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=906784.6666666666, ans=0.1 2023-10-12 02:13:15,171 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=906784.6666666666, ans=0.0 2023-10-12 02:13:17,268 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.min_positive, batch_count=906784.6666666666, ans=0.05 2023-10-12 02:13:34,025 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=906878.0, ans=0.125 2023-10-12 02:13:53,951 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=906971.3333333334, ans=0.125 2023-10-12 02:14:03,980 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=907018.0, ans=0.0 2023-10-12 02:14:08,032 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.31 vs. 
limit=15.0 2023-10-12 02:14:11,329 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=907018.0, ans=0.125 2023-10-12 02:14:26,355 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=907111.3333333334, ans=0.1 2023-10-12 02:14:39,016 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=907158.0, ans=0.125 2023-10-12 02:14:50,243 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.373e+02 1.660e+02 1.817e+02 2.016e+02 3.077e+02, threshold=3.634e+02, percent-clipped=0.0 2023-10-12 02:14:54,134 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 02:14:58,239 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=907204.6666666666, ans=0.0 2023-10-12 02:15:00,230 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=907204.6666666666, ans=0.0 2023-10-12 02:15:10,086 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=907251.3333333334, ans=0.035 2023-10-12 02:15:43,861 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.26 vs. limit=15.0 2023-10-12 02:15:49,211 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=907391.3333333334, ans=0.125 2023-10-12 02:15:56,921 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=907438.0, ans=0.0 2023-10-12 02:16:21,252 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=907531.3333333334, ans=0.125 2023-10-12 02:16:46,854 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=907624.6666666666, ans=0.0 2023-10-12 02:16:47,548 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.402e+02 1.697e+02 1.826e+02 2.010e+02 2.962e+02, threshold=3.652e+02, percent-clipped=0.0 2023-10-12 02:16:52,151 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=907671.3333333334, ans=0.125 2023-10-12 02:16:53,229 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=907671.3333333334, ans=10.0 2023-10-12 02:16:58,678 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=907671.3333333334, ans=0.125 2023-10-12 02:17:07,688 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=907718.0, ans=0.0 2023-10-12 02:17:26,297 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.08 vs. limit=6.0 2023-10-12 02:17:41,975 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.58 vs. 
limit=15.0 2023-10-12 02:17:55,761 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=907904.6666666666, ans=0.0 2023-10-12 02:18:13,074 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.22 vs. limit=12.0 2023-10-12 02:18:29,870 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=908044.6666666666, ans=0.2 2023-10-12 02:18:35,464 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=908091.3333333334, ans=0.125 2023-10-12 02:18:45,528 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.443e+02 1.766e+02 1.959e+02 2.298e+02 3.074e+02, threshold=3.919e+02, percent-clipped=0.0 2023-10-12 02:18:49,354 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.85 vs. limit=15.0 2023-10-12 02:18:52,869 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=908138.0, ans=0.5 2023-10-12 02:19:09,672 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=908231.3333333334, ans=0.0 2023-10-12 02:19:24,405 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=3.68 vs. limit=15.0 2023-10-12 02:19:44,766 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.20 vs. limit=22.5 2023-10-12 02:19:59,237 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.38 vs. limit=12.0 2023-10-12 02:20:05,087 INFO [train.py:1031] (3/4) Epoch 15, batch 3500, loss[loss=0.1983, simple_loss=0.2934, pruned_loss=0.05159, over 16912.00 frames. ], tot_loss[loss=0.1969, simple_loss=0.2864, pruned_loss=0.05375, over 27063719.00 frames. ], batch size: 138, lr: 2.36e-03, grad_scale: 16.0 2023-10-12 02:20:30,250 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.74 vs. 
limit=10.0 2023-10-12 02:20:32,292 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=908558.0, ans=0.1 2023-10-12 02:20:33,207 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=908558.0, ans=0.0 2023-10-12 02:20:34,992 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.447e+02 1.775e+02 1.931e+02 2.165e+02 3.104e+02, threshold=3.862e+02, percent-clipped=0.0 2023-10-12 02:20:35,263 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=908558.0, ans=0.125 2023-10-12 02:20:58,089 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=908651.3333333334, ans=0.0 2023-10-12 02:21:03,156 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=908698.0, ans=0.0 2023-10-12 02:21:05,701 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=908698.0, ans=0.0 2023-10-12 02:21:10,686 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=908744.6666666666, ans=0.125 2023-10-12 02:21:33,720 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.53 vs. limit=15.0 2023-10-12 02:21:44,153 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=908838.0, ans=0.125 2023-10-12 02:21:45,223 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=908838.0, ans=0.05 2023-10-12 02:22:18,072 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=908978.0, ans=0.125 2023-10-12 02:22:33,938 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.466e+02 1.704e+02 1.840e+02 2.033e+02 2.783e+02, threshold=3.679e+02, percent-clipped=0.0 2023-10-12 02:22:35,870 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.70 vs. limit=22.5 2023-10-12 02:22:48,305 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=909118.0, ans=0.125 2023-10-12 02:22:58,537 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.95 vs. limit=6.0 2023-10-12 02:23:11,730 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=909211.3333333334, ans=0.0 2023-10-12 02:23:28,121 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.52 vs. 
limit=10.0 2023-10-12 02:23:31,374 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=909258.0, ans=0.95 2023-10-12 02:23:32,303 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=909258.0, ans=0.125 2023-10-12 02:24:12,703 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=909444.6666666666, ans=0.125 2023-10-12 02:24:34,322 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.428e+02 1.619e+02 1.761e+02 2.000e+02 3.477e+02, threshold=3.522e+02, percent-clipped=0.0 2023-10-12 02:24:37,989 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=909538.0, ans=0.125 2023-10-12 02:24:45,795 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.77 vs. limit=6.0 2023-10-12 02:25:27,654 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=909724.6666666666, ans=0.1 2023-10-12 02:25:27,781 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=909724.6666666666, ans=0.1 2023-10-12 02:25:28,822 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 02:25:39,431 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=909771.3333333334, ans=0.0 2023-10-12 02:25:45,421 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=909771.3333333334, ans=0.1 2023-10-12 02:25:50,809 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=909818.0, ans=0.2 2023-10-12 02:26:06,390 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=909864.6666666666, ans=0.125 2023-10-12 02:26:09,711 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=909864.6666666666, ans=0.125 2023-10-12 02:26:28,257 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=909958.0, ans=0.125 2023-10-12 02:26:33,531 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.528e+02 1.867e+02 2.074e+02 2.353e+02 3.207e+02, threshold=4.148e+02, percent-clipped=0.0 2023-10-12 02:26:40,542 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=910004.6666666666, ans=0.1 2023-10-12 02:26:56,753 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=910098.0, ans=0.125 2023-10-12 02:26:57,898 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.42 vs. 
limit=22.5 2023-10-12 02:27:11,907 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=910144.6666666666, ans=0.125 2023-10-12 02:27:15,781 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=910144.6666666666, ans=0.2 2023-10-12 02:27:18,203 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=910144.6666666666, ans=0.0 2023-10-12 02:27:22,603 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=910191.3333333334, ans=0.125 2023-10-12 02:27:32,790 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=910238.0, ans=0.07 2023-10-12 02:27:34,950 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.04 vs. limit=15.0 2023-10-12 02:27:49,714 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=910284.6666666666, ans=0.0 2023-10-12 02:27:50,640 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=910284.6666666666, ans=0.125 2023-10-12 02:27:56,111 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=910331.3333333334, ans=0.0 2023-10-12 02:27:57,155 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=910331.3333333334, ans=0.0 2023-10-12 02:28:20,574 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=910424.6666666666, ans=0.1 2023-10-12 02:28:23,962 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.328e+02 1.713e+02 1.882e+02 2.049e+02 3.259e+02, threshold=3.765e+02, percent-clipped=0.0 2023-10-12 02:28:39,317 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.86 vs. limit=15.0 2023-10-12 02:28:45,862 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=910518.0, ans=0.2 2023-10-12 02:28:46,907 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=910564.6666666666, ans=0.0 2023-10-12 02:28:49,099 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.54 vs. limit=12.0 2023-10-12 02:28:52,619 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.42 vs. limit=22.5 2023-10-12 02:29:00,910 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=910611.3333333334, ans=0.125 2023-10-12 02:29:10,614 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten.whitening_limit, batch_count=910611.3333333334, ans=15.0 2023-10-12 02:29:22,294 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.24 vs. 
limit=6.0 2023-10-12 02:29:28,033 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=910704.6666666666, ans=0.125 2023-10-12 02:29:36,098 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=910751.3333333334, ans=0.0 2023-10-12 02:29:38,745 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=910751.3333333334, ans=0.125 2023-10-12 02:29:46,015 INFO [train.py:1031] (3/4) Epoch 15, batch 4000, loss[loss=0.251, simple_loss=0.3289, pruned_loss=0.08656, over 15578.00 frames. ], tot_loss[loss=0.1968, simple_loss=0.286, pruned_loss=0.05377, over 28328551.12 frames. ], batch size: 350, lr: 2.35e-03, grad_scale: 32.0 2023-10-12 02:29:46,643 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=910798.0, ans=0.0 2023-10-12 02:29:48,411 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=910798.0, ans=0.125 2023-10-12 02:29:48,561 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=910798.0, ans=0.5 2023-10-12 02:30:03,319 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.76 vs. limit=15.0 2023-10-12 02:30:05,232 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=910844.6666666666, ans=0.125 2023-10-12 02:30:10,504 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=910891.3333333334, ans=0.125 2023-10-12 02:30:20,105 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.467e+02 1.688e+02 1.861e+02 2.085e+02 3.110e+02, threshold=3.721e+02, percent-clipped=0.0 2023-10-12 02:30:22,534 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=910938.0, ans=0.1 2023-10-12 02:30:26,659 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.07 vs. limit=15.0 2023-10-12 02:30:41,352 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=910984.6666666666, ans=0.0 2023-10-12 02:30:51,900 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=911031.3333333334, ans=0.0 2023-10-12 02:31:09,384 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=911124.6666666666, ans=0.0 2023-10-12 02:31:12,310 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=911124.6666666666, ans=0.125 2023-10-12 02:31:17,332 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=911171.3333333334, ans=0.125 2023-10-12 02:31:42,192 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.22 vs. 
limit=15.0 2023-10-12 02:31:46,627 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.24 vs. limit=10.0 2023-10-12 02:31:46,802 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.99 vs. limit=12.0 2023-10-12 02:31:59,699 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=911311.3333333334, ans=0.0 2023-10-12 02:32:14,914 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.364e+02 1.722e+02 1.851e+02 2.066e+02 3.254e+02, threshold=3.702e+02, percent-clipped=0.0 2023-10-12 02:32:19,558 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=911404.6666666666, ans=0.2 2023-10-12 02:32:19,981 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.91 vs. limit=15.0 2023-10-12 02:32:40,298 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.59 vs. limit=22.5 2023-10-12 02:32:50,686 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=911544.6666666666, ans=0.125 2023-10-12 02:33:00,046 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.25 vs. limit=15.0 2023-10-12 02:33:08,318 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=911591.3333333334, ans=0.1 2023-10-12 02:33:31,862 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=911638.0, ans=0.125 2023-10-12 02:33:34,276 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=911638.0, ans=0.125 2023-10-12 02:34:06,159 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=911778.0, ans=0.2 2023-10-12 02:34:13,826 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=911824.6666666666, ans=0.125 2023-10-12 02:34:15,088 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=911824.6666666666, ans=0.0 2023-10-12 02:34:16,378 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=911824.6666666666, ans=0.2 2023-10-12 02:34:23,295 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.350e+02 1.663e+02 1.788e+02 2.049e+02 3.231e+02, threshold=3.577e+02, percent-clipped=0.0 2023-10-12 02:34:25,836 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=911871.3333333334, ans=0.1 2023-10-12 02:34:28,823 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=911871.3333333334, ans=0.125 2023-10-12 02:34:44,484 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=911918.0, ans=0.2 2023-10-12 02:34:53,387 INFO [scaling.py:199] (3/4) 
ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=911964.6666666666, ans=0.125 2023-10-12 02:34:56,992 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=912011.3333333334, ans=0.0 2023-10-12 02:35:07,150 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=912058.0, ans=0.0 2023-10-12 02:35:10,374 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 02:35:27,602 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=912104.6666666666, ans=0.125 2023-10-12 02:35:47,481 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=912198.0, ans=0.125 2023-10-12 02:35:47,844 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.18 vs. limit=12.0 2023-10-12 02:36:01,440 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.43 vs. limit=15.0 2023-10-12 02:36:13,701 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.365e+02 1.709e+02 1.935e+02 2.177e+02 3.886e+02, threshold=3.870e+02, percent-clipped=2.0 2023-10-12 02:36:37,070 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=912384.6666666666, ans=0.125 2023-10-12 02:36:40,740 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=912431.3333333334, ans=0.0 2023-10-12 02:36:41,077 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.40 vs. limit=22.5 2023-10-12 02:36:42,766 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=912431.3333333334, ans=0.07 2023-10-12 02:36:49,184 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.18 vs. limit=15.0 2023-10-12 02:37:34,453 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=912664.6666666666, ans=0.125 2023-10-12 02:37:34,649 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.49 vs. 
limit=15.0 2023-10-12 02:37:48,775 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=912711.3333333334, ans=0.0 2023-10-12 02:38:08,090 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.434e+02 1.815e+02 1.963e+02 2.194e+02 3.165e+02, threshold=3.926e+02, percent-clipped=0.0 2023-10-12 02:38:17,525 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=912804.6666666666, ans=0.125 2023-10-12 02:38:32,670 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=912851.3333333334, ans=0.125 2023-10-12 02:38:37,703 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=912851.3333333334, ans=0.1 2023-10-12 02:39:09,734 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=912991.3333333334, ans=0.0 2023-10-12 02:39:10,623 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=912991.3333333334, ans=0.125 2023-10-12 02:39:26,058 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=913084.6666666666, ans=0.09899494936611666 2023-10-12 02:39:40,356 INFO [train.py:1031] (3/4) Epoch 15, batch 4500, loss[loss=0.164, simple_loss=0.2398, pruned_loss=0.04403, over 13039.00 frames. ], tot_loss[loss=0.1965, simple_loss=0.2862, pruned_loss=0.05337, over 29322416.36 frames. ], batch size: 440, lr: 2.35e-03, grad_scale: 32.0 2023-10-12 02:39:46,154 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=913131.3333333334, ans=0.025 2023-10-12 02:40:12,019 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.456e+02 1.653e+02 1.778e+02 1.977e+02 2.656e+02, threshold=3.556e+02, percent-clipped=0.0 2023-10-12 02:40:16,645 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=913271.3333333334, ans=0.2 2023-10-12 02:40:40,849 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.10 vs. 
limit=22.5 2023-10-12 02:40:58,735 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=913458.0, ans=0.0 2023-10-12 02:41:22,687 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=913551.3333333334, ans=0.125 2023-10-12 02:41:27,634 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=913551.3333333334, ans=0.1 2023-10-12 02:41:31,694 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=913598.0, ans=0.125 2023-10-12 02:41:49,568 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=913644.6666666666, ans=0.125 2023-10-12 02:41:52,328 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=913691.3333333334, ans=0.0 2023-10-12 02:41:58,438 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=913691.3333333334, ans=0.1 2023-10-12 02:41:59,348 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=913691.3333333334, ans=0.125 2023-10-12 02:42:01,271 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.492e+02 1.762e+02 1.937e+02 2.212e+02 2.712e+02, threshold=3.874e+02, percent-clipped=0.0 2023-10-12 02:42:07,952 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=913738.0, ans=0.0 2023-10-12 02:42:16,364 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=913784.6666666666, ans=0.2 2023-10-12 02:42:25,311 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=913831.3333333334, ans=0.1 2023-10-12 02:42:25,365 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=913831.3333333334, ans=0.2 2023-10-12 02:42:28,752 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=913831.3333333334, ans=0.0 2023-10-12 02:42:29,528 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=913831.3333333334, ans=0.125 2023-10-12 02:43:04,299 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=913971.3333333334, ans=0.2 2023-10-12 02:43:05,269 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=913971.3333333334, ans=0.125 2023-10-12 02:43:13,620 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=914018.0, ans=0.125 2023-10-12 02:43:38,163 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=914111.3333333334, ans=0.0 2023-10-12 02:43:44,684 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=914158.0, ans=0.125 2023-10-12 02:43:50,045 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.425e+02 1.663e+02 1.861e+02 2.060e+02 4.129e+02, 
threshold=3.722e+02, percent-clipped=1.0 2023-10-12 02:44:02,784 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=914251.3333333334, ans=0.0 2023-10-12 02:44:07,147 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=914251.3333333334, ans=0.125 2023-10-12 02:44:10,829 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=914251.3333333334, ans=0.125 2023-10-12 02:44:19,852 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=914298.0, ans=0.125 2023-10-12 02:44:26,962 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 02:44:30,801 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=914344.6666666666, ans=0.1 2023-10-12 02:44:40,202 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.85 vs. limit=12.0 2023-10-12 02:44:56,709 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=914484.6666666666, ans=0.1 2023-10-12 02:45:14,450 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=914531.3333333334, ans=0.1 2023-10-12 02:45:30,309 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.06 vs. limit=6.0 2023-10-12 02:45:41,014 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=914624.6666666666, ans=0.1 2023-10-12 02:45:43,828 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.73 vs. limit=15.0 2023-10-12 02:45:45,789 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.471e+02 1.704e+02 1.897e+02 2.105e+02 2.995e+02, threshold=3.795e+02, percent-clipped=0.0 2023-10-12 02:45:46,203 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=914671.3333333334, ans=0.125 2023-10-12 02:46:09,204 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.20 vs. limit=10.0 2023-10-12 02:46:25,109 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.31 vs. 
limit=15.0 2023-10-12 02:46:44,461 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=914904.6666666666, ans=0.1 2023-10-12 02:46:54,067 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=914951.3333333334, ans=0.0 2023-10-12 02:47:17,407 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=914998.0, ans=0.025 2023-10-12 02:47:29,491 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=915044.6666666666, ans=0.125 2023-10-12 02:47:39,587 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=915091.3333333334, ans=0.2 2023-10-12 02:47:44,022 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.443e+02 1.682e+02 1.859e+02 2.153e+02 2.870e+02, threshold=3.718e+02, percent-clipped=0.0 2023-10-12 02:47:53,030 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=915138.0, ans=0.125 2023-10-12 02:48:04,842 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.08 vs. limit=10.0 2023-10-12 02:48:09,193 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=915231.3333333334, ans=0.2 2023-10-12 02:48:13,902 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=915231.3333333334, ans=0.125 2023-10-12 02:48:16,943 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=915231.3333333334, ans=0.0 2023-10-12 02:48:24,966 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=915278.0, ans=0.0 2023-10-12 02:48:34,296 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.68 vs. limit=12.0 2023-10-12 02:48:56,165 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=915418.0, ans=0.125 2023-10-12 02:49:06,551 INFO [train.py:1031] (3/4) Epoch 15, batch 5000, loss[loss=0.2034, simple_loss=0.2862, pruned_loss=0.06028, over 16626.00 frames. ], tot_loss[loss=0.1963, simple_loss=0.286, pruned_loss=0.05332, over 30104119.64 frames. ], batch size: 241, lr: 2.35e-03, grad_scale: 32.0 2023-10-12 02:49:17,324 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.08 vs. limit=22.5 2023-10-12 02:49:32,659 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=915558.0, ans=0.125 2023-10-12 02:49:34,421 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=915558.0, ans=0.125 2023-10-12 02:49:39,200 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.83 vs. 
limit=12.0 2023-10-12 02:49:41,672 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.370e+02 1.707e+02 1.868e+02 2.029e+02 3.211e+02, threshold=3.737e+02, percent-clipped=0.0 2023-10-12 02:49:54,369 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=915651.3333333334, ans=0.125 2023-10-12 02:50:09,435 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=915698.0, ans=0.2 2023-10-12 02:50:13,149 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=915744.6666666666, ans=0.1 2023-10-12 02:50:19,183 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=915744.6666666666, ans=0.125 2023-10-12 02:50:24,371 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=915791.3333333334, ans=0.125 2023-10-12 02:50:24,641 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.64 vs. limit=15.0 2023-10-12 02:50:30,442 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=915791.3333333334, ans=0.125 2023-10-12 02:50:58,346 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=915884.6666666666, ans=0.125 2023-10-12 02:51:00,426 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=915884.6666666666, ans=0.0 2023-10-12 02:51:14,926 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=915978.0, ans=0.125 2023-10-12 02:51:18,698 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=915978.0, ans=0.125 2023-10-12 02:51:30,213 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=916024.6666666666, ans=0.125 2023-10-12 02:51:36,056 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.380e+02 1.669e+02 1.787e+02 1.999e+02 3.159e+02, threshold=3.573e+02, percent-clipped=0.0 2023-10-12 02:51:41,905 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.37 vs. 
limit=6.0 2023-10-12 02:52:23,620 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=916258.0, ans=0.1 2023-10-12 02:52:33,125 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=916304.6666666666, ans=0.0 2023-10-12 02:52:42,248 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=916351.3333333334, ans=0.125 2023-10-12 02:52:45,363 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=916351.3333333334, ans=0.125 2023-10-12 02:52:52,815 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=916398.0, ans=0.1 2023-10-12 02:52:53,793 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=916398.0, ans=0.0 2023-10-12 02:52:54,610 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=916398.0, ans=0.125 2023-10-12 02:52:55,694 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=916398.0, ans=0.2 2023-10-12 02:53:14,486 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=916491.3333333334, ans=0.1 2023-10-12 02:53:14,929 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.57 vs. limit=5.0 2023-10-12 02:53:25,134 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.470e+02 1.721e+02 1.949e+02 2.277e+02 3.151e+02, threshold=3.899e+02, percent-clipped=0.0 2023-10-12 02:53:27,348 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=916538.0, ans=0.0 2023-10-12 02:53:33,212 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=916538.0, ans=0.125 2023-10-12 02:53:41,152 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=916584.6666666666, ans=0.1 2023-10-12 02:53:42,068 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=916584.6666666666, ans=0.125 2023-10-12 02:54:16,833 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.78 vs. limit=15.0 2023-10-12 02:54:18,864 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=916724.6666666666, ans=0.0 2023-10-12 02:54:31,018 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=916771.3333333334, ans=0.0 2023-10-12 02:54:36,854 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=916818.0, ans=0.0 2023-10-12 02:54:42,693 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 02:54:42,968 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.27 vs. 
limit=6.0 2023-10-12 02:54:47,485 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=916818.0, ans=0.0 2023-10-12 02:54:51,783 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=916864.6666666666, ans=0.125 2023-10-12 02:54:56,738 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=916864.6666666666, ans=0.1 2023-10-12 02:54:57,637 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=916864.6666666666, ans=0.125 2023-10-12 02:54:59,797 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.27 vs. limit=15.0 2023-10-12 02:55:09,174 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=916911.3333333334, ans=0.035 2023-10-12 02:55:12,852 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=916958.0, ans=0.0 2023-10-12 02:55:24,469 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.439e+02 1.662e+02 1.821e+02 1.998e+02 3.002e+02, threshold=3.641e+02, percent-clipped=0.0 2023-10-12 02:55:38,794 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=917051.3333333334, ans=0.1 2023-10-12 02:55:42,126 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.36 vs. limit=15.0 2023-10-12 02:55:45,022 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=917051.3333333334, ans=0.025 2023-10-12 02:55:47,649 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 02:55:57,363 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.75 vs. limit=15.0 2023-10-12 02:55:59,319 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=917144.6666666666, ans=0.0 2023-10-12 02:56:16,013 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=917191.3333333334, ans=0.2 2023-10-12 02:56:16,447 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.58 vs. limit=15.0 2023-10-12 02:56:18,433 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=917191.3333333334, ans=0.1 2023-10-12 02:56:35,905 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=917284.6666666666, ans=0.125 2023-10-12 02:56:52,132 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.24 vs. 
limit=15.0 2023-10-12 02:56:56,893 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=917378.0, ans=0.2 2023-10-12 02:57:03,658 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=917378.0, ans=0.5 2023-10-12 02:57:14,802 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.44 vs. limit=15.0 2023-10-12 02:57:17,577 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.408e+02 1.661e+02 1.823e+02 1.960e+02 3.194e+02, threshold=3.646e+02, percent-clipped=0.0 2023-10-12 02:57:22,774 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=917471.3333333334, ans=0.2 2023-10-12 02:57:36,489 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=917518.0, ans=0.07 2023-10-12 02:57:47,608 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=917564.6666666666, ans=0.2 2023-10-12 02:58:03,059 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=917658.0, ans=0.2 2023-10-12 02:58:09,158 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=6.37 vs. limit=15.0 2023-10-12 02:58:35,998 INFO [train.py:1031] (3/4) Epoch 15, batch 5500, loss[loss=0.2083, simple_loss=0.2945, pruned_loss=0.06111, over 16118.00 frames. ], tot_loss[loss=0.1961, simple_loss=0.2857, pruned_loss=0.05322, over 30671792.49 frames. ], batch size: 296, lr: 2.34e-03, grad_scale: 32.0 2023-10-12 02:59:06,911 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.346e+02 1.660e+02 1.785e+02 1.989e+02 2.915e+02, threshold=3.570e+02, percent-clipped=0.0 2023-10-12 02:59:17,186 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=917984.6666666666, ans=0.2 2023-10-12 02:59:17,973 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=917984.6666666666, ans=0.125 2023-10-12 02:59:25,597 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=917984.6666666666, ans=0.0 2023-10-12 02:59:26,781 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.40 vs. limit=22.5 2023-10-12 03:00:06,984 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=918171.3333333334, ans=0.125 2023-10-12 03:00:09,073 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.87 vs. 
limit=15.0 2023-10-12 03:00:20,277 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=918218.0, ans=0.125 2023-10-12 03:00:33,056 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=918311.3333333334, ans=0.0 2023-10-12 03:00:56,737 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.453e+02 1.766e+02 1.963e+02 2.144e+02 3.137e+02, threshold=3.925e+02, percent-clipped=0.0 2023-10-12 03:01:04,290 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=918404.6666666666, ans=0.125 2023-10-12 03:01:05,249 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=918404.6666666666, ans=0.125 2023-10-12 03:01:12,680 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=918451.3333333334, ans=0.125 2023-10-12 03:01:17,485 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=5.77 vs. limit=15.0 2023-10-12 03:02:01,821 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=918638.0, ans=0.05 2023-10-12 03:02:25,348 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=918731.3333333334, ans=10.0 2023-10-12 03:02:51,984 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.290e+02 1.673e+02 1.868e+02 2.178e+02 3.112e+02, threshold=3.736e+02, percent-clipped=0.0 2023-10-12 03:03:06,236 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=918918.0, ans=0.1 2023-10-12 03:03:09,096 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=918918.0, ans=0.0 2023-10-12 03:03:24,949 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=919011.3333333334, ans=0.0 2023-10-12 03:03:30,072 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=919011.3333333334, ans=0.125 2023-10-12 03:03:38,784 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=919058.0, ans=0.1 2023-10-12 03:03:40,810 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-12 03:03:45,917 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=919104.6666666666, ans=15.0 2023-10-12 03:03:56,744 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.64 vs. 
limit=15.0 2023-10-12 03:04:28,130 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=919244.6666666666, ans=0.125 2023-10-12 03:04:33,089 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=919244.6666666666, ans=0.1 2023-10-12 03:04:33,188 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=919244.6666666666, ans=0.1 2023-10-12 03:04:39,173 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=919291.3333333334, ans=10.0 2023-10-12 03:04:46,194 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.342e+02 1.736e+02 1.896e+02 2.130e+02 3.157e+02, threshold=3.793e+02, percent-clipped=0.0 2023-10-12 03:05:05,737 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=919384.6666666666, ans=0.0 2023-10-12 03:05:12,883 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=919431.3333333334, ans=0.1 2023-10-12 03:05:17,481 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=919431.3333333334, ans=0.1 2023-10-12 03:05:42,113 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.32 vs. limit=15.0 2023-10-12 03:05:45,153 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.94 vs. limit=22.5 2023-10-12 03:05:52,278 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.08 vs. limit=22.5 2023-10-12 03:05:58,412 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 03:06:02,425 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=919618.0, ans=0.125 2023-10-12 03:06:13,186 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.87 vs. 
limit=15.0 2023-10-12 03:06:30,135 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=919758.0, ans=0.125 2023-10-12 03:06:35,801 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=919758.0, ans=0.0 2023-10-12 03:06:42,218 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.423e+02 1.663e+02 1.806e+02 2.065e+02 2.820e+02, threshold=3.611e+02, percent-clipped=0.0 2023-10-12 03:06:51,166 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=919851.3333333334, ans=0.0 2023-10-12 03:06:59,433 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=919851.3333333334, ans=0.125 2023-10-12 03:06:59,545 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=919851.3333333334, ans=0.0 2023-10-12 03:07:26,108 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.74 vs. limit=10.0 2023-10-12 03:07:26,585 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=919991.3333333334, ans=0.125 2023-10-12 03:07:31,417 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=919991.3333333334, ans=0.1 2023-10-12 03:07:56,431 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.71 vs. limit=22.5 2023-10-12 03:07:57,619 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=920131.3333333334, ans=0.125 2023-10-12 03:07:58,315 INFO [train.py:1031] (3/4) Epoch 15, batch 6000, loss[loss=0.1893, simple_loss=0.2777, pruned_loss=0.05045, over 16884.00 frames. ], tot_loss[loss=0.1965, simple_loss=0.2861, pruned_loss=0.05346, over 31166406.73 frames. ], batch size: 110, lr: 2.34e-03, grad_scale: 16.0 2023-10-12 03:08:12,676 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.49 vs. 
limit=15.0 2023-10-12 03:08:26,476 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=920224.6666666666, ans=0.07 2023-10-12 03:08:27,369 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=920224.6666666666, ans=0.0 2023-10-12 03:08:28,442 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=920224.6666666666, ans=0.125 2023-10-12 03:08:30,745 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=920271.3333333334, ans=0.125 2023-10-12 03:08:32,176 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.400e+02 1.695e+02 1.842e+02 2.023e+02 2.720e+02, threshold=3.683e+02, percent-clipped=0.0 2023-10-12 03:08:46,822 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=920318.0, ans=0.0 2023-10-12 03:08:50,684 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=920318.0, ans=0.125 2023-10-12 03:08:55,359 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=920364.6666666666, ans=0.125 2023-10-12 03:09:06,444 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=920411.3333333334, ans=0.1 2023-10-12 03:09:11,921 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=920411.3333333334, ans=0.125 2023-10-12 03:09:41,229 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.71 vs. limit=22.5 2023-10-12 03:09:58,540 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=920644.6666666666, ans=0.125 2023-10-12 03:10:02,090 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.20 vs. limit=12.0 2023-10-12 03:10:21,481 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.413e+02 1.709e+02 1.870e+02 2.124e+02 2.739e+02, threshold=3.741e+02, percent-clipped=0.0 2023-10-12 03:10:28,561 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=920738.0, ans=0.2 2023-10-12 03:10:34,793 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=920784.6666666666, ans=0.1 2023-10-12 03:10:40,131 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=920831.3333333334, ans=0.2 2023-10-12 03:10:40,437 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.48 vs. 
limit=22.5 2023-10-12 03:10:55,290 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=920878.0, ans=0.0 2023-10-12 03:10:59,011 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=920878.0, ans=0.1 2023-10-12 03:11:00,056 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=920878.0, ans=0.125 2023-10-12 03:11:02,792 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=920924.6666666666, ans=0.125 2023-10-12 03:11:03,238 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.52 vs. limit=15.0 2023-10-12 03:11:09,301 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.59 vs. limit=6.0 2023-10-12 03:11:17,877 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=920971.3333333334, ans=0.0 2023-10-12 03:11:24,120 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=920971.3333333334, ans=0.1 2023-10-12 03:11:32,495 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.24 vs. limit=15.0 2023-10-12 03:11:45,247 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.62 vs. limit=15.0 2023-10-12 03:11:49,953 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.82 vs. limit=15.0 2023-10-12 03:12:03,748 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.59 vs. limit=15.0 2023-10-12 03:12:14,110 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.312e+02 1.782e+02 1.949e+02 2.266e+02 3.079e+02, threshold=3.898e+02, percent-clipped=0.0 2023-10-12 03:12:32,638 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=921298.0, ans=0.015 2023-10-12 03:12:44,167 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=921344.6666666666, ans=0.0 2023-10-12 03:12:50,518 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=921344.6666666666, ans=0.125 2023-10-12 03:12:57,241 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=921391.3333333334, ans=0.0 2023-10-12 03:13:41,150 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=921531.3333333334, ans=0.125 2023-10-12 03:13:42,278 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.54 vs. 
limit=22.5 2023-10-12 03:13:52,002 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=921624.6666666666, ans=0.125 2023-10-12 03:13:52,747 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=921624.6666666666, ans=0.125 2023-10-12 03:14:08,168 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.506e+02 1.862e+02 2.069e+02 2.311e+02 3.162e+02, threshold=4.139e+02, percent-clipped=0.0 2023-10-12 03:14:30,167 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=921764.6666666666, ans=0.1 2023-10-12 03:14:35,375 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=921764.6666666666, ans=0.125 2023-10-12 03:15:05,029 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=921858.0, ans=0.0 2023-10-12 03:15:30,135 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=921951.3333333334, ans=0.125 2023-10-12 03:15:31,103 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=921998.0, ans=0.125 2023-10-12 03:15:32,569 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=921998.0, ans=0.125 2023-10-12 03:15:41,872 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=921998.0, ans=0.0 2023-10-12 03:15:44,280 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=922044.6666666666, ans=0.125 2023-10-12 03:15:52,230 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=922044.6666666666, ans=0.125 2023-10-12 03:15:58,318 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=922091.3333333334, ans=0.1 2023-10-12 03:16:04,395 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.41 vs. limit=12.0 2023-10-12 03:16:10,452 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.402e+02 1.622e+02 1.795e+02 2.132e+02 3.159e+02, threshold=3.589e+02, percent-clipped=0.0 2023-10-12 03:16:17,772 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=922184.6666666666, ans=0.1 2023-10-12 03:16:21,042 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.99 vs. limit=15.0 2023-10-12 03:16:27,584 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.34 vs. 
limit=15.0 2023-10-12 03:16:33,008 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=922231.3333333334, ans=0.125 2023-10-12 03:16:38,631 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=922231.3333333334, ans=0.1 2023-10-12 03:16:50,558 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=922278.0, ans=0.1 2023-10-12 03:17:10,729 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=3.86 vs. limit=12.0 2023-10-12 03:17:18,807 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=922418.0, ans=0.125 2023-10-12 03:17:25,859 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=922418.0, ans=0.125 2023-10-12 03:17:31,685 INFO [train.py:1031] (3/4) Epoch 15, batch 6500, loss[loss=0.2166, simple_loss=0.3106, pruned_loss=0.06135, over 16868.00 frames. ], tot_loss[loss=0.1971, simple_loss=0.2867, pruned_loss=0.05374, over 31518684.09 frames. ], batch size: 188, lr: 2.34e-03, grad_scale: 16.0 2023-10-12 03:18:03,106 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=922558.0, ans=0.1 2023-10-12 03:18:10,111 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.71 vs. limit=22.5 2023-10-12 03:18:14,866 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=922604.6666666666, ans=0.125 2023-10-12 03:18:16,641 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=922604.6666666666, ans=0.0 2023-10-12 03:18:17,212 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.436e+02 1.758e+02 1.901e+02 2.104e+02 2.557e+02, threshold=3.802e+02, percent-clipped=0.0 2023-10-12 03:18:34,783 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=922651.3333333334, ans=0.125 2023-10-12 03:18:56,017 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=922744.6666666666, ans=0.2 2023-10-12 03:18:58,167 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.82 vs. limit=6.0 2023-10-12 03:19:11,698 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=922791.3333333334, ans=0.0 2023-10-12 03:19:21,096 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=922838.0, ans=0.04949747468305833 2023-10-12 03:19:27,571 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 03:19:37,553 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.96 vs. 
limit=15.0 2023-10-12 03:19:39,865 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=922931.3333333334, ans=0.125 2023-10-12 03:19:49,821 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.88 vs. limit=15.0 2023-10-12 03:19:57,072 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=923024.6666666666, ans=0.1 2023-10-12 03:20:00,316 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.32 vs. limit=6.0 2023-10-12 03:20:09,938 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=923071.3333333334, ans=0.0 2023-10-12 03:20:11,577 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.450e+02 1.710e+02 1.917e+02 2.096e+02 2.986e+02, threshold=3.834e+02, percent-clipped=0.0 2023-10-12 03:20:32,746 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=923164.6666666666, ans=0.125 2023-10-12 03:20:38,384 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.96 vs. limit=6.0 2023-10-12 03:20:56,361 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.05 vs. limit=15.0 2023-10-12 03:21:11,452 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=923351.3333333334, ans=0.1 2023-10-12 03:21:17,415 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.76 vs. limit=15.0 2023-10-12 03:21:21,438 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=923398.0, ans=0.2 2023-10-12 03:21:31,783 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.13 vs. 
limit=10.0 2023-10-12 03:21:50,837 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=923491.3333333334, ans=0.1 2023-10-12 03:21:51,949 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=923491.3333333334, ans=0.0 2023-10-12 03:21:58,792 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.408e+02 1.657e+02 1.816e+02 2.074e+02 2.899e+02, threshold=3.633e+02, percent-clipped=0.0 2023-10-12 03:21:59,082 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=923538.0, ans=0.125 2023-10-12 03:22:22,951 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=923631.3333333334, ans=0.125 2023-10-12 03:22:29,899 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=923678.0, ans=0.125 2023-10-12 03:22:37,364 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=923678.0, ans=0.0 2023-10-12 03:22:59,847 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=923771.3333333334, ans=0.0 2023-10-12 03:23:02,231 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=923771.3333333334, ans=0.125 2023-10-12 03:23:03,639 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.73 vs. limit=15.0 2023-10-12 03:23:32,277 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=923864.6666666666, ans=0.125 2023-10-12 03:23:46,779 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.35 vs. limit=12.0 2023-10-12 03:24:03,958 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=924004.6666666666, ans=0.0 2023-10-12 03:24:06,361 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=924004.6666666666, ans=0.0 2023-10-12 03:24:07,181 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.413e+02 1.642e+02 1.830e+02 2.021e+02 2.847e+02, threshold=3.659e+02, percent-clipped=0.0 2023-10-12 03:24:39,266 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=924144.6666666666, ans=0.125 2023-10-12 03:25:32,887 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=6.49 vs. 
limit=15.0 2023-10-12 03:25:33,497 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=924331.3333333334, ans=0.0 2023-10-12 03:25:53,059 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=924424.6666666666, ans=0.05 2023-10-12 03:26:01,128 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=924471.3333333334, ans=0.0 2023-10-12 03:26:01,544 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.77 vs. limit=15.0 2023-10-12 03:26:01,886 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.432e+02 1.622e+02 1.788e+02 1.977e+02 2.604e+02, threshold=3.577e+02, percent-clipped=0.0 2023-10-12 03:26:04,127 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=924471.3333333334, ans=0.125 2023-10-12 03:26:09,731 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=924518.0, ans=0.125 2023-10-12 03:26:26,041 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=924564.6666666666, ans=0.1 2023-10-12 03:26:26,469 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.53 vs. limit=15.0 2023-10-12 03:26:34,923 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.06 vs. limit=15.0 2023-10-12 03:26:41,609 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=924611.3333333334, ans=0.125 2023-10-12 03:26:42,420 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=924658.0, ans=0.0 2023-10-12 03:26:42,486 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=924658.0, ans=0.0 2023-10-12 03:26:43,364 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=924658.0, ans=0.125 2023-10-12 03:26:52,718 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=924704.6666666666, ans=0.125 2023-10-12 03:26:57,827 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=924704.6666666666, ans=0.125 2023-10-12 03:26:59,517 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=924704.6666666666, ans=0.125 2023-10-12 03:27:02,314 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=924704.6666666666, ans=0.125 2023-10-12 03:27:14,360 INFO [train.py:1031] (3/4) Epoch 15, batch 7000, loss[loss=0.1825, simple_loss=0.2774, pruned_loss=0.04379, over 16878.00 frames. ], tot_loss[loss=0.1971, simple_loss=0.287, pruned_loss=0.05354, over 31807753.26 frames. 
], batch size: 98, lr: 2.34e-03, grad_scale: 32.0 2023-10-12 03:27:14,598 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=924798.0, ans=0.125 2023-10-12 03:27:33,939 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.11 vs. limit=15.0 2023-10-12 03:27:51,608 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=924938.0, ans=0.05 2023-10-12 03:27:55,943 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.449e+02 1.759e+02 1.899e+02 2.182e+02 3.212e+02, threshold=3.799e+02, percent-clipped=0.0 2023-10-12 03:28:01,854 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=924938.0, ans=0.125 2023-10-12 03:28:16,752 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.33 vs. limit=15.0 2023-10-12 03:28:36,690 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.06 vs. limit=10.0 2023-10-12 03:28:38,149 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=925124.6666666666, ans=0.09899494936611666 2023-10-12 03:28:39,083 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=925124.6666666666, ans=0.1 2023-10-12 03:28:47,030 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=925124.6666666666, ans=0.0 2023-10-12 03:29:11,111 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.42 vs. limit=15.0 2023-10-12 03:29:17,237 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-12 03:29:31,869 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=925311.3333333334, ans=0.09899494936611666 2023-10-12 03:29:35,519 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=925358.0, ans=0.015 2023-10-12 03:29:35,627 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=925358.0, ans=0.2 2023-10-12 03:29:38,458 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=925358.0, ans=0.125 2023-10-12 03:29:42,092 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=925358.0, ans=0.2 2023-10-12 03:29:51,810 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.436e+02 1.810e+02 1.962e+02 2.161e+02 3.267e+02, threshold=3.923e+02, percent-clipped=0.0 2023-10-12 03:30:00,985 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.11 vs. 
limit=15.0 2023-10-12 03:30:03,257 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=925451.3333333334, ans=0.0 2023-10-12 03:30:04,279 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=925451.3333333334, ans=0.125 2023-10-12 03:30:16,371 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=12.22 vs. limit=15.0 2023-10-12 03:30:22,797 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=925544.6666666666, ans=0.125 2023-10-12 03:30:22,927 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=925544.6666666666, ans=0.0 2023-10-12 03:30:27,679 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.96 vs. limit=15.0 2023-10-12 03:30:30,894 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.01 vs. limit=15.0 2023-10-12 03:30:32,096 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 03:30:36,521 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=925591.3333333334, ans=0.2 2023-10-12 03:30:50,705 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=925638.0, ans=0.125 2023-10-12 03:30:59,826 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=925684.6666666666, ans=0.1 2023-10-12 03:30:59,846 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=925684.6666666666, ans=0.2 2023-10-12 03:31:09,287 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=925731.3333333334, ans=22.5 2023-10-12 03:31:28,351 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=925778.0, ans=0.0 2023-10-12 03:31:39,785 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=925778.0, ans=0.125 2023-10-12 03:31:51,688 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=925824.6666666666, ans=0.125 2023-10-12 03:31:56,460 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.391e+02 1.683e+02 1.852e+02 2.052e+02 2.591e+02, threshold=3.704e+02, percent-clipped=0.0 2023-10-12 03:32:06,108 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=925918.0, ans=0.0 2023-10-12 03:32:10,269 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=925918.0, ans=0.125 2023-10-12 03:32:25,179 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=925964.6666666666, ans=0.125 2023-10-12 03:32:46,954 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=926058.0, ans=0.125 2023-10-12 03:32:49,060 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=926058.0, ans=0.0 2023-10-12 03:33:21,572 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.39 vs. limit=15.0 2023-10-12 03:33:37,899 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.37 vs. limit=12.0 2023-10-12 03:33:46,598 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=926291.3333333334, ans=0.125 2023-10-12 03:33:46,870 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.23 vs. limit=22.5 2023-10-12 03:33:49,825 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=926291.3333333334, ans=10.0 2023-10-12 03:33:51,103 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=926291.3333333334, ans=0.0 2023-10-12 03:33:58,950 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.268e+02 1.690e+02 1.804e+02 1.982e+02 2.772e+02, threshold=3.608e+02, percent-clipped=0.0 2023-10-12 03:34:03,461 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=926338.0, ans=0.125 2023-10-12 03:34:16,370 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 03:34:34,810 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=926478.0, ans=0.0 2023-10-12 03:34:45,859 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=926524.6666666666, ans=0.0 2023-10-12 03:35:02,092 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.77 vs. limit=22.5 2023-10-12 03:35:18,635 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=926664.6666666666, ans=0.125 2023-10-12 03:35:18,920 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.18 vs. 
limit=15.0 2023-10-12 03:35:22,104 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=926664.6666666666, ans=0.0 2023-10-12 03:35:25,786 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=926664.6666666666, ans=0.1 2023-10-12 03:35:35,297 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=926711.3333333334, ans=0.0 2023-10-12 03:35:42,093 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=926758.0, ans=0.125 2023-10-12 03:35:55,728 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.407e+02 1.769e+02 2.005e+02 2.243e+02 2.986e+02, threshold=4.010e+02, percent-clipped=0.0 2023-10-12 03:36:20,968 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 03:36:28,150 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=926944.6666666666, ans=0.125 2023-10-12 03:36:43,500 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=926991.3333333334, ans=0.125 2023-10-12 03:36:49,523 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=927038.0, ans=0.125 2023-10-12 03:37:11,343 INFO [train.py:1031] (3/4) Epoch 15, batch 7500, loss[loss=0.1864, simple_loss=0.2862, pruned_loss=0.04333, over 16705.00 frames. ], tot_loss[loss=0.1971, simple_loss=0.2871, pruned_loss=0.05355, over 32063279.89 frames. ], batch size: 81, lr: 2.33e-03, grad_scale: 16.0 2023-10-12 03:37:14,612 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=927131.3333333334, ans=0.125 2023-10-12 03:37:15,760 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=927131.3333333334, ans=0.0 2023-10-12 03:37:48,333 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.492e+02 1.697e+02 1.862e+02 2.064e+02 3.666e+02, threshold=3.724e+02, percent-clipped=0.0 2023-10-12 03:37:51,952 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=927271.3333333334, ans=0.5 2023-10-12 03:38:15,405 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.39 vs. limit=10.0 2023-10-12 03:38:28,391 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.49 vs. limit=22.5 2023-10-12 03:38:37,783 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.36 vs. 
limit=15.0 2023-10-12 03:38:47,626 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=927504.6666666666, ans=0.125 2023-10-12 03:38:47,628 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=927504.6666666666, ans=0.125 2023-10-12 03:38:48,615 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=927504.6666666666, ans=0.1 2023-10-12 03:39:04,509 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=927598.0, ans=0.0 2023-10-12 03:39:09,461 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.05 vs. limit=22.5 2023-10-12 03:39:13,778 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=927598.0, ans=0.125 2023-10-12 03:39:37,771 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.max_abs, batch_count=927691.3333333334, ans=10.0 2023-10-12 03:39:49,811 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.422e+02 1.739e+02 1.863e+02 2.084e+02 2.661e+02, threshold=3.727e+02, percent-clipped=0.0 2023-10-12 03:40:15,117 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=10.14 vs. limit=22.5 2023-10-12 03:40:57,890 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=10.65 vs. limit=22.5 2023-10-12 03:40:58,837 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=927971.3333333334, ans=0.125 2023-10-12 03:40:59,794 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=928018.0, ans=0.125 2023-10-12 03:41:05,732 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=928018.0, ans=0.0 2023-10-12 03:41:10,638 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=928018.0, ans=0.05 2023-10-12 03:41:21,966 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=928064.6666666666, ans=0.125 2023-10-12 03:41:22,988 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=928064.6666666666, ans=0.2 2023-10-12 03:41:30,080 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=928111.3333333334, ans=0.5 2023-10-12 03:41:32,924 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=928111.3333333334, ans=0.125 2023-10-12 03:41:44,856 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.91 vs. 
limit=15.0 2023-10-12 03:41:45,342 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=928158.0, ans=0.2 2023-10-12 03:41:53,300 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.317e+02 1.651e+02 1.791e+02 1.930e+02 3.019e+02, threshold=3.583e+02, percent-clipped=0.0 2023-10-12 03:42:07,981 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=928251.3333333334, ans=0.125 2023-10-12 03:42:18,029 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=928298.0, ans=0.125 2023-10-12 03:42:28,303 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=928344.6666666666, ans=0.1 2023-10-12 03:42:45,090 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=928438.0, ans=0.1 2023-10-12 03:42:57,497 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=928484.6666666666, ans=0.0 2023-10-12 03:42:59,672 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=928484.6666666666, ans=0.0 2023-10-12 03:43:05,852 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=928531.3333333334, ans=0.1 2023-10-12 03:43:07,740 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=928531.3333333334, ans=0.125 2023-10-12 03:43:23,331 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=928578.0, ans=0.125 2023-10-12 03:43:25,116 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=928578.0, ans=0.125 2023-10-12 03:43:26,065 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=928578.0, ans=0.125 2023-10-12 03:43:37,826 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=928624.6666666666, ans=0.1 2023-10-12 03:43:44,701 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=928671.3333333334, ans=0.125 2023-10-12 03:43:50,656 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.511e+02 1.761e+02 1.904e+02 2.172e+02 2.664e+02, threshold=3.809e+02, percent-clipped=0.0 2023-10-12 03:43:55,838 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-12 03:43:59,703 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.54 vs. limit=15.0 2023-10-12 03:44:24,599 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.07 vs. limit=12.0 2023-10-12 03:44:42,615 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.80 vs. 
limit=15.0 2023-10-12 03:44:50,879 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.42 vs. limit=15.0 2023-10-12 03:44:51,616 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=928904.6666666666, ans=0.125 2023-10-12 03:44:53,774 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=928951.3333333334, ans=0.2 2023-10-12 03:44:54,643 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=928951.3333333334, ans=0.2 2023-10-12 03:44:55,572 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=928951.3333333334, ans=0.1 2023-10-12 03:45:09,194 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=928998.0, ans=0.0 2023-10-12 03:45:10,634 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.72 vs. limit=15.0 2023-10-12 03:45:21,577 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=929044.6666666666, ans=0.2 2023-10-12 03:45:36,918 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=929091.3333333334, ans=0.125 2023-10-12 03:45:48,372 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.264e+02 1.669e+02 1.824e+02 2.023e+02 2.861e+02, threshold=3.649e+02, percent-clipped=0.0 2023-10-12 03:45:53,456 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=929184.6666666666, ans=0.95 2023-10-12 03:46:08,504 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=929231.3333333334, ans=0.2 2023-10-12 03:46:15,984 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=929231.3333333334, ans=0.0 2023-10-12 03:46:16,043 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=929231.3333333334, ans=0.125 2023-10-12 03:46:16,137 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=929231.3333333334, ans=0.1 2023-10-12 03:46:23,531 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=929278.0, ans=0.0 2023-10-12 03:46:29,268 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=929324.6666666666, ans=10.0 2023-10-12 03:46:36,954 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=929324.6666666666, ans=0.125 2023-10-12 03:46:49,761 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=929371.3333333334, ans=0.0 2023-10-12 03:46:56,821 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.17 vs. 
limit=12.0 2023-10-12 03:46:58,356 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=929418.0, ans=0.125 2023-10-12 03:47:06,096 INFO [train.py:1031] (3/4) Epoch 15, batch 8000, loss[loss=0.185, simple_loss=0.2536, pruned_loss=0.05816, over 12835.00 frames. ], tot_loss[loss=0.1962, simple_loss=0.2865, pruned_loss=0.05293, over 32248661.75 frames. ], batch size: 440, lr: 2.33e-03, grad_scale: 32.0 2023-10-12 03:47:14,061 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=929464.6666666666, ans=0.125 2023-10-12 03:47:19,249 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.04 vs. limit=12.0 2023-10-12 03:47:20,772 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=929511.3333333334, ans=0.0 2023-10-12 03:47:44,385 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=929604.6666666666, ans=0.125 2023-10-12 03:47:45,833 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.367e+02 1.585e+02 1.702e+02 1.898e+02 3.170e+02, threshold=3.404e+02, percent-clipped=0.0 2023-10-12 03:48:02,428 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=929698.0, ans=0.1 2023-10-12 03:48:09,983 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=929698.0, ans=0.125 2023-10-12 03:48:14,709 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=929744.6666666666, ans=0.125 2023-10-12 03:48:20,701 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 03:48:27,615 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=929791.3333333334, ans=0.5 2023-10-12 03:48:28,776 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=929791.3333333334, ans=0.125 2023-10-12 03:48:32,113 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=929791.3333333334, ans=0.125 2023-10-12 03:48:46,427 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=929884.6666666666, ans=0.04949747468305833 2023-10-12 03:48:51,041 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.82 vs. 
limit=10.0 2023-10-12 03:48:59,698 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=929931.3333333334, ans=0.125 2023-10-12 03:49:04,673 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=929931.3333333334, ans=0.0 2023-10-12 03:49:05,518 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=929978.0, ans=0.1 2023-10-12 03:49:14,306 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=929978.0, ans=0.2 2023-10-12 03:49:20,292 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.59 vs. limit=6.0 2023-10-12 03:49:23,025 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=930024.6666666666, ans=0.0 2023-10-12 03:49:32,774 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.362e+02 1.705e+02 1.792e+02 1.964e+02 2.510e+02, threshold=3.584e+02, percent-clipped=0.0 2023-10-12 03:49:38,219 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=930118.0, ans=0.125 2023-10-12 03:49:50,173 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=930118.0, ans=0.125 2023-10-12 03:50:17,418 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=930211.3333333334, ans=15.0 2023-10-12 03:50:29,581 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=10.74 vs. limit=22.5 2023-10-12 03:51:20,567 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=930444.6666666666, ans=0.125 2023-10-12 03:51:21,812 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.32 vs. 
limit=15.0 2023-10-12 03:51:42,235 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=930538.0, ans=0.05 2023-10-12 03:51:43,934 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.281e+02 1.700e+02 1.811e+02 2.027e+02 2.668e+02, threshold=3.622e+02, percent-clipped=0.0 2023-10-12 03:52:00,631 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=930631.3333333334, ans=0.125 2023-10-12 03:52:04,932 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=930631.3333333334, ans=0.04949747468305833 2023-10-12 03:52:33,954 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=930771.3333333334, ans=0.2 2023-10-12 03:52:35,752 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=930771.3333333334, ans=0.1 2023-10-12 03:52:55,442 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=930864.6666666666, ans=0.1 2023-10-12 03:52:56,196 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=930864.6666666666, ans=10.0 2023-10-12 03:53:13,906 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.47 vs. limit=15.0 2023-10-12 03:53:17,624 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=930911.3333333334, ans=0.125 2023-10-12 03:53:31,179 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=931004.6666666666, ans=0.1 2023-10-12 03:53:35,770 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=931004.6666666666, ans=0.2 2023-10-12 03:53:36,927 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.364e+02 1.703e+02 1.902e+02 2.056e+02 2.906e+02, threshold=3.804e+02, percent-clipped=0.0 2023-10-12 03:54:08,460 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=931144.6666666666, ans=0.125 2023-10-12 03:54:18,516 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_positive, batch_count=931191.3333333334, ans=0.05 2023-10-12 03:54:21,728 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 03:54:21,954 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.54 vs. 
limit=22.5 2023-10-12 03:54:38,919 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 03:55:00,622 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff2.min_abs, batch_count=931378.0, ans=0.1 2023-10-12 03:55:00,647 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=931378.0, ans=0.0 2023-10-12 03:55:11,894 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=931424.6666666666, ans=0.1 2023-10-12 03:55:33,129 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.466e+02 1.816e+02 2.076e+02 2.357e+02 3.127e+02, threshold=4.152e+02, percent-clipped=0.0 2023-10-12 03:55:44,406 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.24 vs. limit=15.0 2023-10-12 03:55:45,612 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=931518.0, ans=0.125 2023-10-12 03:55:46,632 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=931518.0, ans=0.0 2023-10-12 03:56:45,377 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.30 vs. limit=6.0 2023-10-12 03:56:48,273 INFO [train.py:1031] (3/4) Epoch 15, batch 8500, loss[loss=0.1982, simple_loss=0.285, pruned_loss=0.05563, over 15666.00 frames. ], tot_loss[loss=0.1962, simple_loss=0.2867, pruned_loss=0.05285, over 32370928.65 frames. ], batch size: 35, lr: 2.33e-03, grad_scale: 16.0 2023-10-12 03:56:53,477 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=931798.0, ans=0.0 2023-10-12 03:56:59,151 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.46 vs. 
limit=22.5 2023-10-12 03:57:04,105 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=931844.6666666666, ans=0.125 2023-10-12 03:57:10,783 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=931891.3333333334, ans=0.0 2023-10-12 03:57:14,584 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=931891.3333333334, ans=0.1 2023-10-12 03:57:16,917 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=931891.3333333334, ans=0.1 2023-10-12 03:57:30,433 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.453e+02 1.748e+02 1.958e+02 2.292e+02 3.324e+02, threshold=3.916e+02, percent-clipped=0.0 2023-10-12 03:58:31,717 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=932171.3333333334, ans=0.0 2023-10-12 03:58:41,340 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=932218.0, ans=0.125 2023-10-12 03:59:33,326 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=932404.6666666666, ans=0.0 2023-10-12 03:59:33,569 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.60 vs. limit=15.0 2023-10-12 03:59:35,942 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.328e+02 1.689e+02 1.910e+02 2.109e+02 2.909e+02, threshold=3.820e+02, percent-clipped=0.0 2023-10-12 03:59:48,292 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=932451.3333333334, ans=0.1 2023-10-12 03:59:49,274 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=932451.3333333334, ans=0.1 2023-10-12 04:00:14,756 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=12.74 vs. limit=15.0 2023-10-12 04:00:17,336 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=932591.3333333334, ans=0.125 2023-10-12 04:00:23,891 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.87 vs. limit=12.0 2023-10-12 04:00:30,944 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.81 vs. limit=15.0 2023-10-12 04:01:08,572 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.41 vs. limit=22.5 2023-10-12 04:01:19,993 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=932824.6666666666, ans=0.95 2023-10-12 04:01:22,494 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.88 vs. 
limit=15.0 2023-10-12 04:01:23,275 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=932824.6666666666, ans=0.0 2023-10-12 04:01:35,523 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=932871.3333333334, ans=0.1 2023-10-12 04:01:37,393 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=932871.3333333334, ans=0.125 2023-10-12 04:01:39,963 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.360e+02 1.690e+02 1.819e+02 2.014e+02 2.879e+02, threshold=3.638e+02, percent-clipped=0.0 2023-10-12 04:02:04,026 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=932964.6666666666, ans=0.125 2023-10-12 04:02:48,285 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=933151.3333333334, ans=0.0 2023-10-12 04:02:59,202 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=933198.0, ans=0.0 2023-10-12 04:03:04,087 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=933198.0, ans=0.0 2023-10-12 04:03:05,921 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=933198.0, ans=0.07 2023-10-12 04:03:15,391 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=933244.6666666666, ans=0.125 2023-10-12 04:03:24,241 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.27 vs. limit=22.5 2023-10-12 04:03:39,155 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.317e+02 1.696e+02 1.840e+02 2.219e+02 3.082e+02, threshold=3.680e+02, percent-clipped=0.0 2023-10-12 04:03:42,018 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.75 vs. 
limit=10.0 2023-10-12 04:03:54,050 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=933431.3333333334, ans=0.125 2023-10-12 04:04:06,368 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=933478.0, ans=22.5 2023-10-12 04:04:07,602 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=933478.0, ans=0.125 2023-10-12 04:04:08,439 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=933478.0, ans=0.0 2023-10-12 04:04:09,447 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=933478.0, ans=0.125 2023-10-12 04:04:20,768 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=933524.6666666666, ans=0.04949747468305833 2023-10-12 04:04:25,221 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=933571.3333333334, ans=0.1 2023-10-12 04:04:35,054 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.91 vs. limit=10.0 2023-10-12 04:04:48,869 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 04:05:03,056 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=933711.3333333334, ans=0.125 2023-10-12 04:05:21,543 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=933758.0, ans=0.0 2023-10-12 04:05:29,128 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.456e+02 1.760e+02 1.961e+02 2.268e+02 3.397e+02, threshold=3.923e+02, percent-clipped=0.0 2023-10-12 04:05:50,513 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=933898.0, ans=0.1 2023-10-12 04:05:54,416 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.47 vs. limit=22.5 2023-10-12 04:05:58,730 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.95 vs. limit=6.0 2023-10-12 04:05:58,786 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.46 vs. limit=15.0 2023-10-12 04:06:05,508 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=933991.3333333334, ans=0.0 2023-10-12 04:06:29,962 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=934084.6666666666, ans=0.0 2023-10-12 04:06:30,772 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=934084.6666666666, ans=0.1 2023-10-12 04:06:40,696 INFO [train.py:1031] (3/4) Epoch 15, batch 9000, loss[loss=0.2345, simple_loss=0.3231, pruned_loss=0.07297, over 16065.00 frames. 
], tot_loss[loss=0.1958, simple_loss=0.2861, pruned_loss=0.05275, over 32434680.16 frames. ], batch size: 296, lr: 2.32e-03, grad_scale: 32.0 2023-10-12 04:06:51,240 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=934178.0, ans=0.125 2023-10-12 04:07:10,674 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=934224.6666666666, ans=0.125 2023-10-12 04:07:10,798 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff2.min_abs, batch_count=934224.6666666666, ans=0.1 2023-10-12 04:07:15,735 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=934271.3333333334, ans=0.125 2023-10-12 04:07:18,297 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.444e+02 1.704e+02 1.924e+02 2.199e+02 3.028e+02, threshold=3.847e+02, percent-clipped=0.0 2023-10-12 04:07:41,479 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.13 vs. limit=12.0 2023-10-12 04:07:47,645 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 04:08:08,936 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=934504.6666666666, ans=0.2 2023-10-12 04:08:30,776 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=934598.0, ans=0.0 2023-10-12 04:08:42,349 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.39 vs. limit=15.0 2023-10-12 04:09:01,754 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=934738.0, ans=0.0 2023-10-12 04:09:04,213 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.339e+02 1.675e+02 1.807e+02 2.098e+02 2.989e+02, threshold=3.614e+02, percent-clipped=0.0 2023-10-12 04:09:14,775 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=934784.6666666666, ans=0.125 2023-10-12 04:09:15,635 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=934784.6666666666, ans=0.2 2023-10-12 04:09:37,681 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 04:09:59,990 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=934971.3333333334, ans=0.1 2023-10-12 04:10:04,409 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=935018.0, ans=0.125 2023-10-12 04:10:12,456 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=935018.0, ans=0.2 2023-10-12 04:10:16,457 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.26 vs. limit=22.5 2023-10-12 04:10:22,127 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.64 vs. 
limit=15.0 2023-10-12 04:10:36,237 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=935158.0, ans=0.2 2023-10-12 04:10:51,209 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.526e+02 1.759e+02 1.900e+02 2.075e+02 2.991e+02, threshold=3.801e+02, percent-clipped=0.0 2023-10-12 04:11:03,421 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=935251.3333333334, ans=0.2 2023-10-12 04:11:07,986 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=935298.0, ans=0.2 2023-10-12 04:11:15,023 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=935344.6666666666, ans=0.1 2023-10-12 04:11:27,425 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=935391.3333333334, ans=0.0 2023-10-12 04:11:56,245 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-12 04:11:56,573 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.78 vs. limit=22.5 2023-10-12 04:12:00,092 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=935531.3333333334, ans=0.2 2023-10-12 04:12:14,046 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.19 vs. limit=15.0 2023-10-12 04:12:20,926 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=3.56 vs. limit=15.0 2023-10-12 04:12:36,217 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.07 vs. limit=15.0 2023-10-12 04:12:41,993 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.446e+02 1.750e+02 2.046e+02 2.411e+02 3.580e+02, threshold=4.092e+02, percent-clipped=0.0 2023-10-12 04:12:44,435 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=935671.3333333334, ans=0.0 2023-10-12 04:13:24,105 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=935811.3333333334, ans=0.0 2023-10-12 04:13:26,966 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.95 vs. limit=10.0 2023-10-12 04:13:37,243 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=935858.0, ans=0.1 2023-10-12 04:13:44,136 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=935904.6666666666, ans=0.125 2023-10-12 04:13:46,154 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.46 vs. 
limit=15.0 2023-10-12 04:14:03,732 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=935998.0, ans=0.125 2023-10-12 04:14:20,426 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=936044.6666666666, ans=0.0 2023-10-12 04:14:30,551 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 04:14:30,766 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.21 vs. limit=15.0 2023-10-12 04:14:32,854 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=936091.3333333334, ans=0.0 2023-10-12 04:14:38,449 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=936138.0, ans=0.125 2023-10-12 04:14:39,469 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=936138.0, ans=0.125 2023-10-12 04:14:39,494 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=936138.0, ans=0.2 2023-10-12 04:14:46,373 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.420e+02 1.800e+02 1.944e+02 2.285e+02 3.152e+02, threshold=3.887e+02, percent-clipped=0.0 2023-10-12 04:14:57,811 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=936184.6666666666, ans=0.0 2023-10-12 04:15:02,188 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=936231.3333333334, ans=0.1 2023-10-12 04:15:07,068 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.48 vs. limit=15.0 2023-10-12 04:15:23,644 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=936324.6666666666, ans=0.0 2023-10-12 04:15:28,379 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=936324.6666666666, ans=0.2 2023-10-12 04:15:59,352 INFO [train.py:1031] (3/4) Epoch 15, batch 9500, loss[loss=0.1876, simple_loss=0.2863, pruned_loss=0.0445, over 16946.00 frames. ], tot_loss[loss=0.1963, simple_loss=0.2867, pruned_loss=0.05292, over 32535452.02 frames. ], batch size: 93, lr: 2.32e-03, grad_scale: 16.0 2023-10-12 04:16:00,227 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=936464.6666666666, ans=0.0 2023-10-12 04:16:06,600 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 04:16:07,959 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.52 vs. limit=15.0 2023-10-12 04:16:09,075 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.64 vs. 
limit=15.0 2023-10-12 04:16:17,889 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 04:16:29,178 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=936558.0, ans=0.1 2023-10-12 04:16:33,450 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=936604.6666666666, ans=0.0 2023-10-12 04:16:40,026 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.429e+02 1.777e+02 2.002e+02 2.178e+02 2.753e+02, threshold=4.004e+02, percent-clipped=0.0 2023-10-12 04:16:44,971 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff3.min_abs, batch_count=936651.3333333334, ans=0.2 2023-10-12 04:16:48,055 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=936651.3333333334, ans=0.125 2023-10-12 04:17:34,534 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=936838.0, ans=0.09899494936611666 2023-10-12 04:17:34,556 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=936838.0, ans=0.0 2023-10-12 04:17:39,637 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.69 vs. limit=15.0 2023-10-12 04:17:42,220 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=936884.6666666666, ans=0.125 2023-10-12 04:18:00,797 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=936931.3333333334, ans=0.0 2023-10-12 04:18:03,243 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=936978.0, ans=0.125 2023-10-12 04:18:12,623 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=936978.0, ans=0.0 2023-10-12 04:18:15,361 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=937024.6666666666, ans=0.125 2023-10-12 04:18:28,381 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 04:18:34,596 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.407e+02 1.815e+02 2.001e+02 2.371e+02 3.138e+02, threshold=4.003e+02, percent-clipped=0.0 2023-10-12 04:18:37,425 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=937118.0, ans=0.125 2023-10-12 04:18:44,237 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=937118.0, ans=0.0 2023-10-12 04:18:47,512 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=937118.0, ans=0.125 2023-10-12 04:18:53,163 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=937164.6666666666, ans=0.125 2023-10-12 04:19:00,727 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=937211.3333333334, 
ans=0.1 2023-10-12 04:19:23,234 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=937304.6666666666, ans=0.125 2023-10-12 04:20:06,832 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.23 vs. limit=10.0 2023-10-12 04:20:08,313 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=937491.3333333334, ans=0.125 2023-10-12 04:20:09,829 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=937491.3333333334, ans=0.125 2023-10-12 04:20:16,736 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=937491.3333333334, ans=0.0 2023-10-12 04:20:17,572 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=937538.0, ans=0.125 2023-10-12 04:20:17,615 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=937538.0, ans=0.5 2023-10-12 04:20:25,033 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.398e+02 1.688e+02 1.828e+02 1.993e+02 3.347e+02, threshold=3.656e+02, percent-clipped=0.0 2023-10-12 04:20:27,109 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=937584.6666666666, ans=0.125 2023-10-12 04:20:37,250 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.98 vs. limit=6.0 2023-10-12 04:20:38,133 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=937584.6666666666, ans=0.0 2023-10-12 04:20:48,666 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=937631.3333333334, ans=0.2 2023-10-12 04:21:40,911 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.40 vs. limit=22.5 2023-10-12 04:22:20,193 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.373e+02 1.707e+02 1.887e+02 2.106e+02 2.847e+02, threshold=3.774e+02, percent-clipped=0.0 2023-10-12 04:23:01,516 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=938191.3333333334, ans=0.0 2023-10-12 04:23:02,319 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=938191.3333333334, ans=0.125 2023-10-12 04:23:15,310 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.69 vs. limit=12.0 2023-10-12 04:23:19,594 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=938284.6666666666, ans=0.125 2023-10-12 04:23:26,123 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=938284.6666666666, ans=0.125 2023-10-12 04:23:34,355 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.90 vs. 
limit=15.0 2023-10-12 04:23:36,907 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=938331.3333333334, ans=0.125 2023-10-12 04:23:43,237 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=938378.0, ans=0.125 2023-10-12 04:24:05,672 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.417e+02 1.748e+02 1.910e+02 2.064e+02 2.716e+02, threshold=3.820e+02, percent-clipped=0.0 2023-10-12 04:24:07,565 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=938471.3333333334, ans=0.1 2023-10-12 04:24:25,248 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=938564.6666666666, ans=0.0 2023-10-12 04:24:25,466 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.75 vs. limit=10.0 2023-10-12 04:24:44,853 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=938658.0, ans=0.0 2023-10-12 04:24:52,971 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=938704.6666666666, ans=0.125 2023-10-12 04:25:11,166 INFO [train.py:1031] (3/4) Epoch 15, batch 10000, loss[loss=0.1935, simple_loss=0.2875, pruned_loss=0.04971, over 16903.00 frames. ], tot_loss[loss=0.1955, simple_loss=0.2859, pruned_loss=0.05257, over 32596586.43 frames. ], batch size: 82, lr: 2.32e-03, grad_scale: 32.0 2023-10-12 04:25:15,654 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.63 vs. 
limit=15.0 2023-10-12 04:25:31,513 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=938844.6666666666, ans=0.2 2023-10-12 04:25:32,419 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff2.min_abs, batch_count=938891.3333333334, ans=0.1 2023-10-12 04:25:52,043 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.457e+02 1.702e+02 1.906e+02 2.079e+02 2.703e+02, threshold=3.811e+02, percent-clipped=0.0 2023-10-12 04:26:15,984 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=939078.0, ans=0.0 2023-10-12 04:26:18,731 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=939078.0, ans=0.2 2023-10-12 04:27:27,476 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=939358.0, ans=0.1 2023-10-12 04:27:46,007 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.463e+02 1.813e+02 2.045e+02 2.298e+02 3.543e+02, threshold=4.090e+02, percent-clipped=0.0 2023-10-12 04:28:01,033 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=939498.0, ans=0.125 2023-10-12 04:28:08,318 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=939498.0, ans=0.125 2023-10-12 04:28:11,507 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=939544.6666666666, ans=0.125 2023-10-12 04:28:23,542 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=939591.3333333334, ans=0.0 2023-10-12 04:28:42,020 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=939638.0, ans=0.0 2023-10-12 04:28:50,113 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=939684.6666666666, ans=0.125 2023-10-12 04:28:55,801 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=939684.6666666666, ans=0.0 2023-10-12 04:29:02,125 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.18 vs. limit=15.0 2023-10-12 04:29:17,284 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=939778.0, ans=0.5 2023-10-12 04:29:29,570 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=939824.6666666666, ans=0.1 2023-10-12 04:29:45,296 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.522e+02 1.762e+02 1.938e+02 2.134e+02 3.322e+02, threshold=3.877e+02, percent-clipped=0.0 2023-10-12 04:30:05,029 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=11.51 vs. 
limit=15.0 2023-10-12 04:30:05,646 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=939964.6666666666, ans=0.125 2023-10-12 04:30:06,682 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=939964.6666666666, ans=0.125 2023-10-12 04:30:14,148 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=940011.3333333334, ans=0.125 2023-10-12 04:30:14,184 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=940011.3333333334, ans=0.2 2023-10-12 04:30:15,314 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=940011.3333333334, ans=0.0 2023-10-12 04:30:34,225 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=940058.0, ans=0.125 2023-10-12 04:30:47,776 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten.whitening_limit, batch_count=940151.3333333334, ans=22.5 2023-10-12 04:30:56,919 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=940151.3333333334, ans=0.0 2023-10-12 04:30:56,992 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.38 vs. limit=22.5 2023-10-12 04:31:08,007 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=940198.0, ans=0.125 2023-10-12 04:31:31,586 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=940291.3333333334, ans=0.125 2023-10-12 04:31:33,683 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=940338.0, ans=0.0 2023-10-12 04:31:40,805 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.408e+02 1.757e+02 1.932e+02 2.214e+02 2.714e+02, threshold=3.864e+02, percent-clipped=0.0 2023-10-12 04:31:44,900 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=940384.6666666666, ans=0.2 2023-10-12 04:31:53,492 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=940384.6666666666, ans=0.125 2023-10-12 04:32:07,641 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=940431.3333333334, ans=0.0 2023-10-12 04:32:22,865 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=940524.6666666666, ans=0.125 2023-10-12 04:32:29,128 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.18 vs. limit=15.0 2023-10-12 04:32:33,090 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=940571.3333333334, ans=10.0 2023-10-12 04:32:33,210 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.72 vs. 
limit=15.0 2023-10-12 04:33:05,185 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=940664.6666666666, ans=0.125 2023-10-12 04:33:06,543 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.40 vs. limit=15.0 2023-10-12 04:33:17,133 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=940711.3333333334, ans=0.0 2023-10-12 04:33:28,772 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=940758.0, ans=0.125 2023-10-12 04:33:42,799 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.429e+02 1.776e+02 2.022e+02 2.467e+02 4.054e+02, threshold=4.045e+02, percent-clipped=2.0 2023-10-12 04:33:54,108 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=940851.3333333334, ans=0.125 2023-10-12 04:34:03,455 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=940898.0, ans=0.125 2023-10-12 04:34:08,048 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=940944.6666666666, ans=0.125 2023-10-12 04:34:15,327 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.99 vs. limit=15.0 2023-10-12 04:34:24,121 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.98 vs. limit=15.0 2023-10-12 04:34:27,774 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=940991.3333333334, ans=0.0 2023-10-12 04:34:40,150 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=941038.0, ans=0.125 2023-10-12 04:34:51,526 INFO [train.py:1031] (3/4) Epoch 15, batch 10500, loss[loss=0.1911, simple_loss=0.2863, pruned_loss=0.04796, over 16788.00 frames. ], tot_loss[loss=0.196, simple_loss=0.2863, pruned_loss=0.0528, over 32621083.13 frames. ], batch size: 175, lr: 2.32e-03, grad_scale: 32.0 2023-10-12 04:35:08,051 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=941178.0, ans=0.125 2023-10-12 04:35:09,205 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.66 vs. limit=15.0 2023-10-12 04:35:10,919 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=941178.0, ans=0.125 2023-10-12 04:35:24,035 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.78 vs. 
limit=10.0 2023-10-12 04:35:31,227 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.381e+02 1.645e+02 1.828e+02 2.081e+02 2.557e+02, threshold=3.655e+02, percent-clipped=0.0 2023-10-12 04:35:32,454 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=941271.3333333334, ans=0.125 2023-10-12 04:35:41,581 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=941318.0, ans=0.1 2023-10-12 04:36:12,907 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.00 vs. limit=15.0 2023-10-12 04:36:17,466 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=941458.0, ans=0.125 2023-10-12 04:36:25,264 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=941504.6666666666, ans=0.125 2023-10-12 04:37:02,836 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=941644.6666666666, ans=0.125 2023-10-12 04:37:03,921 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=941644.6666666666, ans=0.2 2023-10-12 04:37:16,930 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=941691.3333333334, ans=0.125 2023-10-12 04:37:34,414 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.482e+02 1.759e+02 1.918e+02 2.153e+02 2.989e+02, threshold=3.836e+02, percent-clipped=0.0 2023-10-12 04:37:55,081 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=941831.3333333334, ans=0.0 2023-10-12 04:37:57,823 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.52 vs. limit=10.0 2023-10-12 04:38:06,234 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=3.59 vs. limit=15.0 2023-10-12 04:38:17,172 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=941924.6666666666, ans=0.025 2023-10-12 04:38:28,230 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 04:38:56,369 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=942064.6666666666, ans=0.125 2023-10-12 04:39:09,885 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=942111.3333333334, ans=0.1 2023-10-12 04:39:11,882 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.49 vs. 
limit=15.0 2023-10-12 04:39:21,916 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=942158.0, ans=0.1 2023-10-12 04:39:22,734 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=942158.0, ans=0.125 2023-10-12 04:39:35,721 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.398e+02 1.747e+02 1.897e+02 2.079e+02 3.171e+02, threshold=3.793e+02, percent-clipped=0.0 2023-10-12 04:39:43,104 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=942251.3333333334, ans=0.0 2023-10-12 04:39:50,062 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=942298.0, ans=0.125 2023-10-12 04:39:52,778 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=942298.0, ans=0.0 2023-10-12 04:40:09,757 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.77 vs. limit=15.0 2023-10-12 04:40:10,840 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.16 vs. limit=15.0 2023-10-12 04:40:17,560 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.87 vs. limit=15.0 2023-10-12 04:40:35,961 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=942438.0, ans=0.125 2023-10-12 04:40:42,460 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=942484.6666666666, ans=0.125 2023-10-12 04:40:42,669 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.89 vs. limit=15.0 2023-10-12 04:40:57,536 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=6.84 vs. 
limit=15.0 2023-10-12 04:41:23,492 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=942671.3333333334, ans=0.1 2023-10-12 04:41:29,733 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.405e+02 1.689e+02 1.891e+02 2.126e+02 2.910e+02, threshold=3.782e+02, percent-clipped=0.0 2023-10-12 04:41:29,951 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=942671.3333333334, ans=0.0 2023-10-12 04:41:31,876 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=942718.0, ans=0.125 2023-10-12 04:41:34,777 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=942718.0, ans=0.0 2023-10-12 04:41:54,683 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=942811.3333333334, ans=0.125 2023-10-12 04:42:04,240 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=942811.3333333334, ans=0.1 2023-10-12 04:42:09,036 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=942858.0, ans=0.125 2023-10-12 04:42:23,486 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=942904.6666666666, ans=0.125 2023-10-12 04:42:47,487 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=942998.0, ans=0.125 2023-10-12 04:42:56,120 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=943044.6666666666, ans=0.0 2023-10-12 04:43:22,348 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.388e+02 1.614e+02 1.745e+02 1.899e+02 2.509e+02, threshold=3.489e+02, percent-clipped=0.0 2023-10-12 04:44:04,892 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.04 vs. limit=15.0 2023-10-12 04:44:13,621 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=943371.3333333334, ans=0.07 2023-10-12 04:44:30,142 INFO [train.py:1031] (3/4) Epoch 15, batch 11000, loss[loss=0.2459, simple_loss=0.3173, pruned_loss=0.08726, over 15807.00 frames. ], tot_loss[loss=0.1958, simple_loss=0.2863, pruned_loss=0.05269, over 32671941.82 frames. ], batch size: 350, lr: 2.31e-03, grad_scale: 16.0 2023-10-12 04:44:39,968 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=943464.6666666666, ans=0.0 2023-10-12 04:44:44,843 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=943511.3333333334, ans=0.0 2023-10-12 04:44:45,692 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=943511.3333333334, ans=0.2 2023-10-12 04:44:54,911 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.36 vs. 
limit=15.0 2023-10-12 04:45:03,524 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=943604.6666666666, ans=0.125 2023-10-12 04:45:12,403 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.364e+02 1.679e+02 1.854e+02 2.071e+02 3.107e+02, threshold=3.708e+02, percent-clipped=0.0 2023-10-12 04:45:48,771 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=943744.6666666666, ans=0.125 2023-10-12 04:45:56,520 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=943791.3333333334, ans=0.125 2023-10-12 04:46:23,282 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=943884.6666666666, ans=0.0 2023-10-12 04:46:55,832 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.81 vs. limit=15.0 2023-10-12 04:47:16,872 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=944071.3333333334, ans=0.2 2023-10-12 04:47:18,053 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.231e+02 1.677e+02 1.834e+02 2.076e+02 3.870e+02, threshold=3.669e+02, percent-clipped=1.0 2023-10-12 04:47:21,370 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=944118.0, ans=0.0 2023-10-12 04:47:28,148 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=944118.0, ans=0.125 2023-10-12 04:47:28,223 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=944118.0, ans=0.0 2023-10-12 04:47:40,389 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=944164.6666666666, ans=0.95 2023-10-12 04:47:47,342 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=944211.3333333334, ans=0.125 2023-10-12 04:47:51,270 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.09 vs. 
limit=15.0 2023-10-12 04:48:22,820 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=944351.3333333334, ans=0.125 2023-10-12 04:48:27,317 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 04:48:31,958 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=944398.0, ans=0.125 2023-10-12 04:48:33,777 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=944398.0, ans=0.125 2023-10-12 04:48:36,966 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=944444.6666666666, ans=0.05 2023-10-12 04:48:55,390 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=944491.3333333334, ans=0.04949747468305833 2023-10-12 04:49:07,374 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=944538.0, ans=0.125 2023-10-12 04:49:11,161 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.410e+02 1.686e+02 1.962e+02 2.213e+02 3.457e+02, threshold=3.924e+02, percent-clipped=0.0 2023-10-12 04:49:20,132 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.21 vs. limit=15.0 2023-10-12 04:49:26,858 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=944631.3333333334, ans=0.1 2023-10-12 04:49:30,296 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=944631.3333333334, ans=0.0 2023-10-12 04:49:33,177 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=944678.0, ans=0.1 2023-10-12 04:50:01,129 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.83 vs. limit=15.0 2023-10-12 04:50:45,946 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=944911.3333333334, ans=0.125 2023-10-12 04:50:49,660 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=944958.0, ans=0.0 2023-10-12 04:51:07,641 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.386e+02 1.685e+02 1.820e+02 1.989e+02 2.541e+02, threshold=3.640e+02, percent-clipped=0.0 2023-10-12 04:51:22,771 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=945098.0, ans=0.125 2023-10-12 04:51:24,087 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.76 vs. limit=15.0 2023-10-12 04:51:33,808 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.40 vs. 
limit=12.0 2023-10-12 04:51:46,881 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=945191.3333333334, ans=0.1 2023-10-12 04:52:13,434 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=945284.6666666666, ans=0.125 2023-10-12 04:52:16,271 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=945284.6666666666, ans=0.1 2023-10-12 04:52:21,062 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=945331.3333333334, ans=0.125 2023-10-12 04:52:25,036 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=945331.3333333334, ans=0.0 2023-10-12 04:52:40,337 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=945424.6666666666, ans=0.0 2023-10-12 04:52:42,221 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=945424.6666666666, ans=0.125 2023-10-12 04:52:42,314 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=945424.6666666666, ans=0.125 2023-10-12 04:52:45,485 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=945424.6666666666, ans=0.1 2023-10-12 04:52:46,406 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=945424.6666666666, ans=0.0 2023-10-12 04:52:56,959 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=945471.3333333334, ans=0.0 2023-10-12 04:53:06,094 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.453e+02 1.755e+02 1.894e+02 2.250e+02 3.690e+02, threshold=3.788e+02, percent-clipped=1.0 2023-10-12 04:53:16,854 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=945518.0, ans=0.1 2023-10-12 04:53:37,025 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.85 vs. 
limit=22.5 2023-10-12 04:53:40,643 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=945658.0, ans=0.05 2023-10-12 04:53:58,511 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=945704.6666666666, ans=0.02 2023-10-12 04:54:04,470 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=945751.3333333334, ans=0.125 2023-10-12 04:54:07,915 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=945751.3333333334, ans=0.125 2023-10-12 04:54:11,819 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 04:54:11,826 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=945798.0, ans=0.0 2023-10-12 04:54:12,392 INFO [train.py:1031] (3/4) Epoch 15, batch 11500, loss[loss=0.2154, simple_loss=0.3032, pruned_loss=0.06379, over 16101.00 frames. ], tot_loss[loss=0.1956, simple_loss=0.286, pruned_loss=0.05258, over 32682057.23 frames. ], batch size: 296, lr: 2.31e-03, grad_scale: 32.0 2023-10-12 04:54:14,081 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=945798.0, ans=0.125 2023-10-12 04:54:34,302 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=945891.3333333334, ans=0.125 2023-10-12 04:54:35,258 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=945891.3333333334, ans=0.125 2023-10-12 04:54:54,687 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.437e+02 1.749e+02 1.922e+02 2.149e+02 2.799e+02, threshold=3.845e+02, percent-clipped=0.0 2023-10-12 04:54:58,852 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=945984.6666666666, ans=0.0 2023-10-12 04:55:07,223 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 04:55:12,662 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=946031.3333333334, ans=0.2 2023-10-12 04:55:33,870 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=946078.0, ans=0.125 2023-10-12 04:55:39,151 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=946124.6666666666, ans=0.125 2023-10-12 04:56:28,765 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 04:56:44,357 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=946358.0, ans=0.2 2023-10-12 04:56:46,411 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=946404.6666666666, ans=0.0 2023-10-12 04:56:54,301 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=946404.6666666666, ans=0.0 2023-10-12 04:56:56,846 INFO 
[optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.395e+02 1.642e+02 1.818e+02 1.985e+02 2.610e+02, threshold=3.636e+02, percent-clipped=0.0 2023-10-12 04:57:15,489 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 04:57:39,573 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=946591.3333333334, ans=0.05 2023-10-12 04:57:56,366 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=946684.6666666666, ans=0.09899494936611666 2023-10-12 04:57:59,438 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=946684.6666666666, ans=0.125 2023-10-12 04:58:04,800 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=946731.3333333334, ans=0.0 2023-10-12 04:58:17,428 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=946778.0, ans=0.125 2023-10-12 04:58:26,586 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=946824.6666666666, ans=0.1 2023-10-12 04:58:26,594 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=946824.6666666666, ans=0.2 2023-10-12 04:58:31,907 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=946824.6666666666, ans=0.125 2023-10-12 04:58:32,810 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=946824.6666666666, ans=0.0 2023-10-12 04:58:32,826 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=946824.6666666666, ans=0.125 2023-10-12 04:58:44,295 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.302e+02 1.825e+02 1.986e+02 2.243e+02 3.156e+02, threshold=3.973e+02, percent-clipped=0.0 2023-10-12 04:58:46,578 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.77 vs. limit=6.0 2023-10-12 04:59:00,000 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=946964.6666666666, ans=0.1 2023-10-12 04:59:09,019 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=946964.6666666666, ans=0.125 2023-10-12 04:59:15,576 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.04 vs. 
limit=12.0 2023-10-12 04:59:43,626 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=947104.6666666666, ans=0.125 2023-10-12 04:59:48,449 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=947104.6666666666, ans=0.1 2023-10-12 04:59:53,627 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=947104.6666666666, ans=0.1 2023-10-12 04:59:55,412 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=947151.3333333334, ans=0.125 2023-10-12 05:00:11,014 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=947198.0, ans=0.125 2023-10-12 05:00:19,085 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=947198.0, ans=0.2 2023-10-12 05:00:29,786 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=947244.6666666666, ans=0.125 2023-10-12 05:00:35,415 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.68 vs. limit=15.0 2023-10-12 05:00:37,703 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=947291.3333333334, ans=0.125 2023-10-12 05:00:37,836 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=947291.3333333334, ans=0.04949747468305833 2023-10-12 05:00:56,324 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.503e+02 1.759e+02 1.969e+02 2.208e+02 3.067e+02, threshold=3.937e+02, percent-clipped=0.0 2023-10-12 05:00:58,286 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.23 vs. limit=15.0 2023-10-12 05:01:03,055 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=947384.6666666666, ans=0.125 2023-10-12 05:01:09,020 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 05:01:16,608 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.76 vs. limit=22.5 2023-10-12 05:01:21,459 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=947478.0, ans=0.0 2023-10-12 05:01:31,691 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=947524.6666666666, ans=0.125 2023-10-12 05:01:33,590 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=947524.6666666666, ans=0.125 2023-10-12 05:02:02,878 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.24 vs. 
limit=10.0 2023-10-12 05:02:06,622 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys.whitening_limit, batch_count=947618.0, ans=6.0 2023-10-12 05:02:08,383 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=947664.6666666666, ans=0.125 2023-10-12 05:02:12,776 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.27 vs. limit=15.0 2023-10-12 05:02:17,920 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=947711.3333333334, ans=0.125 2023-10-12 05:02:51,623 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.391e+02 1.845e+02 2.014e+02 2.346e+02 3.364e+02, threshold=4.028e+02, percent-clipped=0.0 2023-10-12 05:02:57,568 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.57 vs. limit=22.5 2023-10-12 05:03:10,445 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=947898.0, ans=0.125 2023-10-12 05:03:15,825 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.65 vs. limit=12.0 2023-10-12 05:03:30,940 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=947991.3333333334, ans=0.125 2023-10-12 05:03:38,325 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=948038.0, ans=0.125 2023-10-12 05:03:57,890 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=948084.6666666666, ans=0.2 2023-10-12 05:03:59,658 INFO [train.py:1031] (3/4) Epoch 15, batch 12000, loss[loss=0.2014, simple_loss=0.2956, pruned_loss=0.05357, over 16821.00 frames. ], tot_loss[loss=0.1953, simple_loss=0.286, pruned_loss=0.05232, over 32695088.22 frames. ], batch size: 155, lr: 2.31e-03, grad_scale: 32.0 2023-10-12 05:04:07,919 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=948131.3333333334, ans=0.1 2023-10-12 05:04:32,128 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=948224.6666666666, ans=0.125 2023-10-12 05:04:42,991 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.437e+02 1.764e+02 1.914e+02 2.129e+02 3.192e+02, threshold=3.828e+02, percent-clipped=0.0 2023-10-12 05:05:50,644 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=948551.3333333334, ans=0.125 2023-10-12 05:05:57,676 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=11.88 vs. 
limit=22.5 2023-10-12 05:06:41,312 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.407e+02 1.670e+02 1.838e+02 2.068e+02 3.005e+02, threshold=3.675e+02, percent-clipped=0.0 2023-10-12 05:06:48,538 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 05:06:56,363 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=948831.3333333334, ans=0.0 2023-10-12 05:07:08,912 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=948878.0, ans=0.125 2023-10-12 05:07:11,854 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=948924.6666666666, ans=0.2 2023-10-12 05:07:13,681 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=948924.6666666666, ans=0.0 2023-10-12 05:07:14,002 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.38 vs. limit=12.0 2023-10-12 05:07:45,620 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=949064.6666666666, ans=0.0 2023-10-12 05:08:17,035 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=949158.0, ans=0.0 2023-10-12 05:08:22,700 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=949204.6666666666, ans=0.0 2023-10-12 05:08:28,450 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=949204.6666666666, ans=0.125 2023-10-12 05:08:30,273 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.465e+02 1.718e+02 1.869e+02 2.137e+02 3.010e+02, threshold=3.738e+02, percent-clipped=0.0 2023-10-12 05:08:34,523 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=949251.3333333334, ans=0.0 2023-10-12 05:08:36,365 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=949251.3333333334, ans=0.1 2023-10-12 05:08:39,120 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=949251.3333333334, ans=0.0 2023-10-12 05:08:42,245 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=949298.0, ans=0.125 2023-10-12 05:08:56,888 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=949344.6666666666, ans=0.125 2023-10-12 05:09:09,278 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=949391.3333333334, ans=0.125 2023-10-12 05:09:11,741 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=949438.0, ans=0.125 2023-10-12 05:09:17,908 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=949438.0, ans=0.0 2023-10-12 05:09:25,824 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.70 vs. 
limit=15.0
2023-10-12 05:09:37,130 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=949531.3333333334, ans=0.0
2023-10-12 05:09:39,854 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=949531.3333333334, ans=0.125
2023-10-12 05:09:57,692 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=949624.6666666666, ans=0.125
2023-10-12 05:10:14,659 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=949671.3333333334, ans=0.0
2023-10-12 05:10:21,123 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.450e+02 1.772e+02 1.899e+02 2.241e+02 2.900e+02, threshold=3.798e+02, percent-clipped=0.0
2023-10-12 05:10:21,634 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=949718.0, ans=0.125
2023-10-12 05:10:26,436 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=949718.0, ans=0.125
2023-10-12 05:11:05,166 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=949858.0, ans=0.125
2023-10-12 05:11:05,378 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.69 vs. limit=15.0
2023-10-12 05:11:13,192 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=949904.6666666666, ans=0.2
2023-10-12 05:11:19,054 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=949904.6666666666, ans=0.0
2023-10-12 05:11:19,881 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=949904.6666666666, ans=0.125
2023-10-12 05:11:26,105 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.10 vs. limit=10.0
2023-10-12 05:11:30,510 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=949951.3333333334, ans=0.1
2023-10-12 05:11:42,413 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=950044.6666666666, ans=0.125
2023-10-12 05:11:51,519 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=950044.6666666666, ans=0.125
2023-10-12 05:11:55,335 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.29 vs. limit=22.5
2023-10-12 05:12:16,999 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.469e+02 1.672e+02 1.825e+02 2.013e+02 2.874e+02, threshold=3.649e+02, percent-clipped=0.0
2023-10-12 05:12:17,890 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.95 vs. limit=10.0
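The Whitening entries above ([scaling.py:979]) each compare a per-module activation statistic against a limit; the entry just logged reads metric=5.95 vs. limit=10.0, i.e. still under its bound, while a few modules elsewhere in this section run well above theirs. The metric is plausibly a measure of how far the channel covariance of a module's output is from white (a multiple of the identity). A rough sketch of one such statistic is below; this illustrates the idea only and is not the exact computation scaling.py performs:

    import torch

    def whitening_metric_sketch(x: torch.Tensor) -> float:
        # x: activations of shape (num_frames, num_channels).
        # Returns about 1.0 when the channel covariance is a multiple of
        # the identity ("white") and grows as channels become correlated.
        # Illustrative only, not the exact statistic behind these records.
        x = x - x.mean(dim=0)
        cov = (x.T @ x) / x.shape[0]
        num_channels = cov.shape[0]
        return float(num_channels * (cov ** 2).mean()
                     / (torch.diagonal(cov).mean() ** 2))

    x = torch.randn(4000, 256)
    print(whitening_metric_sketch(x))                           # close to 1
    print(whitening_metric_sketch(x @ torch.randn(256, 256)))   # much larger

On this reading, metric vs. limit shows how close each module is to the point where the whitening penalty starts to push its activations back toward a whiter covariance.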
2023-10-12 05:12:23,877 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=950184.6666666666, ans=0.0
2023-10-12 05:12:31,158 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=950231.3333333334, ans=0.0
2023-10-12 05:12:32,026 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=950231.3333333334, ans=0.0
2023-10-12 05:12:37,075 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=950231.3333333334, ans=0.125
2023-10-12 05:12:50,489 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-12 05:12:53,120 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.49 vs. limit=10.0
2023-10-12 05:13:02,056 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.73 vs. limit=22.5
2023-10-12 05:13:24,978 INFO [train.py:1031] (3/4) Epoch 15, batch 12500, loss[loss=0.182, simple_loss=0.2796, pruned_loss=0.04226, over 16933.00 frames. ], tot_loss[loss=0.1951, simple_loss=0.2856, pruned_loss=0.05226, over 32721143.85 frames. ], batch size: 77, lr: 2.30e-03, grad_scale: 8.0
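The training summary just above is internally consistent with a pruned-transducer objective in which the reported loss recombines the two parts as roughly 0.5 * simple_loss + pruned_loss: 0.5 * 0.2796 + 0.04226 = 0.18206, matching loss=0.182 for the batch, and 0.5 * 0.2856 + 0.05226 = 0.19506, matching tot_loss=0.1951. The tot_loss fields are running averages weighted by the frame counts shown ("over 32721143.85 frames"), which is why they move far more slowly than the per-batch figures. A quick check of that arithmetic; note that the 0.5 weight on the simple loss is inferred from these numbers, not read out of train.py:

    def combined_loss(simple_loss: float, pruned_loss: float,
                      simple_loss_scale: float = 0.5) -> float:
        # Recombine the two transducer loss terms the way the log suggests.
        return simple_loss_scale * simple_loss + pruned_loss

    # Batch 12500 above: loss=0.182, simple_loss=0.2796, pruned_loss=0.04226
    assert abs(combined_loss(0.2796, 0.04226) - 0.182) < 5e-4
    # Running total: loss=0.1951, simple_loss=0.2856, pruned_loss=0.05226
    assert abs(combined_loss(0.2856, 0.05226) - 0.1951) < 5e-4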
2023-10-12 05:13:26,966 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=950464.6666666666, ans=0.125
2023-10-12 05:13:37,711 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=950511.3333333334, ans=0.125
2023-10-12 05:13:45,967 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=950558.0, ans=0.0
2023-10-12 05:13:47,164 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=950558.0, ans=0.1
2023-10-12 05:13:57,036 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=950604.6666666666, ans=0.1
2023-10-12 05:14:03,009 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=950604.6666666666, ans=0.0
2023-10-12 05:14:09,799 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.293e+02 1.658e+02 1.823e+02 2.079e+02 2.963e+02, threshold=3.645e+02, percent-clipped=0.0
2023-10-12 05:14:15,807 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=950651.3333333334, ans=0.0
2023-10-12 05:14:33,566 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=950744.6666666666, ans=0.0
2023-10-12 05:14:52,805 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.18 vs. limit=15.0
2023-10-12 05:15:36,311 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=950978.0, ans=0.2
2023-10-12 05:15:57,602 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=951071.3333333334, ans=0.0
2023-10-12 05:16:02,419 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=951118.0, ans=0.0
2023-10-12 05:16:03,409 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=951118.0, ans=0.1
2023-10-12 05:16:05,027 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.429e+02 1.702e+02 1.864e+02 2.060e+02 3.066e+02, threshold=3.727e+02, percent-clipped=0.0
2023-10-12 05:16:14,996 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=951164.6666666666, ans=0.125
2023-10-12 05:16:16,103 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=951164.6666666666, ans=0.1
2023-10-12 05:16:27,364 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=951211.3333333334, ans=0.125
2023-10-12 05:16:34,516 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=951211.3333333334, ans=0.5
2023-10-12 05:16:46,517 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=951258.0, ans=0.125
2023-10-12 05:16:47,485 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=951304.6666666666, ans=0.125
2023-10-12 05:17:04,778 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=951351.3333333334, ans=0.125
2023-10-12 05:17:23,346 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=951444.6666666666, ans=0.1
2023-10-12 05:17:34,593 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=951491.3333333334, ans=0.125
2023-10-12 05:17:34,801 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=951491.3333333334, ans=0.1
2023-10-12 05:17:52,618 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=951584.6666666666, ans=0.125
2023-10-12 05:17:54,189 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.328e+02 1.738e+02 1.912e+02 2.127e+02 3.551e+02, threshold=3.824e+02, percent-clipped=0.0
2023-10-12 05:17:59,209 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=951584.6666666666, ans=0.0
2023-10-12 05:18:02,081 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=951631.3333333334, ans=0.125
2023-10-12 05:18:07,678 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=951631.3333333334, ans=0.125
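Each ScheduledFloat record ([scaling.py:199]) pairs a regularization hyperparameter (a dropout_p, skip_rate, balancer prob, min_abs and so on) with the global batch_count at which it was read and the value ans it evaluated to, which suggests values scheduled against the batch index rather than constants. A minimal piecewise-linear sketch of such a schedule follows; the breakpoints are invented for illustration and are not the recipe's actual schedules:

    import bisect

    class ScheduledFloatSketch:
        # A float that is a piecewise-linear function of the batch count.
        # The (batch_count, value) breakpoints passed in are assumptions.
        def __init__(self, *points):
            self.xs = [p[0] for p in points]
            self.ys = [p[1] for p in points]

        def __call__(self, batch_count: float) -> float:
            if batch_count <= self.xs[0]:
                return self.ys[0]
            if batch_count >= self.xs[-1]:
                return self.ys[-1]
            i = bisect.bisect_right(self.xs, batch_count)
            x0, x1 = self.xs[i - 1], self.xs[i]
            y0, y1 = self.ys[i - 1], self.ys[i]
            return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

    # e.g. a dropout decaying from 0.3 to 0.1 over the first 20k batches:
    dropout_p = ScheduledFloatSketch((0.0, 0.3), (20000.0, 0.1))
    print(dropout_p(10000.0))   # 0.2, mid-ramp
    print(dropout_p(951631.0))  # 0.1, the flat tail

By batch_count ~9.5e5, as here, any such ramp has long since flattened, which is consistent with the dropout entries above reading a steady ans=0.1.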
2023-10-12 05:18:16,459 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=951678.0, ans=0.05
2023-10-12 05:18:17,260 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-12 05:18:20,565 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=951678.0, ans=0.0
2023-10-12 05:18:21,607 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=951678.0, ans=0.025
2023-10-12 05:18:41,791 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=951771.3333333334, ans=0.125
2023-10-12 05:18:45,110 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=951771.3333333334, ans=0.125
2023-10-12 05:18:45,176 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=951771.3333333334, ans=0.125
2023-10-12 05:19:03,694 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.63 vs. limit=15.0
2023-10-12 05:19:18,082 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.99 vs. limit=15.0
2023-10-12 05:19:38,119 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=952004.6666666666, ans=0.125
2023-10-12 05:19:38,872 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=952004.6666666666, ans=0.125
2023-10-12 05:19:42,606 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.382e+02 1.780e+02 1.992e+02 2.295e+02 3.597e+02, threshold=3.984e+02, percent-clipped=0.0
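The [optim.py:471] entry just above makes the clipping rule visible: the five values after "grad-norm quartiles" read naturally as the min/25%/median/75%/max of recently observed gradient norms, and the printed threshold is clipping_scale times the median, here 2.0 * 1.992e+02 = 3.984e+02. Up to display rounding, the same relation holds for the other clipping records in this section, so the threshold adapts to the recent gradient-norm distribution rather than being a fixed constant. A minimal sketch of that bookkeeping; the class name and history length are assumptions, not icefall's actual optimizer:

    from collections import deque

    import numpy as np
    import torch

    class MedianGradClipper:
        # Clip the global gradient norm against clipping_scale * median of
        # a recent gradient-norm history. A simplified sketch of what the
        # [optim.py:471] records suggest, not icefall's optim.py itself.
        def __init__(self, clipping_scale: float = 2.0, history: int = 128):
            self.clipping_scale = clipping_scale
            self.norms = deque(maxlen=history)

        def clip_(self, params) -> float:
            params = [p for p in params if p.grad is not None]
            norm = torch.norm(
                torch.stack([p.grad.detach().norm() for p in params])).item()
            self.norms.append(norm)
            # min / 25% / median / 75% / max, as printed in the log:
            q = np.quantile(list(self.norms), [0.0, 0.25, 0.5, 0.75, 1.0])
            threshold = self.clipping_scale * float(q[2])
            if norm > threshold:  # such batches count toward percent-clipped
                for p in params:
                    p.grad.mul_(threshold / norm)
            return threshold

A real implementation would also let the history warm up before trusting it and would track the percent-clipped statistic; both are omitted here for brevity.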
limit=15.0 2023-10-12 05:20:52,895 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=952331.3333333334, ans=0.125 2023-10-12 05:21:01,909 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=952378.0, ans=0.125 2023-10-12 05:21:02,746 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=952378.0, ans=0.125 2023-10-12 05:21:02,898 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.85 vs. limit=15.0 2023-10-12 05:21:04,730 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.14 vs. limit=15.0 2023-10-12 05:21:05,407 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=952378.0, ans=0.125 2023-10-12 05:21:07,207 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.14 vs. limit=15.0 2023-10-12 05:21:26,283 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.98 vs. limit=15.0 2023-10-12 05:21:26,373 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.00 vs. limit=12.0 2023-10-12 05:21:33,973 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.408e+02 1.728e+02 1.930e+02 2.211e+02 2.796e+02, threshold=3.861e+02, percent-clipped=0.0 2023-10-12 05:21:35,162 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=952518.0, ans=0.0 2023-10-12 05:21:40,514 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=952518.0, ans=0.1 2023-10-12 05:21:40,768 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.44 vs. limit=15.0 2023-10-12 05:21:45,962 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=952564.6666666666, ans=0.125 2023-10-12 05:21:49,189 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=952564.6666666666, ans=0.125 2023-10-12 05:21:49,327 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.49 vs. limit=15.0 2023-10-12 05:21:56,135 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=952611.3333333334, ans=0.125 2023-10-12 05:22:09,819 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=952658.0, ans=0.2 2023-10-12 05:22:31,412 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.04 vs. limit=12.0 2023-10-12 05:22:36,420 INFO [train.py:1031] (3/4) Epoch 15, batch 13000, loss[loss=0.1967, simple_loss=0.2916, pruned_loss=0.05093, over 16048.00 frames. 
], tot_loss[loss=0.1958, simple_loss=0.2864, pruned_loss=0.05261, over 32719677.34 frames. ], batch size: 43, lr: 2.30e-03, grad_scale: 16.0 2023-10-12 05:22:45,818 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=952798.0, ans=0.125 2023-10-12 05:22:45,880 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=952798.0, ans=0.125 2023-10-12 05:23:02,873 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.13 vs. limit=15.0 2023-10-12 05:23:19,623 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.07 vs. limit=12.0 2023-10-12 05:23:19,799 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.59 vs. limit=15.0 2023-10-12 05:23:20,470 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=952938.0, ans=0.2 2023-10-12 05:23:29,797 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.16 vs. limit=6.0 2023-10-12 05:23:31,021 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.476e+02 1.699e+02 1.882e+02 2.096e+02 5.250e+02, threshold=3.765e+02, percent-clipped=1.0 2023-10-12 05:23:39,119 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=952984.6666666666, ans=0.0 2023-10-12 05:23:52,895 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=953078.0, ans=0.2 2023-10-12 05:23:54,759 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.82 vs. limit=10.0 2023-10-12 05:23:56,332 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=953078.0, ans=0.125 2023-10-12 05:23:57,621 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=953078.0, ans=0.125 2023-10-12 05:24:04,489 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=953078.0, ans=15.0 2023-10-12 05:24:16,555 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=953124.6666666666, ans=0.0 2023-10-12 05:24:19,312 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=953171.3333333334, ans=0.0 2023-10-12 05:24:23,810 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=953171.3333333334, ans=0.0 2023-10-12 05:24:42,277 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=953264.6666666666, ans=0.05 2023-10-12 05:24:44,458 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.35 vs. 
limit=8.0 2023-10-12 05:24:44,896 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=953264.6666666666, ans=0.125 2023-10-12 05:24:46,007 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.99 vs. limit=15.0 2023-10-12 05:24:47,181 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=953264.6666666666, ans=0.04949747468305833 2023-10-12 05:25:00,644 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=953358.0, ans=0.1 2023-10-12 05:25:12,987 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=953404.6666666666, ans=0.1 2023-10-12 05:25:22,502 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.70 vs. limit=15.0 2023-10-12 05:25:26,468 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.390e+02 1.657e+02 1.796e+02 1.995e+02 2.607e+02, threshold=3.593e+02, percent-clipped=0.0 2023-10-12 05:25:36,798 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=953498.0, ans=0.0 2023-10-12 05:25:41,178 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=953498.0, ans=0.2 2023-10-12 05:25:43,213 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=953498.0, ans=0.125 2023-10-12 05:25:50,207 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=953544.6666666666, ans=0.0 2023-10-12 05:25:54,137 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=953544.6666666666, ans=0.1 2023-10-12 05:26:01,490 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=953591.3333333334, ans=0.125 2023-10-12 05:26:12,587 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=953638.0, ans=0.125 2023-10-12 05:26:18,566 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=953638.0, ans=0.2 2023-10-12 05:26:28,391 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=953684.6666666666, ans=0.125 2023-10-12 05:26:32,964 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=953684.6666666666, ans=0.125 2023-10-12 05:27:01,421 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=953778.0, ans=0.125 2023-10-12 05:27:04,326 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=953778.0, ans=0.125 2023-10-12 05:27:14,486 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=953824.6666666666, ans=0.0 2023-10-12 05:27:21,706 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=953871.3333333334, ans=0.2 2023-10-12 05:27:26,779 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=953871.3333333334, ans=0.125 2023-10-12 05:27:28,154 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.25 vs. limit=15.0 2023-10-12 05:27:28,169 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.34 vs. limit=15.0 2023-10-12 05:27:31,071 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.326e+02 1.649e+02 1.804e+02 1.974e+02 2.803e+02, threshold=3.608e+02, percent-clipped=0.0 2023-10-12 05:27:34,149 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=953918.0, ans=0.125 2023-10-12 05:27:38,438 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=953964.6666666666, ans=0.2 2023-10-12 05:27:42,352 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.06 vs. limit=15.0 2023-10-12 05:27:56,199 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.74 vs. limit=10.0 2023-10-12 05:28:11,848 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.81 vs. limit=15.0 2023-10-12 05:28:14,423 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=954104.6666666666, ans=0.0 2023-10-12 05:28:16,710 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.78 vs. 
limit=10.0 2023-10-12 05:28:53,046 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=954244.6666666666, ans=0.125 2023-10-12 05:28:55,611 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=954291.3333333334, ans=0.1 2023-10-12 05:29:05,009 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=954291.3333333334, ans=0.0 2023-10-12 05:29:09,554 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=954338.0, ans=0.2 2023-10-12 05:29:12,387 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=954338.0, ans=0.125 2023-10-12 05:29:15,764 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=954338.0, ans=0.125 2023-10-12 05:29:19,768 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.506e+02 1.811e+02 1.969e+02 2.169e+02 3.755e+02, threshold=3.937e+02, percent-clipped=1.0 2023-10-12 05:29:24,735 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=954384.6666666666, ans=0.0 2023-10-12 05:29:32,708 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=954431.3333333334, ans=0.0 2023-10-12 05:29:36,834 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=954431.3333333334, ans=0.0 2023-10-12 05:29:40,083 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=954478.0, ans=0.125 2023-10-12 05:29:40,295 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.09 vs. limit=15.0 2023-10-12 05:30:10,237 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=954571.3333333334, ans=0.09899494936611666 2023-10-12 05:30:33,550 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.36 vs. limit=22.5 2023-10-12 05:30:34,536 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=954711.3333333334, ans=0.0 2023-10-12 05:30:35,757 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=954711.3333333334, ans=0.1 2023-10-12 05:30:51,306 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=954758.0, ans=0.0 2023-10-12 05:31:01,867 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.02 vs. 
limit=15.0 2023-10-12 05:31:05,438 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=954804.6666666666, ans=0.0 2023-10-12 05:31:09,927 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.465e+02 1.696e+02 1.863e+02 2.056e+02 3.087e+02, threshold=3.725e+02, percent-clipped=0.0 2023-10-12 05:31:24,701 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=954898.0, ans=0.5 2023-10-12 05:31:28,809 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=954944.6666666666, ans=0.0 2023-10-12 05:31:29,576 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=954944.6666666666, ans=0.0 2023-10-12 05:31:36,932 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.93 vs. limit=22.5 2023-10-12 05:31:41,916 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.21 vs. limit=15.0 2023-10-12 05:31:54,871 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=955038.0, ans=0.05 2023-10-12 05:32:05,914 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=955084.6666666666, ans=0.1 2023-10-12 05:32:12,623 INFO [train.py:1031] (3/4) Epoch 15, batch 13500, loss[loss=0.1962, simple_loss=0.2833, pruned_loss=0.05456, over 16678.00 frames. ], tot_loss[loss=0.1955, simple_loss=0.2859, pruned_loss=0.05257, over 32743378.86 frames. ], batch size: 56, lr: 2.30e-03, grad_scale: 32.0 2023-10-12 05:32:15,956 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=955131.3333333334, ans=0.125 2023-10-12 05:32:16,762 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=955131.3333333334, ans=0.0 2023-10-12 05:32:17,871 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.87 vs. limit=15.0 2023-10-12 05:32:31,272 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=955178.0, ans=0.125 2023-10-12 05:32:33,423 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=955224.6666666666, ans=0.125 2023-10-12 05:32:35,400 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.05 vs. 
limit=15.0 2023-10-12 05:32:45,078 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=955271.3333333334, ans=0.0 2023-10-12 05:32:58,821 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.466e+02 1.671e+02 1.841e+02 2.069e+02 3.350e+02, threshold=3.682e+02, percent-clipped=0.0 2023-10-12 05:33:19,623 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=955411.3333333334, ans=0.125 2023-10-12 05:33:34,983 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=955458.0, ans=0.0 2023-10-12 05:33:36,146 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=955458.0, ans=0.125 2023-10-12 05:33:51,781 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=955551.3333333334, ans=0.125 2023-10-12 05:34:04,608 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=955598.0, ans=0.2 2023-10-12 05:34:09,712 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=955598.0, ans=0.125 2023-10-12 05:34:28,044 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.63 vs. limit=22.5 2023-10-12 05:34:40,107 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=955784.6666666666, ans=0.0 2023-10-12 05:34:43,778 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.501e+02 1.705e+02 1.879e+02 2.090e+02 3.403e+02, threshold=3.758e+02, percent-clipped=0.0 2023-10-12 05:34:49,574 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=955831.3333333334, ans=0.125 2023-10-12 05:35:27,649 INFO [train.py:1031] (3/4) Epoch 16, batch 0, loss[loss=0.1702, simple_loss=0.2625, pruned_loss=0.03898, over 16931.00 frames. ], tot_loss[loss=0.1702, simple_loss=0.2625, pruned_loss=0.03898, over 16931.00 frames. ], batch size: 72, lr: 2.22e-03, grad_scale: 32.0 2023-10-12 05:35:27,650 INFO [train.py:1054] (3/4) Computing validation loss 2023-10-12 05:35:36,636 INFO [train.py:1063] (3/4) Epoch 16, validation: loss=0.2168, simple_loss=0.3041, pruned_loss=0.06475, over 1020973.00 frames. 
2023-10-12 05:35:36,637 INFO [train.py:1064] (3/4) Maximum memory allocated so far is 16891MB 2023-10-12 05:35:43,831 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=955854.6666666666, ans=0.125 2023-10-12 05:36:00,674 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten.whitening_limit, batch_count=955948.0, ans=22.5 2023-10-12 05:36:19,840 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=955994.6666666666, ans=0.2 2023-10-12 05:36:31,352 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=956041.3333333334, ans=0.125 2023-10-12 05:36:36,542 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=956088.0, ans=0.0 2023-10-12 05:36:46,411 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.69 vs. limit=6.0 2023-10-12 05:36:50,112 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.66 vs. limit=10.0 2023-10-12 05:37:16,762 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.407e+02 1.710e+02 1.856e+02 2.150e+02 3.512e+02, threshold=3.712e+02, percent-clipped=0.0 2023-10-12 05:37:21,514 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=956274.6666666666, ans=0.1 2023-10-12 05:37:52,006 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=956414.6666666666, ans=0.125 2023-10-12 05:38:02,446 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=956461.3333333334, ans=0.0 2023-10-12 05:38:39,518 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=956601.3333333334, ans=0.0 2023-10-12 05:38:43,648 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=956601.3333333334, ans=0.125 2023-10-12 05:38:59,307 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=956694.6666666666, ans=0.125 2023-10-12 05:39:06,077 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.410e+02 1.679e+02 1.799e+02 2.010e+02 3.014e+02, threshold=3.598e+02, percent-clipped=0.0 2023-10-12 05:39:15,782 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.62 vs. limit=22.5 2023-10-12 05:39:21,837 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.75 vs. 
limit=10.0 2023-10-12 05:39:27,876 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=956834.6666666666, ans=0.0 2023-10-12 05:39:33,660 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=956834.6666666666, ans=0.2 2023-10-12 05:39:34,946 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.55 vs. limit=6.0 2023-10-12 05:39:56,348 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=5.17 vs. limit=15.0 2023-10-12 05:40:16,013 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 05:40:21,276 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=957021.3333333334, ans=0.0 2023-10-12 05:40:40,363 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=2.261e-01 2023-10-12 05:40:41,635 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.21 vs. limit=15.0 2023-10-12 05:40:46,777 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=957161.3333333334, ans=0.125 2023-10-12 05:40:57,035 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.496e+02 1.751e+02 1.952e+02 2.208e+02 3.252e+02, threshold=3.903e+02, percent-clipped=0.0 2023-10-12 05:41:01,208 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=957208.0, ans=0.1 2023-10-12 05:41:03,792 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=957208.0, ans=0.0 2023-10-12 05:41:08,881 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=957254.6666666666, ans=0.0 2023-10-12 05:41:08,885 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=957254.6666666666, ans=0.05 2023-10-12 05:41:09,036 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=957254.6666666666, ans=0.0 2023-10-12 05:41:24,715 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.28 vs. limit=15.0 2023-10-12 05:41:37,310 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.05 vs. 
limit=15.0 2023-10-12 05:41:59,178 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=957441.3333333334, ans=0.0 2023-10-12 05:42:00,077 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=957441.3333333334, ans=0.125 2023-10-12 05:42:15,741 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=957534.6666666666, ans=0.0 2023-10-12 05:42:43,287 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.02 vs. limit=15.0 2023-10-12 05:42:45,744 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.446e+02 1.745e+02 1.983e+02 2.279e+02 3.520e+02, threshold=3.966e+02, percent-clipped=0.0 2023-10-12 05:42:51,847 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.52 vs. limit=12.0 2023-10-12 05:43:12,053 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=957768.0, ans=0.125 2023-10-12 05:43:14,838 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=957768.0, ans=0.125 2023-10-12 05:43:40,654 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=957861.3333333334, ans=0.5 2023-10-12 05:43:52,434 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=957908.0, ans=0.125 2023-10-12 05:44:05,412 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=957954.6666666666, ans=0.0 2023-10-12 05:44:14,359 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=958001.3333333334, ans=0.125 2023-10-12 05:44:42,604 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.405e+02 1.718e+02 1.856e+02 2.058e+02 2.918e+02, threshold=3.712e+02, percent-clipped=0.0 2023-10-12 05:44:46,576 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=958141.3333333334, ans=0.1 2023-10-12 05:44:55,279 INFO [train.py:1031] (3/4) Epoch 16, batch 500, loss[loss=0.1787, simple_loss=0.2782, pruned_loss=0.03956, over 16017.00 frames. ], tot_loss[loss=0.1959, simple_loss=0.2865, pruned_loss=0.05261, over 7308287.33 frames. ], batch size: 43, lr: 2.22e-03, grad_scale: 32.0 2023-10-12 05:44:57,658 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=958188.0, ans=0.1 2023-10-12 05:45:32,668 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.25 vs. 
limit=15.0 2023-10-12 05:45:35,232 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=958328.0, ans=0.125 2023-10-12 05:45:59,102 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=958421.3333333334, ans=0.0 2023-10-12 05:46:18,231 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=958514.6666666666, ans=0.0 2023-10-12 05:46:32,940 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=958561.3333333334, ans=0.125 2023-10-12 05:46:34,072 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.457e+02 1.771e+02 1.982e+02 2.231e+02 2.910e+02, threshold=3.964e+02, percent-clipped=0.0 2023-10-12 05:46:48,246 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=958654.6666666666, ans=0.0 2023-10-12 05:46:49,579 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=958654.6666666666, ans=0.125 2023-10-12 05:46:59,776 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=958701.3333333334, ans=0.125 2023-10-12 05:47:11,311 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.97 vs. limit=15.0 2023-10-12 05:47:29,163 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=958841.3333333334, ans=0.125 2023-10-12 05:47:49,805 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=958888.0, ans=15.0 2023-10-12 05:47:55,137 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=958934.6666666666, ans=0.1 2023-10-12 05:47:55,538 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.22 vs. limit=15.0 2023-10-12 05:48:00,693 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.47 vs. 
limit=15.0 2023-10-12 05:48:01,938 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=958934.6666666666, ans=0.125 2023-10-12 05:48:19,243 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=959028.0, ans=0.1 2023-10-12 05:48:28,200 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=959074.6666666666, ans=0.1 2023-10-12 05:48:28,881 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.451e+02 1.796e+02 2.036e+02 2.346e+02 3.753e+02, threshold=4.071e+02, percent-clipped=0.0 2023-10-12 05:48:30,293 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=959074.6666666666, ans=0.125 2023-10-12 05:48:45,925 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=959121.3333333334, ans=0.0 2023-10-12 05:48:56,958 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=959168.0, ans=0.2 2023-10-12 05:49:06,901 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=959214.6666666666, ans=0.125 2023-10-12 05:49:45,301 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=959354.6666666666, ans=0.125 2023-10-12 05:49:53,936 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=959401.3333333334, ans=0.1 2023-10-12 05:50:04,725 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=959448.0, ans=0.125 2023-10-12 05:50:13,400 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=959494.6666666666, ans=0.125 2023-10-12 05:50:17,334 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=959494.6666666666, ans=0.125 2023-10-12 05:50:20,897 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.330e+02 1.755e+02 1.970e+02 2.252e+02 3.215e+02, threshold=3.940e+02, percent-clipped=0.0 2023-10-12 05:51:05,067 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=959681.3333333334, ans=0.125 2023-10-12 05:51:06,682 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=959728.0, ans=0.07 2023-10-12 05:51:26,104 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=959774.6666666666, ans=0.0 2023-10-12 05:51:34,451 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=959821.3333333334, ans=0.2 2023-10-12 05:51:36,801 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=959821.3333333334, ans=0.035 2023-10-12 05:51:42,804 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=959868.0, ans=0.0 2023-10-12 05:51:59,075 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=959914.6666666666, ans=0.125 2023-10-12 05:52:16,154 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.446e+02 1.754e+02 1.921e+02 2.186e+02 3.689e+02, threshold=3.842e+02, percent-clipped=0.0 2023-10-12 05:52:18,925 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=960008.0, ans=0.1 2023-10-12 05:52:25,860 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=960054.6666666666, ans=0.0 2023-10-12 05:53:17,203 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.31 vs. limit=15.0 2023-10-12 05:53:29,196 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=960288.0, ans=0.2 2023-10-12 05:53:29,514 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.90 vs. limit=6.0 2023-10-12 05:53:30,199 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=960288.0, ans=0.2 2023-10-12 05:53:52,423 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=960381.3333333334, ans=0.0 2023-10-12 05:54:06,640 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=960474.6666666666, ans=0.125 2023-10-12 05:54:08,730 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.285e+02 1.722e+02 1.867e+02 2.124e+02 3.433e+02, threshold=3.733e+02, percent-clipped=0.0 2023-10-12 05:54:18,363 INFO [train.py:1031] (3/4) Epoch 16, batch 1000, loss[loss=0.1939, simple_loss=0.2882, pruned_loss=0.04981, over 16833.00 frames. ], tot_loss[loss=0.1966, simple_loss=0.2871, pruned_loss=0.05308, over 12945982.78 frames. ], batch size: 146, lr: 2.21e-03, grad_scale: 16.0 2023-10-12 05:54:34,784 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=960568.0, ans=0.125 2023-10-12 05:54:58,935 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.86 vs. limit=15.0 2023-10-12 05:55:11,853 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.91 vs. limit=6.0 2023-10-12 05:55:13,031 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=960754.6666666666, ans=0.07 2023-10-12 05:55:24,869 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.92 vs. 
limit=15.0 2023-10-12 05:55:46,326 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=960894.6666666666, ans=0.0 2023-10-12 05:55:53,574 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.399e+02 1.660e+02 1.854e+02 2.039e+02 2.957e+02, threshold=3.708e+02, percent-clipped=0.0 2023-10-12 05:55:54,678 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=960941.3333333334, ans=0.125 2023-10-12 05:55:56,605 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.28 vs. limit=22.5 2023-10-12 05:56:00,427 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=960941.3333333334, ans=0.0 2023-10-12 05:56:04,405 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.84 vs. limit=15.0 2023-10-12 05:56:05,011 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=960988.0, ans=0.125 2023-10-12 05:56:28,763 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=961081.3333333334, ans=0.95 2023-10-12 05:56:39,959 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=961081.3333333334, ans=0.04949747468305833 2023-10-12 05:56:49,235 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=961128.0, ans=0.2 2023-10-12 05:57:07,061 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=961221.3333333334, ans=0.2 2023-10-12 05:57:29,949 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=961268.0, ans=0.0 2023-10-12 05:57:33,041 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=961314.6666666666, ans=0.0 2023-10-12 05:57:37,427 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=961314.6666666666, ans=0.125 2023-10-12 05:57:50,686 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-12 05:57:57,975 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.323e+02 1.715e+02 1.883e+02 2.074e+02 2.784e+02, threshold=3.767e+02, percent-clipped=0.0 2023-10-12 05:58:01,359 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=961408.0, ans=0.0 2023-10-12 05:58:10,114 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=961454.6666666666, ans=0.2 2023-10-12 05:58:17,976 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=961501.3333333334, ans=0.0 2023-10-12 05:58:22,218 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=961501.3333333334, ans=0.125 2023-10-12 05:58:40,328 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=961594.6666666666, ans=0.125 2023-10-12 05:58:46,294 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.62 vs. limit=12.0 2023-10-12 05:58:48,060 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=961594.6666666666, ans=0.0 2023-10-12 05:58:48,941 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=961594.6666666666, ans=0.125 2023-10-12 05:58:52,919 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=961641.3333333334, ans=0.0 2023-10-12 05:59:13,028 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=961734.6666666666, ans=15.0 2023-10-12 05:59:21,449 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=961781.3333333334, ans=0.0 2023-10-12 05:59:44,388 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.340e+02 1.719e+02 1.920e+02 2.178e+02 2.949e+02, threshold=3.841e+02, percent-clipped=0.0 2023-10-12 05:59:44,644 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=961874.6666666666, ans=0.125 2023-10-12 05:59:44,749 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=961874.6666666666, ans=0.125 2023-10-12 06:00:34,264 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=962061.3333333334, ans=0.125 2023-10-12 06:00:35,805 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.07 vs. 
limit=10.0 2023-10-12 06:00:37,778 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=962108.0, ans=0.125 2023-10-12 06:00:39,701 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=962108.0, ans=0.125 2023-10-12 06:00:41,362 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=962108.0, ans=0.0 2023-10-12 06:01:02,827 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=962201.3333333334, ans=0.2 2023-10-12 06:01:24,810 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=962294.6666666666, ans=0.125 2023-10-12 06:01:27,473 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=962294.6666666666, ans=0.125 2023-10-12 06:01:33,506 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.305e+02 1.691e+02 1.874e+02 2.047e+02 3.026e+02, threshold=3.748e+02, percent-clipped=0.0 2023-10-12 06:01:41,844 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=962388.0, ans=0.0 2023-10-12 06:01:53,051 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=962388.0, ans=0.125 2023-10-12 06:02:04,096 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 06:02:08,210 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.02 vs. limit=22.5 2023-10-12 06:02:10,003 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=962481.3333333334, ans=0.125 2023-10-12 06:02:14,165 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=962481.3333333334, ans=0.125 2023-10-12 06:02:36,012 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.89 vs. 
limit=6.0 2023-10-12 06:02:48,354 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=962621.3333333334, ans=0.1 2023-10-12 06:02:52,634 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=962668.0, ans=0.1 2023-10-12 06:02:52,702 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff2.min_abs, batch_count=962668.0, ans=0.1 2023-10-12 06:03:12,075 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=962714.6666666666, ans=0.125 2023-10-12 06:03:14,913 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 06:03:25,784 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.347e+02 1.747e+02 1.954e+02 2.196e+02 3.606e+02, threshold=3.907e+02, percent-clipped=0.0 2023-10-12 06:03:28,747 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=962808.0, ans=15.0 2023-10-12 06:03:37,686 INFO [train.py:1031] (3/4) Epoch 16, batch 1500, loss[loss=0.1726, simple_loss=0.2658, pruned_loss=0.03969, over 16907.00 frames. ], tot_loss[loss=0.1953, simple_loss=0.2855, pruned_loss=0.05252, over 17342695.40 frames. ], batch size: 87, lr: 2.21e-03, grad_scale: 16.0 2023-10-12 06:03:38,064 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=962854.6666666666, ans=0.0 2023-10-12 06:03:55,764 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=962901.3333333334, ans=0.0 2023-10-12 06:04:12,229 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=962994.6666666666, ans=15.0 2023-10-12 06:04:20,495 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=962994.6666666666, ans=0.0 2023-10-12 06:04:23,605 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=963041.3333333334, ans=0.0 2023-10-12 06:04:33,385 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 06:04:55,195 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=963181.3333333334, ans=0.1 2023-10-12 06:05:21,838 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.401e+02 1.704e+02 1.880e+02 2.121e+02 3.433e+02, threshold=3.760e+02, percent-clipped=0.0 2023-10-12 06:05:31,067 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.56 vs. limit=15.0 2023-10-12 06:05:49,791 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=963368.0, ans=0.125 2023-10-12 06:06:06,726 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.33 vs. 
limit=15.0 2023-10-12 06:06:21,980 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.24 vs. limit=15.0 2023-10-12 06:06:35,362 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=963508.0, ans=0.1 2023-10-12 06:06:56,147 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=963601.3333333334, ans=0.0 2023-10-12 06:06:57,111 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=963601.3333333334, ans=0.125 2023-10-12 06:07:24,462 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.422e+02 1.668e+02 1.813e+02 2.065e+02 2.663e+02, threshold=3.626e+02, percent-clipped=0.0 2023-10-12 06:07:34,738 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=963788.0, ans=0.125 2023-10-12 06:07:59,309 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=963881.3333333334, ans=0.125 2023-10-12 06:08:06,845 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.58 vs. limit=22.5 2023-10-12 06:08:12,732 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=963928.0, ans=0.0 2023-10-12 06:08:20,009 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=963974.6666666666, ans=0.1 2023-10-12 06:08:33,540 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.41 vs. limit=15.0 2023-10-12 06:08:38,948 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=964068.0, ans=0.125 2023-10-12 06:08:41,748 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=964068.0, ans=0.0 2023-10-12 06:08:47,805 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=964114.6666666666, ans=0.125 2023-10-12 06:09:17,502 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.478e+02 1.749e+02 1.885e+02 2.087e+02 2.638e+02, threshold=3.771e+02, percent-clipped=0.0 2023-10-12 06:09:39,995 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=964301.3333333334, ans=0.2 2023-10-12 06:09:49,075 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=964348.0, ans=0.125 2023-10-12 06:10:10,373 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=964441.3333333334, ans=0.1 2023-10-12 06:10:15,778 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.07 vs. 
limit=22.5 2023-10-12 06:10:55,450 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=964628.0, ans=0.07 2023-10-12 06:11:08,096 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.437e+02 1.682e+02 1.824e+02 1.948e+02 2.360e+02, threshold=3.648e+02, percent-clipped=0.0 2023-10-12 06:11:17,094 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=964721.3333333334, ans=0.125 2023-10-12 06:11:23,639 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=964721.3333333334, ans=0.125 2023-10-12 06:11:24,494 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=964721.3333333334, ans=0.0 2023-10-12 06:11:29,814 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=964768.0, ans=0.2 2023-10-12 06:11:42,008 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=964814.6666666666, ans=0.125 2023-10-12 06:11:51,915 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 06:11:53,427 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=964861.3333333334, ans=0.125 2023-10-12 06:12:06,917 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.30 vs. limit=12.0 2023-10-12 06:12:12,225 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.02 vs. limit=15.0 2023-10-12 06:12:25,950 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=964954.6666666666, ans=0.125 2023-10-12 06:12:42,514 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=965001.3333333334, ans=0.09899494936611666 2023-10-12 06:12:52,114 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=965048.0, ans=0.125 2023-10-12 06:13:11,055 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.376e+02 1.657e+02 1.830e+02 2.007e+02 2.715e+02, threshold=3.661e+02, percent-clipped=0.0 2023-10-12 06:13:18,573 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.49 vs. limit=15.0 2023-10-12 06:13:22,107 INFO [train.py:1031] (3/4) Epoch 16, batch 2000, loss[loss=0.2052, simple_loss=0.302, pruned_loss=0.05422, over 16565.00 frames. ], tot_loss[loss=0.1954, simple_loss=0.286, pruned_loss=0.05244, over 20756156.23 frames. ], batch size: 266, lr: 2.21e-03, grad_scale: 32.0 2023-10-12 06:13:26,879 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=12.98 vs. 
limit=15.0 2023-10-12 06:13:27,366 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=965188.0, ans=0.2 2023-10-12 06:14:02,500 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=965328.0, ans=0.2 2023-10-12 06:14:07,989 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.85 vs. limit=15.0 2023-10-12 06:14:21,670 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=965374.6666666666, ans=0.0 2023-10-12 06:14:30,947 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=965421.3333333334, ans=0.025 2023-10-12 06:14:36,844 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=965421.3333333334, ans=0.2 2023-10-12 06:14:40,453 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=965468.0, ans=0.0 2023-10-12 06:14:43,265 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.70 vs. limit=10.0 2023-10-12 06:14:43,746 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=965468.0, ans=0.125 2023-10-12 06:15:14,913 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.374e+02 1.732e+02 1.944e+02 2.279e+02 3.575e+02, threshold=3.888e+02, percent-clipped=0.0 2023-10-12 06:15:26,192 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=965654.6666666666, ans=0.1 2023-10-12 06:15:27,623 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=965654.6666666666, ans=0.05 2023-10-12 06:15:27,633 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=965654.6666666666, ans=0.0 2023-10-12 06:15:37,905 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=965654.6666666666, ans=6.0 2023-10-12 06:15:42,108 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=965654.6666666666, ans=0.125 2023-10-12 06:16:20,417 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.80 vs. 
limit=15.0 2023-10-12 06:16:33,075 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=965841.3333333334, ans=0.1 2023-10-12 06:16:38,823 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=965841.3333333334, ans=0.125 2023-10-12 06:16:45,107 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=965888.0, ans=0.0 2023-10-12 06:17:34,464 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=966074.6666666666, ans=0.1 2023-10-12 06:17:37,150 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.387e+02 1.813e+02 1.953e+02 2.190e+02 3.018e+02, threshold=3.906e+02, percent-clipped=0.0 2023-10-12 06:17:43,369 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=966074.6666666666, ans=0.2 2023-10-12 06:18:06,036 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.31 vs. limit=15.0 2023-10-12 06:18:15,106 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=966214.6666666666, ans=0.125 2023-10-12 06:18:46,169 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=966354.6666666666, ans=0.125 2023-10-12 06:18:47,164 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=966354.6666666666, ans=0.125 2023-10-12 06:19:04,081 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=966448.0, ans=0.125 2023-10-12 06:19:08,614 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=966448.0, ans=0.125 2023-10-12 06:19:28,714 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=966541.3333333334, ans=0.125 2023-10-12 06:19:30,973 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.451e+02 1.721e+02 1.921e+02 2.277e+02 2.970e+02, threshold=3.843e+02, percent-clipped=0.0 2023-10-12 06:19:34,055 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=966541.3333333334, ans=0.1 2023-10-12 06:19:50,697 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=966634.6666666666, ans=0.0 2023-10-12 06:19:50,741 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=966634.6666666666, ans=0.125 2023-10-12 06:20:24,976 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.94 vs. 
limit=15.0 2023-10-12 06:20:29,680 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=966774.6666666666, ans=0.125 2023-10-12 06:20:44,457 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=966821.3333333334, ans=0.125 2023-10-12 06:20:55,225 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=966868.0, ans=0.125 2023-10-12 06:20:56,074 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=966868.0, ans=0.125 2023-10-12 06:20:56,201 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=966868.0, ans=0.125 2023-10-12 06:21:22,853 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.329e+02 1.792e+02 1.915e+02 2.154e+02 3.344e+02, threshold=3.830e+02, percent-clipped=0.0 2023-10-12 06:21:35,551 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=967054.6666666666, ans=0.0 2023-10-12 06:21:38,751 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.20 vs. limit=15.0 2023-10-12 06:21:40,911 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.30 vs. limit=15.0 2023-10-12 06:21:54,997 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.16 vs. limit=15.0 2023-10-12 06:22:00,046 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.33 vs. limit=15.0 2023-10-12 06:22:11,156 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=967194.6666666666, ans=0.2 2023-10-12 06:22:26,229 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=967288.0, ans=0.04949747468305833 2023-10-12 06:22:36,545 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=967334.6666666666, ans=0.0 2023-10-12 06:23:11,287 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.min_positive, batch_count=967474.6666666666, ans=0.025 2023-10-12 06:23:12,045 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.465e+02 1.743e+02 1.963e+02 2.200e+02 2.708e+02, threshold=3.927e+02, percent-clipped=0.0 2023-10-12 06:23:18,243 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=967521.3333333334, ans=0.125 2023-10-12 06:23:18,877 INFO [train.py:1031] (3/4) Epoch 16, batch 2500, loss[loss=0.2008, simple_loss=0.2907, pruned_loss=0.05544, over 16934.00 frames. ], tot_loss[loss=0.1956, simple_loss=0.2861, pruned_loss=0.05254, over 23445958.53 frames. 
], batch size: 130, lr: 2.21e-03, grad_scale: 32.0 2023-10-12 06:23:36,287 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=967568.0, ans=0.2 2023-10-12 06:23:41,962 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=967614.6666666666, ans=0.125 2023-10-12 06:24:03,660 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=967708.0, ans=0.0 2023-10-12 06:24:14,177 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=967754.6666666666, ans=0.0 2023-10-12 06:24:43,405 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=967894.6666666666, ans=0.125 2023-10-12 06:24:58,414 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.535e+02 1.731e+02 1.875e+02 2.109e+02 3.032e+02, threshold=3.750e+02, percent-clipped=0.0 2023-10-12 06:25:19,641 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=968034.6666666666, ans=0.2 2023-10-12 06:25:27,737 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=968034.6666666666, ans=0.125 2023-10-12 06:25:32,046 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=968081.3333333334, ans=0.0 2023-10-12 06:25:36,683 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=968081.3333333334, ans=0.0 2023-10-12 06:25:47,709 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.86 vs. limit=15.0 2023-10-12 06:25:54,101 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=968174.6666666666, ans=0.125 2023-10-12 06:26:07,925 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.39 vs. limit=12.0 2023-10-12 06:26:35,404 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=968361.3333333334, ans=0.0 2023-10-12 06:26:36,287 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=968361.3333333334, ans=0.05 2023-10-12 06:26:40,826 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=968361.3333333334, ans=0.95 2023-10-12 06:26:49,899 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.455e+02 1.667e+02 1.847e+02 2.043e+02 3.019e+02, threshold=3.694e+02, percent-clipped=0.0 2023-10-12 06:26:50,298 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=968408.0, ans=0.1 2023-10-12 06:27:00,985 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.62 vs. 
limit=10.0 2023-10-12 06:27:08,147 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=968454.6666666666, ans=0.125 2023-10-12 06:27:43,350 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=968641.3333333334, ans=0.125 2023-10-12 06:28:26,365 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=968781.3333333334, ans=0.125 2023-10-12 06:28:40,896 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=968828.0, ans=0.125 2023-10-12 06:28:54,380 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.391e+02 1.664e+02 1.825e+02 2.112e+02 4.311e+02, threshold=3.651e+02, percent-clipped=1.0 2023-10-12 06:29:05,887 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.83 vs. limit=10.0 2023-10-12 06:29:20,996 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.10 vs. limit=22.5 2023-10-12 06:29:32,804 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.05 vs. limit=15.0 2023-10-12 06:29:42,504 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=969061.3333333334, ans=0.125 2023-10-12 06:29:59,496 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=969108.0, ans=0.0 2023-10-12 06:30:11,197 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=16.43 vs. limit=15.0 2023-10-12 06:30:50,082 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=969294.6666666666, ans=0.1 2023-10-12 06:31:01,036 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.409e+02 1.677e+02 1.872e+02 2.144e+02 2.805e+02, threshold=3.743e+02, percent-clipped=0.0 2023-10-12 06:31:18,815 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=969388.0, ans=0.0 2023-10-12 06:31:25,278 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=969434.6666666666, ans=0.125 2023-10-12 06:31:31,175 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=969481.3333333334, ans=0.0 2023-10-12 06:31:47,382 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.03 vs. limit=15.0 2023-10-12 06:32:04,328 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.30 vs. limit=15.0 2023-10-12 06:32:10,259 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=969621.3333333334, ans=0.125 2023-10-12 06:32:10,553 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.98 vs. 
limit=10.0 2023-10-12 06:32:19,823 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=969668.0, ans=0.125 2023-10-12 06:32:33,802 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.72 vs. limit=15.0 2023-10-12 06:32:48,989 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=969808.0, ans=0.125 2023-10-12 06:32:52,576 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.342e+02 1.654e+02 1.837e+02 2.020e+02 2.615e+02, threshold=3.675e+02, percent-clipped=0.0 2023-10-12 06:32:56,910 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.29 vs. limit=6.0 2023-10-12 06:32:59,726 INFO [train.py:1031] (3/4) Epoch 16, batch 3000, loss[loss=0.1681, simple_loss=0.2361, pruned_loss=0.05009, over 12464.00 frames. ], tot_loss[loss=0.1952, simple_loss=0.2854, pruned_loss=0.05256, over 25524787.12 frames. ], batch size: 440, lr: 2.20e-03, grad_scale: 32.0 2023-10-12 06:33:45,213 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=970041.3333333334, ans=0.07 2023-10-12 06:34:00,026 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.51 vs. limit=15.0 2023-10-12 06:34:14,457 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=970134.6666666666, ans=0.0 2023-10-12 06:34:21,885 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 06:34:42,318 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.466e+02 1.781e+02 1.934e+02 2.170e+02 3.723e+02, threshold=3.868e+02, percent-clipped=0.0 2023-10-12 06:34:45,206 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=970274.6666666666, ans=0.125 2023-10-12 06:34:49,766 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=970321.3333333334, ans=0.0 2023-10-12 06:35:01,525 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.24 vs. limit=15.0 2023-10-12 06:35:18,550 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.33 vs. 
limit=15.0 2023-10-12 06:35:21,090 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=970414.6666666666, ans=0.1 2023-10-12 06:35:51,212 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=970554.6666666666, ans=0.2 2023-10-12 06:35:54,036 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=970554.6666666666, ans=0.2 2023-10-12 06:36:09,911 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=970648.0, ans=0.125 2023-10-12 06:36:18,687 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=970648.0, ans=0.0 2023-10-12 06:36:29,646 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=970694.6666666666, ans=0.125 2023-10-12 06:36:41,874 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.485e+02 1.711e+02 1.913e+02 2.244e+02 3.214e+02, threshold=3.827e+02, percent-clipped=0.0 2023-10-12 06:36:43,625 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=970741.3333333334, ans=0.2 2023-10-12 06:36:47,714 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=970788.0, ans=0.09899494936611666 2023-10-12 06:37:00,431 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=970834.6666666666, ans=0.0 2023-10-12 06:37:33,377 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=970928.0, ans=0.1 2023-10-12 06:38:46,642 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.486e+02 1.730e+02 1.929e+02 2.214e+02 3.129e+02, threshold=3.858e+02, percent-clipped=0.0 2023-10-12 06:38:54,069 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=971254.6666666666, ans=0.0 2023-10-12 06:39:00,499 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=3.98 vs. limit=15.0 2023-10-12 06:39:11,806 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.13 vs. limit=22.5 2023-10-12 06:39:19,361 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=971348.0, ans=0.0 2023-10-12 06:39:30,930 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=971394.6666666666, ans=0.1 2023-10-12 06:39:31,813 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=971394.6666666666, ans=0.2 2023-10-12 06:40:11,362 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.65 vs. limit=22.5 2023-10-12 06:40:21,374 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.04 vs. 
limit=15.0 2023-10-12 06:40:39,815 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.391e+02 1.766e+02 1.927e+02 2.174e+02 2.934e+02, threshold=3.853e+02, percent-clipped=0.0 2023-10-12 06:40:49,236 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=971721.3333333334, ans=0.125 2023-10-12 06:41:26,099 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 06:41:36,340 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=971908.0, ans=0.07 2023-10-12 06:41:43,178 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=971954.6666666666, ans=0.125 2023-10-12 06:41:57,515 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=972001.3333333334, ans=0.125 2023-10-12 06:42:05,404 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=972048.0, ans=0.2 2023-10-12 06:42:15,927 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.80 vs. limit=15.0 2023-10-12 06:42:24,435 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=972094.6666666666, ans=10.0 2023-10-12 06:42:28,120 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=972141.3333333334, ans=0.125 2023-10-12 06:42:32,522 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.474e+02 1.787e+02 1.917e+02 2.193e+02 3.191e+02, threshold=3.835e+02, percent-clipped=0.0 2023-10-12 06:42:38,479 INFO [train.py:1031] (3/4) Epoch 16, batch 3500, loss[loss=0.2016, simple_loss=0.2898, pruned_loss=0.05669, over 16404.00 frames. ], tot_loss[loss=0.1953, simple_loss=0.2853, pruned_loss=0.05259, over 27166656.32 frames. 
], batch size: 50, lr: 2.20e-03, grad_scale: 32.0 2023-10-12 06:42:51,432 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=972234.6666666666, ans=0.125 2023-10-12 06:42:56,162 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=972234.6666666666, ans=0.0 2023-10-12 06:42:56,275 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=972234.6666666666, ans=0.1 2023-10-12 06:43:03,439 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=972281.3333333334, ans=0.0 2023-10-12 06:43:11,382 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=972281.3333333334, ans=0.035 2023-10-12 06:43:13,437 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=972328.0, ans=0.0 2023-10-12 06:43:17,143 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=972328.0, ans=0.0 2023-10-12 06:43:33,441 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.80 vs. limit=12.0 2023-10-12 06:44:13,345 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=972514.6666666666, ans=0.125 2023-10-12 06:44:18,459 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=972561.3333333334, ans=0.125 2023-10-12 06:44:27,019 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=972561.3333333334, ans=0.125 2023-10-12 06:44:33,904 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.490e+02 1.768e+02 1.961e+02 2.249e+02 2.812e+02, threshold=3.922e+02, percent-clipped=0.0 2023-10-12 06:44:42,263 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=972654.6666666666, ans=0.1 2023-10-12 06:45:01,859 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=972701.3333333334, ans=0.1 2023-10-12 06:45:10,214 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=972748.0, ans=15.0 2023-10-12 06:45:46,570 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=972888.0, ans=0.125 2023-10-12 06:45:46,579 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=972888.0, ans=0.2 2023-10-12 06:46:01,243 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.46 vs. 
limit=22.5 2023-10-12 06:46:19,370 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=973028.0, ans=0.125 2023-10-12 06:46:32,498 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.442e+02 1.657e+02 1.870e+02 2.081e+02 2.778e+02, threshold=3.739e+02, percent-clipped=0.0 2023-10-12 06:46:38,194 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 06:46:49,704 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.01 vs. limit=22.5 2023-10-12 06:47:02,221 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=973214.6666666666, ans=0.125 2023-10-12 06:47:08,218 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.43 vs. limit=6.0 2023-10-12 06:47:14,078 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.43 vs. limit=22.5 2023-10-12 06:48:03,189 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.86 vs. limit=6.0 2023-10-12 06:48:17,372 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=973494.6666666666, ans=0.2 2023-10-12 06:48:20,797 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.10 vs. limit=15.0 2023-10-12 06:48:25,234 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.70 vs. limit=15.0 2023-10-12 06:48:32,901 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.411e+02 1.704e+02 1.878e+02 2.031e+02 2.594e+02, threshold=3.755e+02, percent-clipped=0.0 2023-10-12 06:48:33,193 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=973541.3333333334, ans=0.125 2023-10-12 06:48:43,091 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=973588.0, ans=0.09899494936611666 2023-10-12 06:48:49,118 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=973634.6666666666, ans=0.2 2023-10-12 06:48:49,824 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=973634.6666666666, ans=0.125 2023-10-12 06:48:51,030 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.22 vs. limit=15.0 2023-10-12 06:48:57,397 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=973634.6666666666, ans=0.025 2023-10-12 06:49:13,656 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=973728.0, ans=0.2 2023-10-12 06:49:20,921 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.41 vs. 
limit=10.0 2023-10-12 06:49:26,306 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=973774.6666666666, ans=0.1 2023-10-12 06:49:33,446 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=973821.3333333334, ans=0.0 2023-10-12 06:49:47,435 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.96 vs. limit=15.0 2023-10-12 06:49:51,415 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=973868.0, ans=0.125 2023-10-12 06:50:02,076 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=973914.6666666666, ans=0.2 2023-10-12 06:50:02,098 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=973914.6666666666, ans=0.125 2023-10-12 06:50:09,689 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=973961.3333333334, ans=0.125 2023-10-12 06:50:22,977 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.389e+02 1.700e+02 1.872e+02 2.052e+02 2.548e+02, threshold=3.744e+02, percent-clipped=0.0 2023-10-12 06:50:25,512 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=974008.0, ans=0.125 2023-10-12 06:50:37,944 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.65 vs. limit=22.5 2023-10-12 06:51:04,182 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.20 vs. 
limit=15.0 2023-10-12 06:51:18,223 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=974241.3333333334, ans=0.1 2023-10-12 06:51:23,116 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=974288.0, ans=0.0 2023-10-12 06:51:24,057 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=974288.0, ans=0.125 2023-10-12 06:51:50,826 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=974381.3333333334, ans=0.035 2023-10-12 06:51:52,994 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=974381.3333333334, ans=0.07 2023-10-12 06:52:06,530 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=974428.0, ans=0.1 2023-10-12 06:52:07,570 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=974474.6666666666, ans=0.0 2023-10-12 06:52:07,573 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=974474.6666666666, ans=0.0 2023-10-12 06:52:15,042 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.383e+02 1.715e+02 1.916e+02 2.249e+02 3.114e+02, threshold=3.832e+02, percent-clipped=0.0 2023-10-12 06:52:16,897 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=974474.6666666666, ans=0.125 2023-10-12 06:52:20,140 INFO [train.py:1031] (3/4) Epoch 16, batch 4000, loss[loss=0.1911, simple_loss=0.2773, pruned_loss=0.0525, over 16852.00 frames. ], tot_loss[loss=0.195, simple_loss=0.285, pruned_loss=0.05256, over 28414296.08 frames. ], batch size: 67, lr: 2.20e-03, grad_scale: 16.0 2023-10-12 06:52:43,302 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=974568.0, ans=0.0 2023-10-12 06:52:56,328 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=974661.3333333334, ans=0.125 2023-10-12 06:52:59,413 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=974661.3333333334, ans=0.125 2023-10-12 06:53:02,685 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.66 vs. limit=15.0 2023-10-12 06:53:15,165 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=974708.0, ans=0.2 2023-10-12 06:53:15,611 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.95 vs. 
limit=15.0 2023-10-12 06:53:28,381 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_ff2.min_abs, batch_count=974801.3333333334, ans=0.1 2023-10-12 06:53:38,636 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=974801.3333333334, ans=0.1 2023-10-12 06:53:44,624 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=974848.0, ans=0.0 2023-10-12 06:53:47,548 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=974848.0, ans=0.035 2023-10-12 06:54:02,967 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=974941.3333333334, ans=0.125 2023-10-12 06:54:09,420 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.500e+02 1.716e+02 1.855e+02 2.048e+02 2.597e+02, threshold=3.709e+02, percent-clipped=0.0 2023-10-12 06:54:10,568 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=974941.3333333334, ans=0.125 2023-10-12 06:54:17,364 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.45 vs. limit=15.0 2023-10-12 06:54:24,997 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=975034.6666666666, ans=0.125 2023-10-12 06:54:29,209 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.24 vs. limit=5.0 2023-10-12 06:54:34,181 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=975034.6666666666, ans=0.2 2023-10-12 06:54:40,100 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.61 vs. limit=22.5 2023-10-12 06:54:55,821 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=975128.0, ans=0.2 2023-10-12 06:55:05,651 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=975174.6666666666, ans=0.125 2023-10-12 06:55:11,206 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=975221.3333333334, ans=0.05 2023-10-12 06:55:11,415 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.38 vs. limit=22.5 2023-10-12 06:55:13,565 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.44 vs. limit=15.0 2023-10-12 06:55:16,337 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=975221.3333333334, ans=0.1 2023-10-12 06:55:21,729 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.87 vs. 
limit=15.0 2023-10-12 06:55:27,966 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=975268.0, ans=0.125 2023-10-12 06:56:11,051 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=975408.0, ans=0.125 2023-10-12 06:56:17,752 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.363e+02 1.717e+02 1.916e+02 2.111e+02 3.131e+02, threshold=3.831e+02, percent-clipped=0.0 2023-10-12 06:56:20,990 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=975454.6666666666, ans=0.2 2023-10-12 06:56:25,628 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.71 vs. limit=22.5 2023-10-12 06:56:31,107 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.05 vs. limit=15.0 2023-10-12 06:56:39,767 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.42 vs. limit=15.0 2023-10-12 06:56:44,500 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=975501.3333333334, ans=0.125 2023-10-12 06:56:58,970 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=975594.6666666666, ans=0.125 2023-10-12 06:57:02,268 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=975594.6666666666, ans=0.2 2023-10-12 06:57:03,164 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 06:57:35,200 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=975734.6666666666, ans=0.125 2023-10-12 06:57:37,468 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.28 vs. 
limit=15.0 2023-10-12 06:57:47,022 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 06:57:47,970 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=975781.3333333334, ans=0.0 2023-10-12 06:57:48,033 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=975781.3333333334, ans=0.0 2023-10-12 06:57:57,808 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=975828.0, ans=0.0 2023-10-12 06:57:58,710 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=975828.0, ans=0.125 2023-10-12 06:58:13,432 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.444e+02 1.677e+02 1.919e+02 2.254e+02 3.431e+02, threshold=3.837e+02, percent-clipped=0.0 2023-10-12 06:58:22,863 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=975921.3333333334, ans=0.125 2023-10-12 06:58:34,722 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=975968.0, ans=15.0 2023-10-12 06:58:49,981 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.28 vs. limit=15.0 2023-10-12 06:58:57,320 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=976061.3333333334, ans=0.0 2023-10-12 06:59:01,242 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=976108.0, ans=0.0 2023-10-12 06:59:11,352 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=976154.6666666666, ans=0.025 2023-10-12 06:59:13,125 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=976154.6666666666, ans=0.125 2023-10-12 06:59:13,617 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.00 vs. limit=22.5 2023-10-12 06:59:16,133 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=976154.6666666666, ans=0.1 2023-10-12 06:59:39,639 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.87 vs. limit=6.0 2023-10-12 07:00:06,237 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.393e+02 1.788e+02 1.970e+02 2.182e+02 3.243e+02, threshold=3.941e+02, percent-clipped=0.0 2023-10-12 07:00:06,662 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=976341.3333333334, ans=0.125 2023-10-12 07:00:17,803 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=976388.0, ans=0.125 2023-10-12 07:00:23,801 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.36 vs. 
limit=15.0 2023-10-12 07:00:45,663 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=976528.0, ans=0.5 2023-10-12 07:00:50,930 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=976528.0, ans=0.125 2023-10-12 07:00:51,154 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.67 vs. limit=6.0 2023-10-12 07:01:11,197 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.35 vs. limit=10.0 2023-10-12 07:01:14,615 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.22 vs. limit=22.5 2023-10-12 07:01:30,375 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=976668.0, ans=0.0 2023-10-12 07:01:58,690 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=976761.3333333334, ans=0.2 2023-10-12 07:02:11,927 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.478e+02 1.766e+02 1.909e+02 2.117e+02 3.307e+02, threshold=3.818e+02, percent-clipped=0.0 2023-10-12 07:02:13,558 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.16 vs. limit=15.0 2023-10-12 07:02:15,885 INFO [train.py:1031] (3/4) Epoch 16, batch 4500, loss[loss=0.1896, simple_loss=0.2834, pruned_loss=0.04788, over 16824.00 frames. ], tot_loss[loss=0.195, simple_loss=0.2852, pruned_loss=0.05236, over 29388691.46 frames. 
], batch size: 87, lr: 2.20e-03, grad_scale: 32.0 2023-10-12 07:02:16,189 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=976854.6666666666, ans=0.1 2023-10-12 07:02:17,069 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=976854.6666666666, ans=0.2 2023-10-12 07:02:34,260 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=976901.3333333334, ans=0.125 2023-10-12 07:02:43,819 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=976948.0, ans=0.2 2023-10-12 07:02:46,952 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=976994.6666666666, ans=0.0 2023-10-12 07:03:10,907 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=977088.0, ans=0.025 2023-10-12 07:03:42,202 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=977228.0, ans=0.125 2023-10-12 07:03:46,939 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=977228.0, ans=0.125 2023-10-12 07:03:54,006 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=977274.6666666666, ans=0.125 2023-10-12 07:03:59,925 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.422e+02 1.735e+02 1.870e+02 2.144e+02 2.941e+02, threshold=3.740e+02, percent-clipped=0.0 2023-10-12 07:04:01,213 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=977274.6666666666, ans=0.0 2023-10-12 07:04:08,967 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=977321.3333333334, ans=0.125 2023-10-12 07:04:34,027 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.97 vs. limit=15.0 2023-10-12 07:04:54,436 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.54 vs. limit=15.0 2023-10-12 07:04:59,567 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=977554.6666666666, ans=0.1 2023-10-12 07:05:06,218 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=977554.6666666666, ans=0.125 2023-10-12 07:05:19,169 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=977648.0, ans=0.125 2023-10-12 07:05:22,266 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.92 vs. limit=6.0 2023-10-12 07:05:32,724 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.86 vs. 
limit=15.0 2023-10-12 07:05:49,265 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.590e+02 1.765e+02 1.919e+02 2.141e+02 2.637e+02, threshold=3.839e+02, percent-clipped=0.0 2023-10-12 07:05:49,576 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=977741.3333333334, ans=0.125 2023-10-12 07:05:50,649 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=977741.3333333334, ans=0.1 2023-10-12 07:05:54,527 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=977788.0, ans=0.07 2023-10-12 07:06:09,627 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=977834.6666666666, ans=0.125 2023-10-12 07:06:19,540 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=977881.3333333334, ans=0.125 2023-10-12 07:06:20,760 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=3.61 vs. limit=15.0 2023-10-12 07:06:24,150 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=977881.3333333334, ans=0.0 2023-10-12 07:06:28,350 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=977928.0, ans=0.125 2023-10-12 07:06:34,370 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=3.73 vs. limit=15.0 2023-10-12 07:06:43,771 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=977974.6666666666, ans=0.2 2023-10-12 07:06:49,095 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=978021.3333333334, ans=0.0 2023-10-12 07:06:59,329 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=978068.0, ans=0.125 2023-10-12 07:07:02,955 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=978068.0, ans=0.125 2023-10-12 07:07:07,676 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=978114.6666666666, ans=0.2 2023-10-12 07:07:09,998 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.49 vs. limit=15.0 2023-10-12 07:07:10,820 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.00 vs. limit=15.0 2023-10-12 07:07:11,582 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=978114.6666666666, ans=0.1 2023-10-12 07:07:15,525 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.08 vs. 
limit=10.0 2023-10-12 07:07:32,663 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=978208.0, ans=0.04949747468305833 2023-10-12 07:07:36,508 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.381e+02 1.648e+02 1.804e+02 1.961e+02 2.801e+02, threshold=3.608e+02, percent-clipped=0.0 2023-10-12 07:07:52,374 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=978301.3333333334, ans=0.125 2023-10-12 07:08:09,874 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.29 vs. limit=10.0 2023-10-12 07:08:16,297 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.85 vs. limit=15.0 2023-10-12 07:08:41,672 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=978441.3333333334, ans=0.125 2023-10-12 07:08:59,737 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.61 vs. limit=6.0 2023-10-12 07:09:19,646 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=978628.0, ans=0.5 2023-10-12 07:09:26,276 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=978674.6666666666, ans=0.125 2023-10-12 07:09:33,523 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.444e+02 1.738e+02 1.935e+02 2.175e+02 2.793e+02, threshold=3.870e+02, percent-clipped=0.0 2023-10-12 07:09:55,785 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=978768.0, ans=0.5 2023-10-12 07:09:58,787 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=978768.0, ans=0.125 2023-10-12 07:09:59,835 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=978768.0, ans=0.0 2023-10-12 07:10:04,384 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.15 vs. 
limit=15.0 2023-10-12 07:10:26,545 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=978908.0, ans=0.125 2023-10-12 07:10:30,372 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=978908.0, ans=0.125 2023-10-12 07:10:41,148 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=978954.6666666666, ans=0.125 2023-10-12 07:10:54,498 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=979001.3333333334, ans=0.1 2023-10-12 07:10:59,649 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=979048.0, ans=0.1 2023-10-12 07:11:08,092 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=979048.0, ans=0.1 2023-10-12 07:11:09,030 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=979048.0, ans=0.1 2023-10-12 07:11:14,194 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.88 vs. limit=15.0 2023-10-12 07:11:14,700 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 07:11:24,406 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=979141.3333333334, ans=0.125 2023-10-12 07:11:30,047 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.467e+02 1.758e+02 1.958e+02 2.264e+02 3.264e+02, threshold=3.917e+02, percent-clipped=0.0 2023-10-12 07:11:33,514 INFO [train.py:1031] (3/4) Epoch 16, batch 5000, loss[loss=0.2004, simple_loss=0.2921, pruned_loss=0.0544, over 16830.00 frames. ], tot_loss[loss=0.195, simple_loss=0.285, pruned_loss=0.05253, over 30118963.40 frames. ], batch size: 155, lr: 2.19e-03, grad_scale: 16.0 2023-10-12 07:11:43,530 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=979234.6666666666, ans=15.0 2023-10-12 07:11:49,083 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=979234.6666666666, ans=0.125 2023-10-12 07:11:50,441 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.93 vs. 
limit=22.5 2023-10-12 07:11:59,620 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=979281.3333333334, ans=0.1 2023-10-12 07:12:07,050 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=979328.0, ans=0.2 2023-10-12 07:12:11,151 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=979328.0, ans=0.125 2023-10-12 07:12:31,471 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=979421.3333333334, ans=0.0 2023-10-12 07:12:32,548 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.07 vs. limit=15.0 2023-10-12 07:12:37,407 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=979421.3333333334, ans=0.125 2023-10-12 07:12:40,527 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.23 vs. limit=22.5 2023-10-12 07:12:41,186 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=979468.0, ans=0.125 2023-10-12 07:12:51,800 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=979514.6666666666, ans=0.2 2023-10-12 07:12:54,053 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=979514.6666666666, ans=0.125 2023-10-12 07:12:58,162 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=979514.6666666666, ans=0.125 2023-10-12 07:13:22,985 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.408e+02 1.729e+02 1.890e+02 2.064e+02 2.683e+02, threshold=3.780e+02, percent-clipped=0.0 2023-10-12 07:13:43,627 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=979701.3333333334, ans=0.0 2023-10-12 07:13:48,839 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=979701.3333333334, ans=0.0 2023-10-12 07:13:53,430 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=979748.0, ans=0.125 2023-10-12 07:14:15,605 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=979841.3333333334, ans=0.1 2023-10-12 07:14:18,511 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.69 vs. limit=15.0 2023-10-12 07:14:27,328 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.04 vs. 
limit=15.0 2023-10-12 07:14:38,220 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=979934.6666666666, ans=0.0 2023-10-12 07:15:14,849 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.421e+02 1.739e+02 1.966e+02 2.203e+02 3.388e+02, threshold=3.932e+02, percent-clipped=0.0 2023-10-12 07:15:31,776 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 07:15:48,262 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=980261.3333333334, ans=0.0 2023-10-12 07:16:22,302 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.10 vs. limit=10.0 2023-10-12 07:16:43,386 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=980448.0, ans=0.1 2023-10-12 07:16:55,533 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=980494.6666666666, ans=0.0 2023-10-12 07:16:58,429 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=980494.6666666666, ans=0.125 2023-10-12 07:16:58,439 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=980494.6666666666, ans=0.125 2023-10-12 07:17:04,875 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=980541.3333333334, ans=0.125 2023-10-12 07:17:07,617 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=980541.3333333334, ans=0.125 2023-10-12 07:17:09,354 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.481e+02 1.694e+02 1.865e+02 2.017e+02 2.692e+02, threshold=3.730e+02, percent-clipped=0.0 2023-10-12 07:17:15,823 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=980588.0, ans=0.04949747468305833 2023-10-12 07:17:17,313 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=980588.0, ans=0.2 2023-10-12 07:17:17,323 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=980588.0, ans=0.125 2023-10-12 07:17:24,600 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=980634.6666666666, ans=0.2 2023-10-12 07:17:39,118 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=980681.3333333334, ans=0.125 2023-10-12 07:17:42,424 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=980681.3333333334, ans=0.1 2023-10-12 07:17:45,280 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.36 vs. limit=15.0 2023-10-12 07:17:52,855 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=11.28 vs. 
limit=15.0 2023-10-12 07:18:46,845 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=980914.6666666666, ans=0.1 2023-10-12 07:18:53,962 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=980961.3333333334, ans=0.0 2023-10-12 07:18:56,850 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=980961.3333333334, ans=0.1 2023-10-12 07:19:09,908 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.409e+02 1.630e+02 1.826e+02 2.011e+02 3.198e+02, threshold=3.651e+02, percent-clipped=0.0 2023-10-12 07:19:14,287 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=981054.6666666666, ans=0.015 2023-10-12 07:19:47,606 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=981194.6666666666, ans=0.1 2023-10-12 07:20:06,071 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.09 vs. limit=12.0 2023-10-12 07:20:29,444 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 07:20:30,767 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=981381.3333333334, ans=0.125 2023-10-12 07:20:44,072 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=981428.0, ans=0.0 2023-10-12 07:20:48,914 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=981428.0, ans=0.125 2023-10-12 07:20:54,819 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=981474.6666666666, ans=0.125 2023-10-12 07:20:55,972 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=981474.6666666666, ans=0.125 2023-10-12 07:20:58,300 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.389e+02 1.703e+02 1.975e+02 2.252e+02 2.999e+02, threshold=3.949e+02, percent-clipped=0.0 2023-10-12 07:20:59,743 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.56 vs. limit=10.0 2023-10-12 07:21:01,171 INFO [train.py:1031] (3/4) Epoch 16, batch 5500, loss[loss=0.1849, simple_loss=0.2819, pruned_loss=0.04401, over 16946.00 frames. ], tot_loss[loss=0.1949, simple_loss=0.2849, pruned_loss=0.05242, over 30720100.49 frames. 
], batch size: 138, lr: 2.19e-03, grad_scale: 32.0 2023-10-12 07:21:06,387 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=981521.3333333334, ans=0.2 2023-10-12 07:21:14,550 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=981568.0, ans=0.125 2023-10-12 07:21:39,567 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=981661.3333333334, ans=0.2 2023-10-12 07:21:41,011 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=981661.3333333334, ans=0.125 2023-10-12 07:21:41,288 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.62 vs. limit=15.0 2023-10-12 07:22:09,466 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=981801.3333333334, ans=0.125 2023-10-12 07:22:15,449 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=981848.0, ans=0.0 2023-10-12 07:22:18,608 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=981848.0, ans=0.2 2023-10-12 07:22:23,299 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=981848.0, ans=0.125 2023-10-12 07:22:41,502 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=981941.3333333334, ans=0.125 2023-10-12 07:22:46,483 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.352e+02 1.720e+02 1.853e+02 2.009e+02 2.643e+02, threshold=3.706e+02, percent-clipped=0.0 2023-10-12 07:22:51,036 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=981988.0, ans=0.125 2023-10-12 07:22:54,502 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=981988.0, ans=0.125 2023-10-12 07:23:02,121 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=982034.6666666666, ans=0.07 2023-10-12 07:23:03,345 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=982034.6666666666, ans=0.0 2023-10-12 07:23:04,834 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=7.86 vs. 
limit=22.5 2023-10-12 07:23:14,411 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=982081.3333333334, ans=0.1 2023-10-12 07:23:19,237 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=982081.3333333334, ans=0.125 2023-10-12 07:23:30,698 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=982128.0, ans=0.2 2023-10-12 07:23:46,866 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=982221.3333333334, ans=0.125 2023-10-12 07:23:48,046 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten.whitening_limit, batch_count=982221.3333333334, ans=15.0 2023-10-12 07:24:04,132 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=982268.0, ans=0.1 2023-10-12 07:24:17,869 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 07:24:27,014 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=982361.3333333334, ans=0.0 2023-10-12 07:24:29,356 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=982361.3333333334, ans=0.2 2023-10-12 07:24:40,141 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=982408.0, ans=22.5 2023-10-12 07:24:40,466 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.747e+02 1.910e+02 2.094e+02 2.874e+02, threshold=3.819e+02, percent-clipped=0.0 2023-10-12 07:24:44,777 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=11.69 vs. limit=15.0 2023-10-12 07:24:54,136 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.74 vs. 
limit=15.0 2023-10-12 07:24:57,351 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=982501.3333333334, ans=0.125 2023-10-12 07:25:10,038 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=982548.0, ans=0.1 2023-10-12 07:25:12,124 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=982548.0, ans=0.0 2023-10-12 07:25:31,518 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.whiten.whitening_limit, batch_count=982641.3333333334, ans=12.0 2023-10-12 07:26:05,150 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=982781.3333333334, ans=0.125 2023-10-12 07:26:33,515 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.469e+02 1.738e+02 1.888e+02 2.110e+02 2.709e+02, threshold=3.775e+02, percent-clipped=0.0 2023-10-12 07:26:40,493 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=982921.3333333334, ans=0.2 2023-10-12 07:26:45,489 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.05 vs. limit=10.0 2023-10-12 07:26:54,465 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=982968.0, ans=0.0 2023-10-12 07:26:59,362 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.27 vs. limit=15.0 2023-10-12 07:27:19,886 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=983061.3333333334, ans=0.125 2023-10-12 07:27:24,914 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.26 vs. 
limit=15.0 2023-10-12 07:27:39,916 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=983154.6666666666, ans=0.0 2023-10-12 07:27:43,429 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=983201.3333333334, ans=0.125 2023-10-12 07:27:54,324 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=983201.3333333334, ans=0.1 2023-10-12 07:28:00,395 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=983248.0, ans=0.125 2023-10-12 07:28:01,351 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=983248.0, ans=0.125 2023-10-12 07:28:09,230 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=983294.6666666666, ans=0.125 2023-10-12 07:28:13,745 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=983294.6666666666, ans=0.125 2023-10-12 07:28:27,312 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.433e+02 1.645e+02 1.784e+02 2.008e+02 2.983e+02, threshold=3.567e+02, percent-clipped=0.0 2023-10-12 07:28:27,916 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 07:28:43,415 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=11.78 vs. limit=22.5 2023-10-12 07:28:54,651 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=983481.3333333334, ans=0.1 2023-10-12 07:29:01,168 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 07:29:24,403 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=983574.6666666666, ans=0.125 2023-10-12 07:29:28,830 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=983621.3333333334, ans=0.125 2023-10-12 07:29:36,980 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=983621.3333333334, ans=0.0 2023-10-12 07:29:42,754 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=983668.0, ans=0.2 2023-10-12 07:29:57,057 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=983714.6666666666, ans=0.025 2023-10-12 07:30:02,447 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=983761.3333333334, ans=0.07 2023-10-12 07:30:02,535 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=983761.3333333334, ans=0.1 2023-10-12 07:30:02,545 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=983761.3333333334, ans=0.125 2023-10-12 07:30:05,344 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=983761.3333333334, ans=0.125 2023-10-12 07:30:08,988 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=983808.0, ans=0.0 2023-10-12 07:30:09,924 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=983808.0, ans=0.2 2023-10-12 07:30:15,345 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=983808.0, ans=0.1 2023-10-12 07:30:17,496 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.455e+02 1.726e+02 1.903e+02 2.071e+02 2.794e+02, threshold=3.805e+02, percent-clipped=0.0 2023-10-12 07:30:20,398 INFO [train.py:1031] (3/4) Epoch 16, batch 6000, loss[loss=0.1958, simple_loss=0.2899, pruned_loss=0.05082, over 16930.00 frames. ], tot_loss[loss=0.1953, simple_loss=0.2852, pruned_loss=0.05268, over 31190560.30 frames. ], batch size: 93, lr: 2.19e-03, grad_scale: 32.0 2023-10-12 07:30:22,644 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.41 vs. limit=15.0 2023-10-12 07:30:26,218 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=983854.6666666666, ans=0.0 2023-10-12 07:30:41,344 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=983901.3333333334, ans=0.125 2023-10-12 07:30:45,711 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=983948.0, ans=0.0 2023-10-12 07:30:52,513 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=983948.0, ans=0.125 2023-10-12 07:30:58,106 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.04 vs. limit=22.5 2023-10-12 07:31:13,407 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=984041.3333333334, ans=0.0 2023-10-12 07:31:23,917 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.66 vs. 
limit=5.0 2023-10-12 07:31:28,542 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=984134.6666666666, ans=0.0 2023-10-12 07:31:44,706 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=984181.3333333334, ans=0.0 2023-10-12 07:32:08,762 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.412e+02 1.708e+02 1.869e+02 2.137e+02 3.448e+02, threshold=3.738e+02, percent-clipped=0.0 2023-10-12 07:32:11,046 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=984321.3333333334, ans=0.125 2023-10-12 07:32:16,095 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=984321.3333333334, ans=0.125 2023-10-12 07:32:36,082 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=984414.6666666666, ans=0.125 2023-10-12 07:32:58,242 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=984508.0, ans=0.09899494936611666 2023-10-12 07:33:00,024 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=984508.0, ans=0.95 2023-10-12 07:33:13,275 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=984554.6666666666, ans=0.0 2023-10-12 07:33:14,881 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.90 vs. limit=15.0 2023-10-12 07:33:46,321 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=984694.6666666666, ans=0.0 2023-10-12 07:33:46,335 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=984694.6666666666, ans=0.125 2023-10-12 07:33:48,171 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.64 vs. limit=15.0 2023-10-12 07:33:51,524 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=984694.6666666666, ans=0.125 2023-10-12 07:33:52,475 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=984741.3333333334, ans=0.125 2023-10-12 07:34:00,409 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.509e+02 1.786e+02 1.973e+02 2.241e+02 3.528e+02, threshold=3.946e+02, percent-clipped=0.0 2023-10-12 07:34:47,011 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=984974.6666666666, ans=0.1 2023-10-12 07:35:08,648 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.62 vs. 
limit=10.0 2023-10-12 07:35:23,208 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.whiten.whitening_limit, batch_count=985114.6666666666, ans=12.0 2023-10-12 07:35:28,874 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.19 vs. limit=15.0 2023-10-12 07:35:41,104 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=985161.3333333334, ans=0.0 2023-10-12 07:35:43,327 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.06 vs. limit=15.0 2023-10-12 07:35:46,493 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=985208.0, ans=0.0 2023-10-12 07:35:48,466 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=985208.0, ans=0.0 2023-10-12 07:35:50,281 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=985208.0, ans=0.125 2023-10-12 07:35:51,761 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.471e+02 1.684e+02 1.837e+02 2.044e+02 2.542e+02, threshold=3.674e+02, percent-clipped=0.0 2023-10-12 07:36:01,628 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=985254.6666666666, ans=0.1 2023-10-12 07:36:12,586 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=3.86 vs. limit=10.0 2023-10-12 07:37:14,172 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=985534.6666666666, ans=0.1 2023-10-12 07:37:42,997 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=985628.0, ans=0.0 2023-10-12 07:37:46,557 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=985674.6666666666, ans=0.5 2023-10-12 07:37:46,888 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.06 vs. 
limit=22.5 2023-10-12 07:37:47,472 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=985674.6666666666, ans=0.07 2023-10-12 07:37:54,198 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.369e+02 1.723e+02 1.889e+02 2.137e+02 2.895e+02, threshold=3.778e+02, percent-clipped=0.0 2023-10-12 07:37:58,905 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=985721.3333333334, ans=0.0 2023-10-12 07:38:07,126 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=985768.0, ans=0.125 2023-10-12 07:38:08,177 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=985768.0, ans=0.0 2023-10-12 07:38:09,785 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=985768.0, ans=0.125 2023-10-12 07:38:11,516 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=985768.0, ans=0.0 2023-10-12 07:38:14,084 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=985768.0, ans=0.0 2023-10-12 07:38:21,416 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=985814.6666666666, ans=0.1 2023-10-12 07:38:34,984 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=985861.3333333334, ans=0.0 2023-10-12 07:38:46,666 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=985908.0, ans=0.0 2023-10-12 07:38:51,558 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=985954.6666666666, ans=0.125 2023-10-12 07:39:01,003 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=985954.6666666666, ans=0.125 2023-10-12 07:39:11,571 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 07:39:24,789 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.69 vs. limit=15.0 2023-10-12 07:39:26,048 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=986094.6666666666, ans=0.0 2023-10-12 07:39:30,848 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.32 vs. limit=15.0 2023-10-12 07:39:47,518 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=986141.3333333334, ans=0.125 2023-10-12 07:39:48,147 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.337e+02 1.731e+02 1.884e+02 2.079e+02 3.368e+02, threshold=3.769e+02, percent-clipped=0.0 2023-10-12 07:39:50,149 INFO [train.py:1031] (3/4) Epoch 16, batch 6500, loss[loss=0.1619, simple_loss=0.2577, pruned_loss=0.03307, over 16363.00 frames. ], tot_loss[loss=0.1957, simple_loss=0.2857, pruned_loss=0.05284, over 31523864.01 frames. 
], batch size: 50, lr: 2.19e-03, grad_scale: 32.0 2023-10-12 07:39:51,716 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.38 vs. limit=12.0 2023-10-12 07:40:10,755 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.32 vs. limit=15.0 2023-10-12 07:40:19,941 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=986281.3333333334, ans=0.125 2023-10-12 07:40:32,743 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=986328.0, ans=0.2 2023-10-12 07:40:46,095 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=986374.6666666666, ans=0.125 2023-10-12 07:40:47,147 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=986374.6666666666, ans=0.1 2023-10-12 07:40:53,881 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 07:40:54,887 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=986374.6666666666, ans=0.0 2023-10-12 07:41:12,960 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=986468.0, ans=0.07 2023-10-12 07:41:17,515 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=986468.0, ans=0.0 2023-10-12 07:41:39,633 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=986561.3333333334, ans=0.125 2023-10-12 07:41:41,519 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=986608.0, ans=0.2 2023-10-12 07:41:52,017 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.424e+02 1.706e+02 1.913e+02 2.116e+02 2.987e+02, threshold=3.827e+02, percent-clipped=0.0 2023-10-12 07:41:59,350 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.21 vs. limit=15.0 2023-10-12 07:42:22,974 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=986748.0, ans=0.5 2023-10-12 07:42:33,155 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=986794.6666666666, ans=0.125 2023-10-12 07:42:43,936 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.48 vs. 
limit=15.0 2023-10-12 07:43:00,718 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=986934.6666666666, ans=0.1 2023-10-12 07:43:05,486 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=986934.6666666666, ans=0.1 2023-10-12 07:43:32,432 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 07:43:40,079 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.451e+02 1.729e+02 1.905e+02 2.192e+02 3.140e+02, threshold=3.809e+02, percent-clipped=0.0 2023-10-12 07:43:41,272 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 07:43:51,241 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.00 vs. limit=10.0 2023-10-12 07:44:01,545 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.05 vs. limit=15.0 2023-10-12 07:44:12,441 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=987214.6666666666, ans=0.125 2023-10-12 07:44:19,132 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=987261.3333333334, ans=0.2 2023-10-12 07:44:45,306 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.64 vs. limit=12.0 2023-10-12 07:45:20,252 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=987494.6666666666, ans=0.07 2023-10-12 07:45:28,097 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=987541.3333333334, ans=0.07 2023-10-12 07:45:36,046 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.368e+02 1.659e+02 1.803e+02 2.101e+02 3.161e+02, threshold=3.606e+02, percent-clipped=0.0 2023-10-12 07:45:54,910 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.19 vs. limit=15.0 2023-10-12 07:45:59,415 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=987634.6666666666, ans=0.1 2023-10-12 07:46:01,563 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.44 vs. 
limit=6.0 2023-10-12 07:46:35,085 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=987728.0, ans=0.025 2023-10-12 07:46:39,435 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=987774.6666666666, ans=0.2 2023-10-12 07:46:39,555 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=987774.6666666666, ans=0.0 2023-10-12 07:46:40,968 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=987774.6666666666, ans=0.0 2023-10-12 07:46:42,750 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.44 vs. limit=15.0 2023-10-12 07:47:04,128 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=987868.0, ans=0.125 2023-10-12 07:47:04,143 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=987868.0, ans=0.1 2023-10-12 07:47:05,208 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=987868.0, ans=0.0 2023-10-12 07:47:13,978 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=987914.6666666666, ans=0.5 2023-10-12 07:47:16,004 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=987914.6666666666, ans=0.125 2023-10-12 07:47:23,328 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 07:47:28,453 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=987961.3333333334, ans=0.125 2023-10-12 07:47:44,728 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.335e+02 1.634e+02 1.796e+02 2.026e+02 3.215e+02, threshold=3.593e+02, percent-clipped=0.0 2023-10-12 07:47:52,299 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=988054.6666666666, ans=0.0 2023-10-12 07:48:03,374 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.91 vs. limit=15.0 2023-10-12 07:48:11,070 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.78 vs. limit=15.0 2023-10-12 07:48:23,655 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=988194.6666666666, ans=0.1 2023-10-12 07:49:06,318 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=988381.3333333334, ans=0.125 2023-10-12 07:49:19,010 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=9.03 vs. 
limit=15.0 2023-10-12 07:49:22,836 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=988474.6666666666, ans=0.125 2023-10-12 07:49:31,770 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.478e+02 1.788e+02 2.022e+02 2.239e+02 2.905e+02, threshold=4.044e+02, percent-clipped=0.0 2023-10-12 07:49:32,725 INFO [train.py:1031] (3/4) Epoch 16, batch 7000, loss[loss=0.2096, simple_loss=0.2972, pruned_loss=0.061, over 16860.00 frames. ], tot_loss[loss=0.1957, simple_loss=0.286, pruned_loss=0.05273, over 31800104.27 frames. ], batch size: 110, lr: 2.18e-03, grad_scale: 16.0 2023-10-12 07:49:45,626 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=988568.0, ans=0.1 2023-10-12 07:49:57,213 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.49 vs. limit=15.0 2023-10-12 07:50:08,972 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=988614.6666666666, ans=0.0 2023-10-12 07:50:12,800 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.22 vs. limit=12.0 2023-10-12 07:50:27,904 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=988708.0, ans=0.125 2023-10-12 07:50:28,775 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=988708.0, ans=0.125 2023-10-12 07:51:16,860 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=988941.3333333334, ans=0.125 2023-10-12 07:51:25,855 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.94 vs. limit=15.0 2023-10-12 07:51:27,093 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.420e+02 1.748e+02 1.924e+02 2.105e+02 2.634e+02, threshold=3.847e+02, percent-clipped=0.0 2023-10-12 07:51:54,135 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=989081.3333333334, ans=0.1 2023-10-12 07:52:00,067 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=989081.3333333334, ans=0.125 2023-10-12 07:52:06,967 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=989128.0, ans=0.125 2023-10-12 07:52:07,092 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=989128.0, ans=0.125 2023-10-12 07:52:10,061 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=989128.0, ans=0.07 2023-10-12 07:52:14,934 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=989174.6666666666, ans=0.2 2023-10-12 07:52:19,766 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=14.83 vs. 
limit=22.5 2023-10-12 07:52:29,479 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=989221.3333333334, ans=0.125 2023-10-12 07:52:53,487 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=989314.6666666666, ans=0.0 2023-10-12 07:52:55,961 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.30 vs. limit=12.0 2023-10-12 07:53:04,031 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=3.97 vs. limit=15.0 2023-10-12 07:53:19,084 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.479e+02 1.758e+02 1.988e+02 2.188e+02 3.079e+02, threshold=3.977e+02, percent-clipped=0.0 2023-10-12 07:53:33,928 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=989454.6666666666, ans=0.125 2023-10-12 07:53:53,672 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.53 vs. limit=15.0 2023-10-12 07:53:57,810 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=989548.0, ans=0.0 2023-10-12 07:54:01,981 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=989548.0, ans=0.125 2023-10-12 07:54:16,069 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=989594.6666666666, ans=0.0 2023-10-12 07:54:20,465 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.74 vs. limit=6.0 2023-10-12 07:54:36,723 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.79 vs. limit=15.0 2023-10-12 07:54:49,390 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.93 vs. limit=15.0 2023-10-12 07:55:00,091 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=989781.3333333334, ans=0.0 2023-10-12 07:55:06,569 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=989828.0, ans=0.125 2023-10-12 07:55:18,054 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=989874.6666666666, ans=0.125 2023-10-12 07:55:29,050 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.352e+02 1.710e+02 1.886e+02 2.046e+02 3.321e+02, threshold=3.771e+02, percent-clipped=0.0 2023-10-12 07:55:53,054 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=989968.0, ans=0.2 2023-10-12 07:55:53,308 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.60 vs. 
limit=10.0 2023-10-12 07:56:17,806 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=990061.3333333334, ans=0.2 2023-10-12 07:56:22,307 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=990108.0, ans=0.125 2023-10-12 07:56:50,004 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=990201.3333333334, ans=0.125 2023-10-12 07:57:09,698 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=990294.6666666666, ans=0.0 2023-10-12 07:57:09,957 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=5.06 vs. limit=12.0 2023-10-12 07:57:15,900 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=3.91 vs. limit=12.0 2023-10-12 07:57:26,228 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=990341.3333333334, ans=0.0 2023-10-12 07:57:28,760 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.328e+02 1.684e+02 1.847e+02 2.047e+02 3.622e+02, threshold=3.694e+02, percent-clipped=0.0 2023-10-12 07:57:47,813 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=990434.6666666666, ans=0.0 2023-10-12 07:58:06,571 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.43 vs. limit=6.0 2023-10-12 07:58:16,446 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.50 vs. limit=15.0 2023-10-12 07:58:25,747 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=990621.3333333334, ans=0.125 2023-10-12 07:58:26,573 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=990621.3333333334, ans=0.125 2023-10-12 07:58:33,718 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=990621.3333333334, ans=0.0 2023-10-12 07:58:37,754 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.01 vs. limit=15.0 2023-10-12 07:59:20,161 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.486e+02 1.752e+02 1.965e+02 2.156e+02 2.768e+02, threshold=3.930e+02, percent-clipped=0.0 2023-10-12 07:59:20,193 INFO [train.py:1031] (3/4) Epoch 16, batch 7500, loss[loss=0.2009, simple_loss=0.2941, pruned_loss=0.0539, over 16845.00 frames. ], tot_loss[loss=0.1958, simple_loss=0.2859, pruned_loss=0.05286, over 32002920.03 frames. 
], batch size: 67, lr: 2.18e-03, grad_scale: 16.0 2023-10-12 07:59:21,591 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=990854.6666666666, ans=0.0 2023-10-12 07:59:41,951 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=990948.0, ans=0.125 2023-10-12 08:00:06,009 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.78 vs. limit=15.0 2023-10-12 08:00:08,089 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=991041.3333333334, ans=0.1 2023-10-12 08:00:17,421 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=991088.0, ans=0.015 2023-10-12 08:00:23,859 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=991088.0, ans=0.125 2023-10-12 08:00:27,221 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=991088.0, ans=0.125 2023-10-12 08:00:32,412 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=991134.6666666666, ans=0.5 2023-10-12 08:00:41,935 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=991181.3333333334, ans=0.125 2023-10-12 08:00:52,456 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=991228.0, ans=0.125 2023-10-12 08:00:55,453 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=991228.0, ans=0.125 2023-10-12 08:01:12,789 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.459e+02 1.761e+02 1.959e+02 2.305e+02 3.215e+02, threshold=3.918e+02, percent-clipped=0.0 2023-10-12 08:01:15,003 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=991321.3333333334, ans=0.125 2023-10-12 08:01:22,027 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=991368.0, ans=0.025 2023-10-12 08:01:25,583 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=991368.0, ans=0.0 2023-10-12 08:01:48,776 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=991414.6666666666, ans=0.125 2023-10-12 08:01:48,916 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=991414.6666666666, ans=0.1 2023-10-12 08:01:50,969 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.68 vs. 
limit=12.0 2023-10-12 08:01:59,623 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=991461.3333333334, ans=0.125 2023-10-12 08:02:01,202 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=991461.3333333334, ans=0.125 2023-10-12 08:02:07,274 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=991508.0, ans=0.2 2023-10-12 08:02:17,234 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=991508.0, ans=0.0 2023-10-12 08:02:23,723 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=991554.6666666666, ans=0.125 2023-10-12 08:02:30,450 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.08 vs. limit=15.0 2023-10-12 08:02:34,369 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=991601.3333333334, ans=0.07 2023-10-12 08:03:16,754 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.456e+02 1.726e+02 1.968e+02 2.205e+02 3.050e+02, threshold=3.935e+02, percent-clipped=0.0 2023-10-12 08:04:28,466 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.52 vs. limit=10.0 2023-10-12 08:04:39,512 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=992114.6666666666, ans=0.125 2023-10-12 08:04:39,522 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=992114.6666666666, ans=0.125 2023-10-12 08:05:02,158 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=992208.0, ans=0.0 2023-10-12 08:05:05,241 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=992254.6666666666, ans=0.125 2023-10-12 08:05:05,331 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=992254.6666666666, ans=0.125 2023-10-12 08:05:07,022 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.439e+02 1.763e+02 1.973e+02 2.119e+02 2.807e+02, threshold=3.947e+02, percent-clipped=0.0 2023-10-12 08:05:11,033 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=992254.6666666666, ans=0.125 2023-10-12 08:05:19,241 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=992301.3333333334, ans=0.2 2023-10-12 08:05:30,365 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=992301.3333333334, ans=0.125 2023-10-12 08:05:46,951 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=992394.6666666666, ans=0.1 2023-10-12 08:06:25,868 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=992534.6666666666, ans=0.125 2023-10-12 08:06:49,737 INFO [scaling.py:979] (3/4) Whitening: 
name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=5.60 vs. limit=15.0 2023-10-12 08:06:53,241 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=992628.0, ans=0.2 2023-10-12 08:07:02,115 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=992674.6666666666, ans=0.1 2023-10-12 08:07:07,570 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.456e+02 1.754e+02 1.947e+02 2.096e+02 3.192e+02, threshold=3.895e+02, percent-clipped=0.0 2023-10-12 08:07:24,624 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=992768.0, ans=0.125 2023-10-12 08:07:46,927 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.37 vs. limit=15.0 2023-10-12 08:07:57,481 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=992908.0, ans=0.1 2023-10-12 08:08:01,793 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=992908.0, ans=0.07 2023-10-12 08:08:02,752 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=992908.0, ans=0.1 2023-10-12 08:08:14,703 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=992954.6666666666, ans=0.0 2023-10-12 08:08:23,847 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=993001.3333333334, ans=0.1 2023-10-12 08:08:33,638 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff2.min_abs, batch_count=993048.0, ans=0.1 2023-10-12 08:08:37,893 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=993048.0, ans=0.0 2023-10-12 08:08:38,025 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.59 vs. limit=10.0 2023-10-12 08:08:44,435 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=993094.6666666666, ans=0.125 2023-10-12 08:08:44,453 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=993094.6666666666, ans=0.125 2023-10-12 08:08:46,178 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.83 vs. limit=22.5 2023-10-12 08:09:04,663 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.15 vs. limit=10.0 2023-10-12 08:09:06,025 INFO [train.py:1031] (3/4) Epoch 16, batch 8000, loss[loss=0.1814, simple_loss=0.2746, pruned_loss=0.04417, over 16567.00 frames. ], tot_loss[loss=0.195, simple_loss=0.2853, pruned_loss=0.05232, over 32173786.06 frames. 
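The loss entries above print three components per record, and the printed numbers are mutually consistent with a fixed 0.5 weight on simple_loss; that weight is inferred from the printout itself, not read out of train.py, and the snippet below is only a check of that arithmetic. Note also that the tot_loss[...] figures are reported over a much larger, slowly growing frame count than the single batch, i.e. they behave like a running frame-weighted average rather than a per-batch value.

    # Check, using only values printed in this log, that
    #     loss ~= 0.5 * simple_loss + pruned_loss
    # The 0.5 weight is inferred from these numbers, not from train.py.
    entries = [
        (0.1814, 0.2746, 0.04417),  # Epoch 16, batch 8000: loss[...]
        (0.1950, 0.2853, 0.05232),  # Epoch 16, batch 8000: tot_loss[...]
    ]
    for loss, simple_loss, pruned_loss in entries:
        reconstructed = 0.5 * simple_loss + pruned_loss
        assert abs(reconstructed - loss) < 5e-4, (loss, reconstructed)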
], batch size: 61, lr: 2.18e-03, grad_scale: 32.0 2023-10-12 08:09:07,046 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.338e+02 1.635e+02 1.813e+02 1.989e+02 2.922e+02, threshold=3.626e+02, percent-clipped=0.0 2023-10-12 08:09:07,440 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=993188.0, ans=0.2 2023-10-12 08:09:09,731 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=993188.0, ans=0.0 2023-10-12 08:09:13,427 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=993188.0, ans=0.125 2023-10-12 08:09:24,856 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=993234.6666666666, ans=10.0 2023-10-12 08:09:40,717 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=993328.0, ans=0.125 2023-10-12 08:10:31,970 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=993514.6666666666, ans=0.125 2023-10-12 08:10:49,630 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.19 vs. limit=12.0 2023-10-12 08:10:57,070 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.439e+02 1.672e+02 1.839e+02 2.250e+02 3.548e+02, threshold=3.679e+02, percent-clipped=0.0 2023-10-12 08:11:00,435 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=993654.6666666666, ans=0.0 2023-10-12 08:11:20,907 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.18 vs. 
limit=15.0 2023-10-12 08:11:22,782 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=993748.0, ans=0.0 2023-10-12 08:11:32,835 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=993794.6666666666, ans=0.125 2023-10-12 08:12:07,111 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=993888.0, ans=0.0 2023-10-12 08:12:51,893 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=994028.0, ans=0.0 2023-10-12 08:12:56,160 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=994074.6666666666, ans=0.0 2023-10-12 08:13:10,478 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.462e+02 1.725e+02 1.880e+02 2.008e+02 2.715e+02, threshold=3.759e+02, percent-clipped=0.0 2023-10-12 08:13:19,207 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=994168.0, ans=0.0 2023-10-12 08:13:28,530 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=994214.6666666666, ans=0.0 2023-10-12 08:13:28,670 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=994214.6666666666, ans=0.1 2023-10-12 08:13:39,760 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=994214.6666666666, ans=0.0 2023-10-12 08:13:41,965 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=994261.3333333334, ans=0.125 2023-10-12 08:13:43,979 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=994261.3333333334, ans=0.1 2023-10-12 08:14:10,745 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=994354.6666666666, ans=0.125 2023-10-12 08:14:26,523 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=994448.0, ans=0.0 2023-10-12 08:14:48,985 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=994541.3333333334, ans=0.125 2023-10-12 08:15:00,882 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.475e+02 1.759e+02 1.982e+02 2.179e+02 2.666e+02, threshold=3.965e+02, percent-clipped=0.0 2023-10-12 08:15:13,715 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=994634.6666666666, ans=0.0 2023-10-12 08:15:24,893 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.27 vs. 
limit=6.0 2023-10-12 08:15:28,316 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 08:15:36,439 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=994728.0, ans=0.0 2023-10-12 08:15:38,092 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=994728.0, ans=0.2 2023-10-12 08:15:39,185 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.32 vs. limit=12.0 2023-10-12 08:16:06,513 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=994868.0, ans=0.1 2023-10-12 08:16:08,738 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.40 vs. limit=15.0 2023-10-12 08:16:22,283 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=994914.6666666666, ans=0.0 2023-10-12 08:16:30,828 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=994961.3333333334, ans=0.0 2023-10-12 08:16:36,794 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=995008.0, ans=0.125 2023-10-12 08:16:48,542 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.501e+02 1.720e+02 1.857e+02 2.097e+02 3.011e+02, threshold=3.715e+02, percent-clipped=0.0 2023-10-12 08:17:33,110 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.51 vs. limit=15.0 2023-10-12 08:17:38,442 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=995241.3333333334, ans=0.125 2023-10-12 08:17:38,486 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=995241.3333333334, ans=0.0 2023-10-12 08:17:55,018 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=995288.0, ans=0.125 2023-10-12 08:18:13,553 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=995381.3333333334, ans=0.0 2023-10-12 08:18:23,682 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=995428.0, ans=0.2 2023-10-12 08:18:49,040 INFO [train.py:1031] (3/4) Epoch 16, batch 8500, loss[loss=0.183, simple_loss=0.2766, pruned_loss=0.04472, over 16809.00 frames. ], tot_loss[loss=0.1951, simple_loss=0.2857, pruned_loss=0.05224, over 32314423.44 frames. 
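The optim.py:471 entries in this stretch make the clipping rule visible: in each one the printed threshold is Clipping_scale (2.0) times the printed median grad norm (for example 2.0 * 2.048e+02 = 4.096e+02 in the batch-8500 block below), so the threshold tracks a running median of recent gradient norms, and percent-clipped reports how often that threshold was exceeded. A minimal sketch of such a rule follows; the window length and the exact rescaling formula are assumptions for illustration, not icefall's optimizer code.

    import torch

    def median_based_clip(history, grad_norm, clipping_scale=2.0):
        # Sketch only: threshold = clipping_scale * median of recent norms,
        # mirroring the "grad-norm quartiles ... threshold=..." lines above.
        history.append(grad_norm)
        recent = torch.tensor(history[-128:])  # window length is an assumption
        quartiles = [torch.quantile(recent, q).item()
                     for q in (0.0, 0.25, 0.5, 0.75, 1.0)]
        threshold = clipping_scale * quartiles[2]  # 2.0 * median
        scale = min(1.0, threshold / max(grad_norm, 1e-20))  # shrink if over
        return scale, quartiles, threshold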
], batch size: 175, lr: 2.18e-03, grad_scale: 32.0 2023-10-12 08:18:50,852 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.407e+02 1.769e+02 1.910e+02 2.169e+02 3.720e+02, threshold=3.821e+02, percent-clipped=1.0 2023-10-12 08:18:57,368 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=995521.3333333334, ans=0.0 2023-10-12 08:19:03,241 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=995568.0, ans=0.2 2023-10-12 08:19:19,943 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=995614.6666666666, ans=0.125 2023-10-12 08:19:43,951 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=995708.0, ans=0.1 2023-10-12 08:19:54,549 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=995754.6666666666, ans=0.95 2023-10-12 08:20:09,126 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=995848.0, ans=0.0 2023-10-12 08:20:19,788 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=995848.0, ans=0.1 2023-10-12 08:20:33,257 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=995941.3333333334, ans=0.125 2023-10-12 08:20:53,233 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.425e+02 1.872e+02 2.048e+02 2.400e+02 3.218e+02, threshold=4.096e+02, percent-clipped=0.0 2023-10-12 08:21:04,783 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=996034.6666666666, ans=0.125 2023-10-12 08:21:11,544 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.21 vs. 
limit=15.0 2023-10-12 08:21:19,759 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=996081.3333333334, ans=0.0 2023-10-12 08:21:25,269 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=996128.0, ans=0.0 2023-10-12 08:22:04,695 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=996268.0, ans=0.0 2023-10-12 08:22:14,703 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=996314.6666666666, ans=0.0 2023-10-12 08:22:14,728 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=996314.6666666666, ans=0.0 2023-10-12 08:22:19,476 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=996314.6666666666, ans=0.125 2023-10-12 08:22:20,324 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=996314.6666666666, ans=0.0 2023-10-12 08:22:55,093 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.281e+02 1.626e+02 1.785e+02 1.960e+02 2.853e+02, threshold=3.570e+02, percent-clipped=0.0 2023-10-12 08:23:00,995 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=996454.6666666666, ans=0.07 2023-10-12 08:23:11,068 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=996501.3333333334, ans=0.0 2023-10-12 08:23:12,992 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=996548.0, ans=0.0 2023-10-12 08:23:38,483 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=996641.3333333334, ans=0.125 2023-10-12 08:23:48,896 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=996688.0, ans=0.0 2023-10-12 08:23:51,116 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.70 vs. limit=15.0 2023-10-12 08:24:16,528 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.04 vs. limit=15.0 2023-10-12 08:24:19,491 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=996781.3333333334, ans=0.125 2023-10-12 08:24:33,904 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=996828.0, ans=0.1 2023-10-12 08:24:36,943 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=996828.0, ans=0.1 2023-10-12 08:24:39,503 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.28 vs. limit=10.0 2023-10-12 08:24:40,398 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.88 vs. 
limit=15.0 2023-10-12 08:24:42,577 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=996874.6666666666, ans=0.125 2023-10-12 08:24:51,649 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.63 vs. limit=15.0 2023-10-12 08:24:52,870 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.338e+02 1.646e+02 1.818e+02 2.058e+02 3.166e+02, threshold=3.635e+02, percent-clipped=0.0 2023-10-12 08:25:09,778 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=996968.0, ans=0.125 2023-10-12 08:25:13,431 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=997014.6666666666, ans=0.5 2023-10-12 08:25:13,467 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=997014.6666666666, ans=0.0 2023-10-12 08:25:34,642 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=997108.0, ans=0.2 2023-10-12 08:25:43,838 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=997154.6666666666, ans=0.0 2023-10-12 08:25:53,751 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=997201.3333333334, ans=0.0 2023-10-12 08:25:58,264 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=997201.3333333334, ans=0.125 2023-10-12 08:26:06,098 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.37 vs. limit=10.0 2023-10-12 08:26:07,904 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.14 vs. limit=6.0 2023-10-12 08:26:15,393 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.88 vs. limit=15.0 2023-10-12 08:26:32,300 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=997341.3333333334, ans=0.0 2023-10-12 08:26:37,063 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=997341.3333333334, ans=0.125 2023-10-12 08:26:37,904 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=997388.0, ans=0.125 2023-10-12 08:26:42,294 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.456e+02 1.732e+02 1.904e+02 2.087e+02 2.879e+02, threshold=3.807e+02, percent-clipped=0.0 2023-10-12 08:26:43,419 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=997388.0, ans=0.125 2023-10-12 08:26:56,028 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=997434.6666666666, ans=0.125 2023-10-12 08:27:00,584 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.75 vs. 
limit=22.5 2023-10-12 08:27:00,600 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.41 vs. limit=15.0 2023-10-12 08:27:01,471 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=997481.3333333334, ans=0.125 2023-10-12 08:27:17,400 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=997528.0, ans=0.0 2023-10-12 08:27:18,757 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=997528.0, ans=0.1 2023-10-12 08:27:25,171 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 08:27:32,569 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=997621.3333333334, ans=10.0 2023-10-12 08:27:57,195 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=997714.6666666666, ans=0.125 2023-10-12 08:28:11,408 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=997761.3333333334, ans=0.2 2023-10-12 08:28:13,334 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=997761.3333333334, ans=0.125 2023-10-12 08:28:18,574 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=997808.0, ans=0.2 2023-10-12 08:28:25,576 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=997808.0, ans=0.125 2023-10-12 08:28:29,439 INFO [train.py:1031] (3/4) Epoch 16, batch 9000, loss[loss=0.177, simple_loss=0.282, pruned_loss=0.03606, over 16817.00 frames. ], tot_loss[loss=0.1944, simple_loss=0.285, pruned_loss=0.05191, over 32445885.67 frames. 
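Nearly every scaling.py:199 line above is the same kind of record: a named ScheduledFloat, the global batch_count, and its current value (ans). At batch_count around 9.9e5 the values sit at what look like schedule endpoints: the various *_skip_rate entries print ans=0.0, and encoder_embed.convnext.layerdrop_rate prints ans=0.015. Below is a minimal sketch of a batch-count-keyed, piecewise-linear schedule; the knot positions are invented for illustration, and the real ScheduledFloat in icefall's scaling.py carries more machinery than this.

    class ScheduledFloatSketch:
        # Sketch of a schedule keyed on batch_count: piecewise-linear
        # between (count, value) knots, clamped outside the first/last knot.
        def __init__(self, *knots):
            self.knots = sorted(knots)

        def value(self, batch_count):
            (c0, v0), (cn, vn) = self.knots[0], self.knots[-1]
            if batch_count <= c0:
                return v0
            if batch_count >= cn:
                return vn
            for (ca, va), (cb, vb) in zip(self.knots, self.knots[1:]):
                if ca <= batch_count <= cb:
                    t = (batch_count - ca) / (cb - ca)
                    return va + t * (vb - va)

    # Hypothetical knots: by batch_count ~ 9.9e5 the value has long since
    # reached its endpoint, matching the layerdrop_rate entries above.
    layerdrop = ScheduledFloatSketch((0.0, 0.1), (20000.0, 0.015))
    assert layerdrop.value(991088.0) == 0.015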
], batch size: 98, lr: 2.17e-03, grad_scale: 16.0 2023-10-12 08:28:34,006 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.464e+02 1.673e+02 1.854e+02 1.998e+02 3.055e+02, threshold=3.707e+02, percent-clipped=0.0 2023-10-12 08:28:35,375 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=997854.6666666666, ans=0.125 2023-10-12 08:28:57,572 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=997948.0, ans=0.0 2023-10-12 08:29:02,250 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=997994.6666666666, ans=0.2 2023-10-12 08:29:09,310 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=997994.6666666666, ans=0.0 2023-10-12 08:29:11,197 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=997994.6666666666, ans=0.0 2023-10-12 08:29:13,187 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=998041.3333333334, ans=0.1 2023-10-12 08:29:37,098 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=998134.6666666666, ans=0.125 2023-10-12 08:29:40,582 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=998134.6666666666, ans=0.125 2023-10-12 08:29:40,988 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.52 vs. 
limit=22.5 2023-10-12 08:29:52,229 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=998181.3333333334, ans=0.1 2023-10-12 08:30:19,679 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.386e+02 1.747e+02 1.882e+02 2.041e+02 2.714e+02, threshold=3.764e+02, percent-clipped=0.0 2023-10-12 08:30:19,960 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=998321.3333333334, ans=0.0 2023-10-12 08:30:20,025 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=998321.3333333334, ans=0.125 2023-10-12 08:30:30,425 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=998368.0, ans=0.125 2023-10-12 08:30:34,903 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=998368.0, ans=0.1 2023-10-12 08:30:41,086 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=998414.6666666666, ans=0.2 2023-10-12 08:30:46,751 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=998461.3333333334, ans=0.125 2023-10-12 08:30:50,703 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=998461.3333333334, ans=0.0 2023-10-12 08:31:00,250 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=998508.0, ans=0.2 2023-10-12 08:31:02,161 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=998508.0, ans=0.125 2023-10-12 08:31:07,537 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=998554.6666666666, ans=0.125 2023-10-12 08:31:09,477 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=998554.6666666666, ans=0.125 2023-10-12 08:31:11,397 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=998554.6666666666, ans=0.1 2023-10-12 08:31:26,226 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 08:31:26,234 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=998601.3333333334, ans=0.125 2023-10-12 08:31:31,929 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=998648.0, ans=0.125 2023-10-12 08:31:36,568 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=998648.0, ans=0.0 2023-10-12 08:31:36,894 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.54 vs. 
limit=6.0 2023-10-12 08:31:42,239 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=998694.6666666666, ans=0.015 2023-10-12 08:32:03,655 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.448e+02 1.805e+02 1.932e+02 2.120e+02 3.278e+02, threshold=3.865e+02, percent-clipped=0.0 2023-10-12 08:32:04,910 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=998788.0, ans=0.0 2023-10-12 08:32:13,328 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=998834.6666666666, ans=0.1 2023-10-12 08:32:28,398 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.85 vs. limit=15.0 2023-10-12 08:32:54,023 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=999021.3333333334, ans=0.125 2023-10-12 08:32:54,976 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=999021.3333333334, ans=0.125 2023-10-12 08:32:55,149 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=999021.3333333334, ans=0.125 2023-10-12 08:33:02,402 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=999021.3333333334, ans=0.0 2023-10-12 08:33:30,106 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=999161.3333333334, ans=0.2 2023-10-12 08:33:49,047 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.39 vs. 
limit=15.0 2023-10-12 08:33:52,121 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.392e+02 1.758e+02 1.917e+02 2.137e+02 3.201e+02, threshold=3.834e+02, percent-clipped=0.0 2023-10-12 08:34:09,410 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 08:35:05,002 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=999534.6666666666, ans=0.0 2023-10-12 08:35:18,240 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=7.193e-02 2023-10-12 08:35:26,142 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=999581.3333333334, ans=0.125 2023-10-12 08:35:55,258 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=999721.3333333334, ans=0.0 2023-10-12 08:35:55,881 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.398e+02 1.760e+02 1.930e+02 2.138e+02 2.816e+02, threshold=3.860e+02, percent-clipped=0.0 2023-10-12 08:36:08,772 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=999768.0, ans=0.125 2023-10-12 08:36:09,550 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=999768.0, ans=0.5 2023-10-12 08:36:23,584 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.47 vs. limit=15.0 2023-10-12 08:36:32,969 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.24 vs. limit=15.0 2023-10-12 08:36:53,194 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.38 vs. limit=22.5 2023-10-12 08:36:58,497 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=999954.6666666666, ans=0.125 2023-10-12 08:37:00,602 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1000001.3333333334, ans=0.125 2023-10-12 08:37:05,979 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1000001.3333333334, ans=0.0 2023-10-12 08:37:22,736 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1000048.0, ans=0.1 2023-10-12 08:37:29,333 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1000094.6666666666, ans=0.04949747468305833 2023-10-12 08:37:49,590 INFO [train.py:1031] (3/4) Epoch 16, batch 9500, loss[loss=0.1905, simple_loss=0.2896, pruned_loss=0.04567, over 16852.00 frames. ], tot_loss[loss=0.195, simple_loss=0.2856, pruned_loss=0.05221, over 32515895.53 frames. ], batch size: 72, lr: 2.17e-03, grad_scale: 32.0 2023-10-12 08:37:51,157 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.91 vs. 
limit=15.0 2023-10-12 08:37:53,525 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.449e+02 1.744e+02 1.912e+02 2.113e+02 3.274e+02, threshold=3.823e+02, percent-clipped=0.0 2023-10-12 08:38:08,020 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1000234.6666666666, ans=0.125 2023-10-12 08:38:08,051 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=1000234.6666666666, ans=0.07 2023-10-12 08:38:48,159 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.41 vs. limit=22.5 2023-10-12 08:39:19,288 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1000561.3333333334, ans=0.125 2023-10-12 08:39:30,510 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.89 vs. limit=22.5 2023-10-12 08:39:35,331 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.34 vs. limit=15.0 2023-10-12 08:39:39,817 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1000654.6666666666, ans=0.125 2023-10-12 08:39:44,455 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.509e+02 1.782e+02 1.948e+02 2.242e+02 3.446e+02, threshold=3.897e+02, percent-clipped=0.0 2023-10-12 08:39:48,691 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1000654.6666666666, ans=0.2 2023-10-12 08:39:51,044 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.69 vs. limit=22.5 2023-10-12 08:40:02,420 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1000748.0, ans=0.1 2023-10-12 08:40:14,219 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1000794.6666666666, ans=0.1 2023-10-12 08:40:14,389 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1000794.6666666666, ans=0.1 2023-10-12 08:40:17,143 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1000794.6666666666, ans=0.2 2023-10-12 08:40:22,689 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1000794.6666666666, ans=0.1 2023-10-12 08:40:23,595 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1000841.3333333334, ans=0.1 2023-10-12 08:40:23,960 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.51 vs. limit=15.0 2023-10-12 08:40:36,787 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=3.99 vs. 
limit=10.0 2023-10-12 08:40:55,610 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1000934.6666666666, ans=0.2 2023-10-12 08:41:07,812 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1000981.3333333334, ans=0.125 2023-10-12 08:41:09,287 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1000981.3333333334, ans=0.2 2023-10-12 08:41:10,259 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1001028.0, ans=0.1 2023-10-12 08:41:17,732 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1001028.0, ans=0.2 2023-10-12 08:41:24,504 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1001074.6666666666, ans=0.0 2023-10-12 08:41:30,066 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1001074.6666666666, ans=0.125 2023-10-12 08:41:32,873 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.57 vs. limit=15.0 2023-10-12 08:41:38,620 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.413e+02 1.707e+02 1.840e+02 2.021e+02 2.708e+02, threshold=3.681e+02, percent-clipped=0.0 2023-10-12 08:41:43,461 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1001168.0, ans=0.5 2023-10-12 08:42:10,968 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1001261.3333333334, ans=0.125 2023-10-12 08:42:17,403 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1001308.0, ans=0.125 2023-10-12 08:42:26,374 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1001354.6666666666, ans=0.125 2023-10-12 08:42:40,048 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.10 vs. limit=15.0 2023-10-12 08:42:46,885 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1001401.3333333334, ans=0.0 2023-10-12 08:42:51,443 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1001448.0, ans=0.125 2023-10-12 08:42:58,098 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.73 vs. 
limit=12.0 2023-10-12 08:43:00,336 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1001494.6666666666, ans=0.125 2023-10-12 08:43:06,293 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1001494.6666666666, ans=0.125 2023-10-12 08:43:16,089 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1001541.3333333334, ans=0.125 2023-10-12 08:43:21,818 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1001541.3333333334, ans=0.0 2023-10-12 08:43:30,382 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.416e+02 1.736e+02 1.891e+02 2.112e+02 3.037e+02, threshold=3.782e+02, percent-clipped=0.0 2023-10-12 08:43:50,086 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1001681.3333333334, ans=0.125 2023-10-12 08:43:59,187 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1001728.0, ans=0.0 2023-10-12 08:44:16,547 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=5.03 vs. limit=15.0 2023-10-12 08:44:33,630 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1001868.0, ans=0.125 2023-10-12 08:44:33,734 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1001868.0, ans=0.125 2023-10-12 08:44:41,614 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1001914.6666666666, ans=0.025 2023-10-12 08:44:42,398 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 08:44:46,895 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.57 vs. 
limit=15.0 2023-10-12 08:44:54,625 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1001961.3333333334, ans=0.1 2023-10-12 08:44:56,319 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1001961.3333333334, ans=0.0 2023-10-12 08:45:10,786 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1002008.0, ans=0.0 2023-10-12 08:45:17,598 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1002054.6666666666, ans=0.1 2023-10-12 08:45:22,333 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.460e+02 1.640e+02 1.822e+02 1.991e+02 3.118e+02, threshold=3.645e+02, percent-clipped=0.0 2023-10-12 08:45:34,286 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1002101.3333333334, ans=0.125 2023-10-12 08:45:48,772 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1002194.6666666666, ans=0.07 2023-10-12 08:46:12,111 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1002288.0, ans=0.0 2023-10-12 08:46:21,120 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1002334.6666666666, ans=0.0 2023-10-12 08:46:33,749 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1002381.3333333334, ans=0.2 2023-10-12 08:46:43,035 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.40 vs. limit=15.0 2023-10-12 08:46:59,966 INFO [train.py:1031] (3/4) Epoch 16, batch 10000, loss[loss=0.2073, simple_loss=0.29, pruned_loss=0.06234, over 16834.00 frames. ], tot_loss[loss=0.1944, simple_loss=0.2849, pruned_loss=0.05194, over 32584752.69 frames. ], batch size: 175, lr: 2.17e-03, grad_scale: 16.0 2023-10-12 08:47:01,159 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1002521.3333333334, ans=0.125 2023-10-12 08:47:06,675 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.442e+02 1.776e+02 1.934e+02 2.130e+02 3.412e+02, threshold=3.869e+02, percent-clipped=0.0 2023-10-12 08:47:07,232 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.15 vs. 
limit=12.0 2023-10-12 08:47:19,647 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1002614.6666666666, ans=0.125 2023-10-12 08:47:32,796 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1002661.3333333334, ans=0.0 2023-10-12 08:47:35,595 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1002661.3333333334, ans=0.125 2023-10-12 08:47:49,600 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1002708.0, ans=0.125 2023-10-12 08:47:50,864 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.30 vs. limit=15.0 2023-10-12 08:47:57,357 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1002754.6666666666, ans=0.1 2023-10-12 08:48:39,130 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1002941.3333333334, ans=0.2 2023-10-12 08:48:54,485 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1002988.0, ans=0.07 2023-10-12 08:48:55,477 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1002988.0, ans=0.125 2023-10-12 08:48:58,755 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.484e+02 1.771e+02 1.929e+02 2.157e+02 3.084e+02, threshold=3.857e+02, percent-clipped=0.0 2023-10-12 08:49:06,561 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1003034.6666666666, ans=0.1 2023-10-12 08:49:06,668 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=1003034.6666666666, ans=0.025 2023-10-12 08:49:18,337 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.52 vs. limit=15.0 2023-10-12 08:49:18,993 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1003081.3333333334, ans=0.0 2023-10-12 08:49:30,662 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.41 vs. 
limit=15.0 2023-10-12 08:49:40,830 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1003174.6666666666, ans=0.125 2023-10-12 08:49:40,839 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1003174.6666666666, ans=0.125 2023-10-12 08:49:57,947 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1003268.0, ans=0.1 2023-10-12 08:50:05,947 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1003268.0, ans=0.05 2023-10-12 08:50:10,007 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1003314.6666666666, ans=0.125 2023-10-12 08:50:21,767 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1003361.3333333334, ans=0.1 2023-10-12 08:50:30,833 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.03 vs. limit=12.0 2023-10-12 08:50:50,867 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.420e+02 1.707e+02 1.827e+02 2.030e+02 3.084e+02, threshold=3.654e+02, percent-clipped=0.0 2023-10-12 08:51:00,044 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.62 vs. limit=22.5 2023-10-12 08:51:06,825 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.48 vs. limit=15.0 2023-10-12 08:51:44,381 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1003688.0, ans=0.125 2023-10-12 08:51:47,363 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1003688.0, ans=0.125 2023-10-12 08:52:06,920 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1003781.3333333334, ans=0.125 2023-10-12 08:52:35,169 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.29 vs. 
limit=6.0 2023-10-12 08:52:41,187 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 08:52:42,306 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1003921.3333333334, ans=0.0 2023-10-12 08:52:45,053 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1003921.3333333334, ans=0.125 2023-10-12 08:52:49,777 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.380e+02 1.718e+02 1.866e+02 2.048e+02 3.006e+02, threshold=3.732e+02, percent-clipped=0.0 2023-10-12 08:53:09,373 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1004014.6666666666, ans=0.1 2023-10-12 08:53:14,511 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1004014.6666666666, ans=0.05 2023-10-12 08:53:19,536 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=1004061.3333333334, ans=0.05 2023-10-12 08:53:25,197 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1004061.3333333334, ans=0.125 2023-10-12 08:53:53,390 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1004201.3333333334, ans=0.125 2023-10-12 08:53:54,578 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1004201.3333333334, ans=0.0 2023-10-12 08:54:03,010 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.55 vs. limit=15.0 2023-10-12 08:54:13,690 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1004248.0, ans=0.0 2023-10-12 08:54:32,242 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1004341.3333333334, ans=0.125 2023-10-12 08:54:35,739 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.43 vs. 
limit=6.0 2023-10-12 08:54:36,990 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1004341.3333333334, ans=0.0 2023-10-12 08:54:39,881 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1004388.0, ans=0.1 2023-10-12 08:54:46,200 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.385e+02 1.777e+02 1.914e+02 2.136e+02 3.654e+02, threshold=3.828e+02, percent-clipped=0.0 2023-10-12 08:54:54,101 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1004434.6666666666, ans=0.07 2023-10-12 08:55:03,617 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1004481.3333333334, ans=0.125 2023-10-12 08:55:22,629 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1004528.0, ans=0.125 2023-10-12 08:55:23,527 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1004528.0, ans=0.125 2023-10-12 08:55:26,170 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1004574.6666666666, ans=0.1 2023-10-12 08:55:27,314 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.18 vs. limit=15.0 2023-10-12 08:55:41,090 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1004621.3333333334, ans=0.125 2023-10-12 08:55:43,104 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=1004621.3333333334, ans=0.05 2023-10-12 08:55:45,043 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1004621.3333333334, ans=0.2 2023-10-12 08:55:47,025 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1004668.0, ans=0.125 2023-10-12 08:55:48,461 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1004668.0, ans=0.09899494936611666 2023-10-12 08:56:07,268 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten.whitening_limit, batch_count=1004714.6666666666, ans=15.0 2023-10-12 08:56:13,657 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1004761.3333333334, ans=0.125 2023-10-12 08:56:28,032 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1004808.0, ans=0.1 2023-10-12 08:56:30,445 INFO [train.py:1031] (3/4) Epoch 16, batch 10500, loss[loss=0.2014, simple_loss=0.2916, pruned_loss=0.0556, over 16798.00 frames. ], tot_loss[loss=0.1945, simple_loss=0.2852, pruned_loss=0.05192, over 32636754.44 frames. 
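The scaling.py:979 Whitening lines report, for a named activation, a measured whitening metric against that module's limit (metric=2.43 vs. limit=6.0 just above, for self_attn_weights.whiten_keys with num_groups=4). A natural reading, assumed here rather than quoted from scaling.py, is a metric that equals 1.0 when the per-group channel covariance is proportional to the identity and grows with anisotropy, e.g. the mean squared eigenvalue over the squared mean eigenvalue. The related scaling.py:1069 WithLoss lines print an auxiliary penalty attached to the attention weights, zero in most entries here and 7.193e-02 in one. A sketch of such a metric, under that assumed definition:

    import torch

    def whitening_metric_sketch(x, num_groups=1):
        # Sketch (assumed definition, not scaling.py's code): per group,
        # E[lambda^2] / E[lambda]^2 over the eigenvalues of the channel
        # covariance; equals 1.0 iff the covariance is isotropic.
        num_frames, num_channels = x.shape
        x = x.reshape(num_frames, num_groups, num_channels // num_groups)
        metrics = []
        for g in range(num_groups):
            xg = x[:, g, :]
            cov = (xg.t() @ xg) / num_frames
            eigs = torch.linalg.eigvalsh(cov)
            metrics.append(((eigs ** 2).mean() / eigs.mean() ** 2).item())
        return sum(metrics) / num_groups

    torch.manual_seed(0)
    # Close to 1.0 for near-white input; anisotropic input scores higher.
    print(whitening_metric_sketch(torch.randn(10000, 64)))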
], batch size: 175, lr: 2.17e-03, grad_scale: 16.0 2023-10-12 08:56:31,672 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=1004854.6666666666, ans=0.5 2023-10-12 08:56:38,465 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.387e+02 1.705e+02 1.889e+02 2.105e+02 2.689e+02, threshold=3.778e+02, percent-clipped=0.0 2023-10-12 08:57:03,105 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1004994.6666666666, ans=0.0 2023-10-12 08:57:09,262 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1004994.6666666666, ans=0.125 2023-10-12 08:57:42,994 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1005134.6666666666, ans=0.125 2023-10-12 08:57:43,400 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.13 vs. limit=15.0 2023-10-12 08:57:50,646 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1005181.3333333334, ans=0.125 2023-10-12 08:57:55,428 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1005181.3333333334, ans=0.2 2023-10-12 08:58:08,480 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1005228.0, ans=0.1 2023-10-12 08:58:11,447 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.13 vs. 
limit=22.5 2023-10-12 08:58:12,357 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1005228.0, ans=0.0 2023-10-12 08:58:25,480 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1005274.6666666666, ans=0.0 2023-10-12 08:58:34,850 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.497e+02 1.741e+02 1.855e+02 2.083e+02 2.812e+02, threshold=3.711e+02, percent-clipped=0.0 2023-10-12 08:58:40,983 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1005368.0, ans=0.0 2023-10-12 08:58:49,098 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1005414.6666666666, ans=0.125 2023-10-12 08:58:49,134 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1005414.6666666666, ans=0.0 2023-10-12 09:00:11,413 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1005741.3333333334, ans=0.0 2023-10-12 09:00:15,778 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1005741.3333333334, ans=0.125 2023-10-12 09:00:29,162 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.415e+02 1.752e+02 1.937e+02 2.193e+02 3.672e+02, threshold=3.874e+02, percent-clipped=0.0 2023-10-12 09:00:35,977 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1005834.6666666666, ans=0.025 2023-10-12 09:00:42,274 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1005834.6666666666, ans=0.0 2023-10-12 09:01:02,879 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1005928.0, ans=0.125 2023-10-12 09:01:04,159 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.71 vs. limit=22.5 2023-10-12 09:01:14,733 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1005974.6666666666, ans=0.09899494936611666 2023-10-12 09:01:36,597 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1006068.0, ans=0.1 2023-10-12 09:01:42,066 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.16 vs. 
limit=10.0 2023-10-12 09:01:42,462 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1006114.6666666666, ans=0.1 2023-10-12 09:02:21,002 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1006254.6666666666, ans=0.0 2023-10-12 09:02:22,505 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.446e+02 1.769e+02 1.877e+02 2.188e+02 3.282e+02, threshold=3.754e+02, percent-clipped=0.0 2023-10-12 09:02:27,161 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1006301.3333333334, ans=0.0 2023-10-12 09:02:34,525 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1006301.3333333334, ans=0.125 2023-10-12 09:02:38,601 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.62 vs. limit=15.0 2023-10-12 09:02:43,947 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1006348.0, ans=0.2 2023-10-12 09:03:24,666 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1006534.6666666666, ans=0.125 2023-10-12 09:03:28,814 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1006534.6666666666, ans=0.0 2023-10-12 09:03:34,099 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.70 vs. limit=6.0 2023-10-12 09:03:42,731 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1006628.0, ans=0.125 2023-10-12 09:03:43,805 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1006628.0, ans=0.0 2023-10-12 09:03:57,555 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.85 vs. limit=15.0 2023-10-12 09:04:12,256 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.28 vs. 
limit=12.0 2023-10-12 09:04:14,363 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.404e+02 1.595e+02 1.768e+02 2.018e+02 2.913e+02, threshold=3.537e+02, percent-clipped=0.0 2023-10-12 09:04:22,990 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1006768.0, ans=0.125 2023-10-12 09:04:29,167 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1006814.6666666666, ans=0.1 2023-10-12 09:04:36,507 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=1006814.6666666666, ans=15.0 2023-10-12 09:04:37,203 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1006814.6666666666, ans=0.125 2023-10-12 09:04:45,804 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1006861.3333333334, ans=0.125 2023-10-12 09:05:03,921 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1006954.6666666666, ans=0.125 2023-10-12 09:05:03,973 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1006954.6666666666, ans=0.0 2023-10-12 09:05:16,307 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1007001.3333333334, ans=0.125 2023-10-12 09:05:21,043 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1007048.0, ans=0.0 2023-10-12 09:05:38,777 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1007094.6666666666, ans=0.125 2023-10-12 09:05:55,487 INFO [train.py:1031] (3/4) Epoch 16, batch 11000, loss[loss=0.213, simple_loss=0.3072, pruned_loss=0.05937, over 16627.00 frames. ], tot_loss[loss=0.1946, simple_loss=0.2853, pruned_loss=0.05197, over 32662340.42 frames. ], batch size: 220, lr: 2.16e-03, grad_scale: 32.0 2023-10-12 09:06:02,773 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.361e+02 1.860e+02 2.060e+02 2.363e+02 3.305e+02, threshold=4.120e+02, percent-clipped=0.0 2023-10-12 09:06:12,454 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1007234.6666666666, ans=0.125 2023-10-12 09:06:30,537 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1007328.0, ans=0.04949747468305833 2023-10-12 09:06:34,268 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1007328.0, ans=0.0 2023-10-12 09:06:41,611 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1007374.6666666666, ans=0.0 2023-10-12 09:06:51,845 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.68 vs. 
limit=15.0 2023-10-12 09:06:54,445 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1007421.3333333334, ans=0.125 2023-10-12 09:07:08,381 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1007468.0, ans=0.125 2023-10-12 09:07:21,117 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1007514.6666666666, ans=0.125 2023-10-12 09:07:27,712 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1007561.3333333334, ans=0.125 2023-10-12 09:07:36,241 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1007608.0, ans=0.125 2023-10-12 09:07:59,739 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.372e+02 1.697e+02 1.925e+02 2.135e+02 3.411e+02, threshold=3.850e+02, percent-clipped=0.0 2023-10-12 09:08:00,926 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1007654.6666666666, ans=0.0 2023-10-12 09:08:04,050 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.20 vs. limit=15.0 2023-10-12 09:08:15,377 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1007701.3333333334, ans=0.125 2023-10-12 09:08:25,446 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1007748.0, ans=0.04949747468305833 2023-10-12 09:08:40,672 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.38 vs. limit=12.0 2023-10-12 09:08:46,211 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.53 vs. limit=10.0 2023-10-12 09:08:46,845 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1007841.3333333334, ans=0.1 2023-10-12 09:08:47,059 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.44 vs. 
limit=15.0 2023-10-12 09:08:49,602 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1007841.3333333334, ans=0.0 2023-10-12 09:09:10,589 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1007934.6666666666, ans=0.1 2023-10-12 09:09:11,749 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1007934.6666666666, ans=0.125 2023-10-12 09:09:12,500 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1007934.6666666666, ans=0.125 2023-10-12 09:09:57,820 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1008121.3333333334, ans=0.1 2023-10-12 09:09:58,391 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.357e+02 1.635e+02 1.787e+02 2.079e+02 3.244e+02, threshold=3.573e+02, percent-clipped=0.0 2023-10-12 09:10:02,366 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.30 vs. limit=12.0 2023-10-12 09:10:04,073 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1008168.0, ans=0.0 2023-10-12 09:10:05,977 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1008168.0, ans=0.1 2023-10-12 09:10:08,227 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=5.44 vs. limit=15.0 2023-10-12 09:10:10,703 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1008214.6666666666, ans=0.0 2023-10-12 09:10:22,491 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.79 vs. 
limit=15.0 2023-10-12 09:10:46,269 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1008354.6666666666, ans=0.125 2023-10-12 09:11:10,021 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1008448.0, ans=0.1 2023-10-12 09:11:15,580 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1008448.0, ans=0.1 2023-10-12 09:11:23,101 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1008494.6666666666, ans=0.07 2023-10-12 09:11:31,674 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1008494.6666666666, ans=0.0 2023-10-12 09:11:34,520 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1008494.6666666666, ans=0.0 2023-10-12 09:11:37,892 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1008541.3333333334, ans=0.0 2023-10-12 09:11:45,194 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1008541.3333333334, ans=0.1 2023-10-12 09:11:48,205 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1008588.0, ans=0.125 2023-10-12 09:11:55,300 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.419e+02 1.752e+02 1.933e+02 2.140e+02 3.243e+02, threshold=3.866e+02, percent-clipped=0.0 2023-10-12 09:12:12,528 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1008681.3333333334, ans=0.1 2023-10-12 09:12:45,087 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.96 vs. limit=15.0 2023-10-12 09:12:47,373 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1008821.3333333334, ans=0.0 2023-10-12 09:12:48,262 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1008821.3333333334, ans=0.125 2023-10-12 09:13:00,522 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1008868.0, ans=0.025 2023-10-12 09:13:10,948 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1008914.6666666666, ans=0.125 2023-10-12 09:13:16,839 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1008914.6666666666, ans=0.1 2023-10-12 09:13:17,894 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1008961.3333333334, ans=0.5 2023-10-12 09:13:26,314 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1008961.3333333334, ans=0.0 2023-10-12 09:13:30,638 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.88 vs. 
limit=22.5 2023-10-12 09:13:36,396 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1009008.0, ans=0.125 2023-10-12 09:13:46,893 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.485e+02 1.804e+02 1.965e+02 2.230e+02 2.923e+02, threshold=3.929e+02, percent-clipped=0.0 2023-10-12 09:13:49,916 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=11.16 vs. limit=22.5 2023-10-12 09:14:12,035 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.17 vs. limit=15.0 2023-10-12 09:14:25,660 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.76 vs. limit=10.0 2023-10-12 09:14:44,736 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1009288.0, ans=0.125 2023-10-12 09:14:50,354 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1009334.6666666666, ans=0.1 2023-10-12 09:14:59,527 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1009381.3333333334, ans=0.5 2023-10-12 09:15:16,056 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.37 vs. limit=10.0 2023-10-12 09:15:22,718 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.22 vs. limit=15.0 2023-10-12 09:15:28,243 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.74 vs. limit=6.0 2023-10-12 09:15:31,518 INFO [train.py:1031] (3/4) Epoch 16, batch 11500, loss[loss=0.1957, simple_loss=0.2966, pruned_loss=0.04737, over 16810.00 frames. ], tot_loss[loss=0.1943, simple_loss=0.285, pruned_loss=0.05187, over 32666397.42 frames. ], batch size: 175, lr: 2.16e-03, grad_scale: 32.0 2023-10-12 09:15:38,856 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.394e+02 1.765e+02 1.964e+02 2.149e+02 3.230e+02, threshold=3.929e+02, percent-clipped=0.0 2023-10-12 09:15:40,557 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1009521.3333333334, ans=0.0 2023-10-12 09:15:48,777 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1009568.0, ans=0.125 2023-10-12 09:15:52,139 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1009568.0, ans=0.0 2023-10-12 09:16:24,380 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1009708.0, ans=0.2 2023-10-12 09:16:37,869 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1009754.6666666666, ans=0.125 2023-10-12 09:16:38,918 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.38 vs. 
limit=10.0 2023-10-12 09:16:47,002 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1009801.3333333334, ans=0.0 2023-10-12 09:17:04,133 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1009848.0, ans=0.125 2023-10-12 09:17:09,708 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1009848.0, ans=0.0 2023-10-12 09:17:15,025 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1009894.6666666666, ans=0.0 2023-10-12 09:17:22,816 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1009941.3333333334, ans=0.125 2023-10-12 09:17:25,597 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1009941.3333333334, ans=0.125 2023-10-12 09:17:29,228 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1009941.3333333334, ans=0.0 2023-10-12 09:17:33,416 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1009941.3333333334, ans=0.125 2023-10-12 09:17:38,987 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1009988.0, ans=0.1 2023-10-12 09:17:43,595 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.454e+02 1.788e+02 2.032e+02 2.246e+02 3.203e+02, threshold=4.064e+02, percent-clipped=0.0 2023-10-12 09:17:58,542 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1010081.3333333334, ans=0.125 2023-10-12 09:18:18,754 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.26 vs. limit=12.0 2023-10-12 09:18:40,301 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1010268.0, ans=0.1 2023-10-12 09:18:40,801 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.33 vs. limit=10.0 2023-10-12 09:18:44,906 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1010268.0, ans=0.95 2023-10-12 09:18:56,486 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1010314.6666666666, ans=0.125 2023-10-12 09:19:07,175 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1010361.3333333334, ans=0.1 2023-10-12 09:19:11,818 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1010408.0, ans=0.125 2023-10-12 09:19:12,717 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 09:19:20,800 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.33 vs. 
limit=6.0 2023-10-12 09:19:21,567 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1010408.0, ans=0.0 2023-10-12 09:19:26,682 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 09:19:26,754 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1010454.6666666666, ans=0.125 2023-10-12 09:19:30,983 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1010454.6666666666, ans=0.1 2023-10-12 09:19:31,553 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.433e+02 1.659e+02 1.848e+02 1.990e+02 2.899e+02, threshold=3.697e+02, percent-clipped=0.0 2023-10-12 09:19:35,655 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1010501.3333333334, ans=0.125 2023-10-12 09:19:38,929 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1010501.3333333334, ans=0.125 2023-10-12 09:20:09,853 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1010641.3333333334, ans=0.125 2023-10-12 09:20:14,171 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1010641.3333333334, ans=0.1 2023-10-12 09:20:33,894 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1010688.0, ans=0.1 2023-10-12 09:20:48,302 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1010734.6666666666, ans=0.125 2023-10-12 09:21:18,887 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.31 vs. limit=15.0 2023-10-12 09:21:24,022 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1010874.6666666666, ans=0.1 2023-10-12 09:21:35,485 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.354e+02 1.709e+02 1.866e+02 2.133e+02 2.961e+02, threshold=3.732e+02, percent-clipped=0.0 2023-10-12 09:22:23,422 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1011108.0, ans=0.125 2023-10-12 09:22:32,530 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1011154.6666666666, ans=0.125 2023-10-12 09:22:37,437 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.02 vs. 
limit=6.0 2023-10-12 09:22:40,497 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1011201.3333333334, ans=0.0 2023-10-12 09:22:40,518 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1011201.3333333334, ans=0.0 2023-10-12 09:22:54,788 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1011248.0, ans=0.09899494936611666 2023-10-12 09:23:03,275 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.01 vs. limit=6.0 2023-10-12 09:23:06,388 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1011294.6666666666, ans=0.125 2023-10-12 09:23:22,591 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1011341.3333333334, ans=0.125 2023-10-12 09:23:22,949 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.79 vs. limit=15.0 2023-10-12 09:23:37,091 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.344e+02 1.747e+02 1.921e+02 2.164e+02 2.933e+02, threshold=3.843e+02, percent-clipped=0.0 2023-10-12 09:23:51,688 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1011481.3333333334, ans=0.125 2023-10-12 09:23:59,889 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1011481.3333333334, ans=0.2 2023-10-12 09:24:17,783 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.25 vs. limit=22.5 2023-10-12 09:24:19,563 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1011574.6666666666, ans=0.125 2023-10-12 09:24:40,751 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1011668.0, ans=0.125 2023-10-12 09:24:55,533 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1011714.6666666666, ans=0.1 2023-10-12 09:24:57,533 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.72 vs. limit=15.0 2023-10-12 09:25:00,474 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1011761.3333333334, ans=0.0 2023-10-12 09:25:02,262 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1011761.3333333334, ans=0.2 2023-10-12 09:25:03,173 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1011761.3333333334, ans=0.125 2023-10-12 09:25:16,309 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1011808.0, ans=0.0 2023-10-12 09:25:23,722 INFO [train.py:1031] (3/4) Epoch 16, batch 12000, loss[loss=0.1988, simple_loss=0.2891, pruned_loss=0.05423, over 16878.00 frames. 
], tot_loss[loss=0.1943, simple_loss=0.2852, pruned_loss=0.05174, over 32699297.86 frames. ], batch size: 110, lr: 2.16e-03, grad_scale: 32.0 2023-10-12 09:25:34,642 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.368e+02 1.720e+02 1.883e+02 2.167e+02 3.151e+02, threshold=3.766e+02, percent-clipped=0.0 2023-10-12 09:25:35,549 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.38 vs. limit=12.0 2023-10-12 09:25:39,934 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1011901.3333333334, ans=0.125 2023-10-12 09:26:27,583 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1012088.0, ans=0.125 2023-10-12 09:26:50,584 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1012181.3333333334, ans=0.125 2023-10-12 09:27:19,250 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1012321.3333333334, ans=0.2 2023-10-12 09:27:26,306 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.352e+02 1.698e+02 1.943e+02 2.263e+02 3.329e+02, threshold=3.886e+02, percent-clipped=0.0 2023-10-12 09:27:30,036 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1012368.0, ans=0.1 2023-10-12 09:27:49,854 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.76 vs. limit=6.0 2023-10-12 09:28:02,741 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1012508.0, ans=0.0 2023-10-12 09:28:10,389 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.93 vs. limit=15.0 2023-10-12 09:28:23,944 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff2.min_abs, batch_count=1012601.3333333334, ans=0.1 2023-10-12 09:28:32,972 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1012648.0, ans=0.1 2023-10-12 09:28:42,659 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1012648.0, ans=0.2 2023-10-12 09:28:42,694 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1012648.0, ans=0.125 2023-10-12 09:29:06,222 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1012788.0, ans=0.125 2023-10-12 09:29:15,759 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.457e+02 1.746e+02 1.954e+02 2.236e+02 3.777e+02, threshold=3.909e+02, percent-clipped=0.0 2023-10-12 09:29:23,487 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.48 vs. 
limit=22.5 2023-10-12 09:29:32,282 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1012881.3333333334, ans=0.125 2023-10-12 09:29:34,393 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1012881.3333333334, ans=0.125 2023-10-12 09:29:54,986 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1012974.6666666666, ans=0.125 2023-10-12 09:30:01,316 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.32 vs. limit=15.0 2023-10-12 09:30:05,461 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1013021.3333333334, ans=0.0 2023-10-12 09:30:07,260 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1013021.3333333334, ans=0.1 2023-10-12 09:30:09,362 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer_na.min_abs, batch_count=1013021.3333333334, ans=0.02 2023-10-12 09:30:19,633 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.32 vs. limit=15.0 2023-10-12 09:30:32,525 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1013114.6666666666, ans=0.1 2023-10-12 09:30:56,106 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.62 vs. limit=15.0 2023-10-12 09:31:08,374 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.352e+02 1.740e+02 1.958e+02 2.137e+02 2.610e+02, threshold=3.916e+02, percent-clipped=0.0 2023-10-12 09:31:21,052 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1013348.0, ans=0.125 2023-10-12 09:31:25,830 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1013348.0, ans=0.125 2023-10-12 09:31:27,524 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1013348.0, ans=0.2 2023-10-12 09:31:30,651 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1013394.6666666666, ans=0.0 2023-10-12 09:31:34,718 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.41 vs. 
limit=15.0 2023-10-12 09:31:49,729 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1013441.3333333334, ans=0.125 2023-10-12 09:31:54,587 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1013488.0, ans=0.125 2023-10-12 09:32:11,356 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1013534.6666666666, ans=0.2 2023-10-12 09:32:12,500 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1013534.6666666666, ans=0.125 2023-10-12 09:33:02,813 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.555e+02 1.761e+02 1.928e+02 2.131e+02 2.863e+02, threshold=3.857e+02, percent-clipped=0.0 2023-10-12 09:33:08,896 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1013768.0, ans=0.2 2023-10-12 09:33:22,721 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1013814.6666666666, ans=0.1 2023-10-12 09:33:36,098 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1013908.0, ans=0.125 2023-10-12 09:34:10,239 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1014001.3333333334, ans=0.125 2023-10-12 09:34:25,123 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1014094.6666666666, ans=0.125 2023-10-12 09:34:30,119 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1014094.6666666666, ans=0.0 2023-10-12 09:34:30,963 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1014094.6666666666, ans=0.125 2023-10-12 09:34:44,065 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1014141.3333333334, ans=0.125 2023-10-12 09:34:46,849 INFO [train.py:1031] (3/4) Epoch 16, batch 12500, loss[loss=0.2404, simple_loss=0.3118, pruned_loss=0.08451, over 15683.00 frames. ], tot_loss[loss=0.1943, simple_loss=0.285, pruned_loss=0.05183, over 32728740.86 frames. ], batch size: 350, lr: 2.16e-03, grad_scale: 32.0 2023-10-12 09:34:48,245 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1014188.0, ans=0.125 2023-10-12 09:34:56,408 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1014188.0, ans=0.125 2023-10-12 09:34:57,624 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.372e+02 1.760e+02 2.014e+02 2.324e+02 3.104e+02, threshold=4.029e+02, percent-clipped=0.0 2023-10-12 09:35:22,749 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.85 vs. 
limit=15.0 2023-10-12 09:35:36,751 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1014374.6666666666, ans=0.1 2023-10-12 09:35:37,764 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1014374.6666666666, ans=0.125 2023-10-12 09:35:57,466 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 09:36:17,576 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1014561.3333333334, ans=0.125 2023-10-12 09:36:22,597 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1014561.3333333334, ans=0.2 2023-10-12 09:36:31,925 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1014608.0, ans=0.125 2023-10-12 09:36:43,497 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1014654.6666666666, ans=0.0 2023-10-12 09:36:49,940 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.61 vs. limit=6.0 2023-10-12 09:36:50,227 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.475e+02 1.670e+02 1.844e+02 2.021e+02 2.772e+02, threshold=3.688e+02, percent-clipped=0.0 2023-10-12 09:37:07,702 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1014748.0, ans=0.125 2023-10-12 09:37:45,295 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1014888.0, ans=0.125 2023-10-12 09:37:50,034 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1014934.6666666666, ans=0.125 2023-10-12 09:37:54,543 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1014934.6666666666, ans=0.125 2023-10-12 09:38:43,431 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.358e+02 1.715e+02 1.910e+02 2.160e+02 3.053e+02, threshold=3.820e+02, percent-clipped=0.0 2023-10-12 09:39:03,401 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1015261.3333333334, ans=0.125 2023-10-12 09:39:37,431 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.36 vs. limit=15.0 2023-10-12 09:39:46,982 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 09:40:21,252 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.50 vs. limit=15.0 2023-10-12 09:40:33,236 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.82 vs. 
limit=15.0 2023-10-12 09:40:37,002 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1015588.0, ans=0.2 2023-10-12 09:40:43,730 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.360e+02 1.765e+02 1.949e+02 2.279e+02 3.238e+02, threshold=3.897e+02, percent-clipped=0.0 2023-10-12 09:40:44,389 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.30 vs. limit=15.0 2023-10-12 09:40:44,931 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1015634.6666666666, ans=0.0 2023-10-12 09:40:46,085 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1015634.6666666666, ans=0.125 2023-10-12 09:40:48,957 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 09:41:47,639 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1015868.0, ans=0.125 2023-10-12 09:42:21,018 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1016008.0, ans=0.95 2023-10-12 09:42:22,951 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1016008.0, ans=0.0 2023-10-12 09:42:25,521 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1016054.6666666666, ans=0.125 2023-10-12 09:42:28,413 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1016054.6666666666, ans=0.0 2023-10-12 09:42:37,184 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.406e+02 1.688e+02 1.936e+02 2.265e+02 4.302e+02, threshold=3.871e+02, percent-clipped=1.0 2023-10-12 09:42:48,792 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1016148.0, ans=0.125 2023-10-12 09:42:51,752 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1016148.0, ans=0.5 2023-10-12 09:42:52,646 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1016148.0, ans=0.125 2023-10-12 09:43:04,262 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1016194.6666666666, ans=0.0 2023-10-12 09:43:13,184 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1016241.3333333334, ans=0.125 2023-10-12 09:43:13,281 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1016241.3333333334, ans=0.125 2023-10-12 09:43:24,848 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1016288.0, ans=0.1 2023-10-12 09:44:15,993 INFO [train.py:1031] (3/4) Epoch 16, batch 13000, loss[loss=0.1925, simple_loss=0.2742, pruned_loss=0.05534, over 15947.00 frames. ], tot_loss[loss=0.1946, simple_loss=0.2854, pruned_loss=0.05189, over 32717700.15 frames. 
], batch size: 43, lr: 2.15e-03, grad_scale: 16.0 2023-10-12 09:44:24,701 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1016521.3333333334, ans=0.0 2023-10-12 09:44:28,289 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.319e+02 1.703e+02 1.845e+02 2.129e+02 2.723e+02, threshold=3.691e+02, percent-clipped=0.0 2023-10-12 09:44:54,785 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.37 vs. limit=15.0 2023-10-12 09:45:00,061 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1016661.3333333334, ans=0.125 2023-10-12 09:45:13,099 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1016708.0, ans=0.0 2023-10-12 09:45:27,899 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1016754.6666666666, ans=0.2 2023-10-12 09:45:34,903 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.14 vs. limit=15.0 2023-10-12 09:46:07,923 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.62 vs. limit=15.0 2023-10-12 09:46:28,543 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1016988.0, ans=0.0 2023-10-12 09:46:30,563 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1016988.0, ans=0.0 2023-10-12 09:46:34,169 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.420e+02 1.724e+02 1.899e+02 2.148e+02 3.210e+02, threshold=3.798e+02, percent-clipped=0.0 2023-10-12 09:46:50,573 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1017081.3333333334, ans=0.125 2023-10-12 09:46:53,573 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1017081.3333333334, ans=0.125 2023-10-12 09:46:58,611 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.25 vs. limit=22.5 2023-10-12 09:47:04,395 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1017128.0, ans=0.125 2023-10-12 09:47:13,354 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1017174.6666666666, ans=0.0 2023-10-12 09:47:24,045 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.13 vs. limit=12.0 2023-10-12 09:48:07,740 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.39 vs. limit=15.0 2023-10-12 09:48:09,875 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.09 vs. 
limit=22.5 2023-10-12 09:48:13,870 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1017408.0, ans=0.125 2023-10-12 09:48:28,488 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.71 vs. limit=22.5 2023-10-12 09:48:31,801 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1017501.3333333334, ans=0.1 2023-10-12 09:48:33,524 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.388e+02 1.670e+02 1.905e+02 2.095e+02 2.895e+02, threshold=3.810e+02, percent-clipped=0.0 2023-10-12 09:48:38,659 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1017501.3333333334, ans=0.2 2023-10-12 09:48:38,703 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 09:48:47,774 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=4.93 vs. limit=15.0 2023-10-12 09:48:49,814 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1017548.0, ans=0.2 2023-10-12 09:49:02,612 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1017594.6666666666, ans=0.125 2023-10-12 09:49:32,252 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1017734.6666666666, ans=0.1 2023-10-12 09:49:34,880 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.61 vs. limit=15.0 2023-10-12 09:50:21,924 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.387e+02 1.740e+02 1.869e+02 2.083e+02 3.079e+02, threshold=3.739e+02, percent-clipped=0.0 2023-10-12 09:50:42,507 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.24 vs. limit=15.0 2023-10-12 09:50:45,607 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1018061.3333333334, ans=0.125 2023-10-12 09:51:01,233 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1018108.0, ans=0.0 2023-10-12 09:51:05,035 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.02 vs. 
limit=6.0 2023-10-12 09:51:10,252 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1018154.6666666666, ans=0.125 2023-10-12 09:51:26,212 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1018248.0, ans=0.0 2023-10-12 09:51:43,299 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1018294.6666666666, ans=0.0 2023-10-12 09:51:45,341 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1018294.6666666666, ans=0.1 2023-10-12 09:51:52,134 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1018341.3333333334, ans=0.125 2023-10-12 09:52:11,838 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.501e+02 1.774e+02 1.938e+02 2.154e+02 3.080e+02, threshold=3.877e+02, percent-clipped=0.0 2023-10-12 09:52:12,326 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1018434.6666666666, ans=0.1 2023-10-12 09:52:14,757 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1018434.6666666666, ans=0.2 2023-10-12 09:52:33,578 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.00 vs. limit=15.0 2023-10-12 09:52:36,069 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1018528.0, ans=0.0 2023-10-12 09:52:38,871 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1018528.0, ans=0.1 2023-10-12 09:52:39,877 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1018528.0, ans=0.125 2023-10-12 09:52:41,787 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1018574.6666666666, ans=0.2 2023-10-12 09:52:52,385 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1018574.6666666666, ans=0.125 2023-10-12 09:52:53,436 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=14.52 vs. limit=22.5 2023-10-12 09:53:03,829 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.61 vs. 
limit=6.0 2023-10-12 09:53:04,474 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1018668.0, ans=0.125 2023-10-12 09:53:09,552 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1018668.0, ans=0.025 2023-10-12 09:53:15,800 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1018714.6666666666, ans=0.125 2023-10-12 09:53:16,635 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1018714.6666666666, ans=0.035 2023-10-12 09:53:23,441 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.62 vs. limit=6.0 2023-10-12 09:53:28,829 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1018761.3333333334, ans=0.125 2023-10-12 09:53:34,683 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1018808.0, ans=0.125 2023-10-12 09:53:44,261 INFO [train.py:1031] (3/4) Epoch 16, batch 13500, loss[loss=0.2103, simple_loss=0.2906, pruned_loss=0.06501, over 15923.00 frames. ], tot_loss[loss=0.194, simple_loss=0.2848, pruned_loss=0.05158, over 32762082.51 frames. ], batch size: 296, lr: 2.15e-03, grad_scale: 16.0 2023-10-12 09:53:44,575 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1018854.6666666666, ans=0.1 2023-10-12 09:53:47,184 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1018854.6666666666, ans=0.1 2023-10-12 09:53:49,449 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.12 vs. limit=15.0 2023-10-12 09:53:57,560 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.442e+02 1.710e+02 1.854e+02 2.039e+02 2.657e+02, threshold=3.708e+02, percent-clipped=0.0 2023-10-12 09:54:15,932 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1018948.0, ans=0.1 2023-10-12 09:54:42,938 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1019088.0, ans=0.1 2023-10-12 09:54:49,107 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1019088.0, ans=0.0 2023-10-12 09:54:52,243 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.43 vs. limit=6.0 2023-10-12 09:54:52,974 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1019134.6666666666, ans=0.0 2023-10-12 09:54:53,099 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1019134.6666666666, ans=0.125 2023-10-12 09:55:01,471 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.81 vs. 
limit=15.0 2023-10-12 09:55:05,187 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.63 vs. limit=22.5 2023-10-12 09:55:17,945 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1019228.0, ans=0.1 2023-10-12 09:55:20,818 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1019228.0, ans=0.2 2023-10-12 09:55:36,975 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1019321.3333333334, ans=0.2 2023-10-12 09:55:45,853 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.565e+02 1.835e+02 2.092e+02 2.498e+02 3.722e+02, threshold=4.183e+02, percent-clipped=1.0 2023-10-12 09:56:09,329 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1019461.3333333334, ans=0.2 2023-10-12 09:56:13,235 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1019508.0, ans=0.0 2023-10-12 09:56:17,521 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1019508.0, ans=0.1 2023-10-12 09:56:53,616 INFO [train.py:1031] (3/4) Epoch 17, batch 0, loss[loss=0.1608, simple_loss=0.2548, pruned_loss=0.03337, over 16884.00 frames. ], tot_loss[loss=0.1608, simple_loss=0.2548, pruned_loss=0.03337, over 16884.00 frames. ], batch size: 165, lr: 2.08e-03, grad_scale: 32.0 2023-10-12 09:56:53,617 INFO [train.py:1054] (3/4) Computing validation loss 2023-10-12 09:56:57,488 INFO [zipformer.py:1853] (3/4) name=encoder.encoders.5.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([2.7857, 3.3097, 1.8839, 4.9144], device='cuda:3') 2023-10-12 09:57:00,675 INFO [train.py:1063] (3/4) Epoch 17, validation: loss=0.2156, simple_loss=0.3028, pruned_loss=0.06418, over 1020973.00 frames. 
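[Note on the recurring optim.py records above] The "Clipping_scale=2.0, grad-norm quartiles ... threshold=... percent-clipped=..." lines summarize recently observed gradient norms: the five numbers are the minimum, 25th percentile, median, 75th percentile and maximum, and in every such record in this section the reported threshold equals 2.0 (the clipping scale) times the median, e.g. 3.778e+02 = 2.0 x 1.889e+02 in the first record of this section; percent-clipped is the share of recent batches whose gradient norm exceeded that threshold. The Python sketch below is a minimal illustration of that bookkeeping under these assumptions; it is not icefall's actual optim.py implementation, and the function name clipping_stats is invented for this note.

import torch

def clipping_stats(recent_grad_norms: torch.Tensor, clipping_scale: float = 2.0):
    # Five-point summary of recent gradient norms: min, 25%, median, 75%, max.
    q = torch.quantile(recent_grad_norms,
                       torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    # The logged threshold equals clipping_scale times the median norm.
    threshold = clipping_scale * q[2]
    # Share (in percent) of recent batches whose norm exceeded the threshold.
    percent_clipped = 100.0 * (recent_grad_norms > threshold).float().mean()
    return q, threshold, percent_clipped

# Worked check against the first record of this section:
# quartiles 1.387e+02 / 1.705e+02 / 1.889e+02 / 2.105e+02 / 2.689e+02
# -> threshold = 2.0 * 1.889e+02 = 3.778e+02, percent-clipped = 0.0,
#    since the largest observed norm (2.689e+02) stays below the threshold.

The ScheduledFloat records, by contrast, report the current value (ans=) of hyperparameters such as skip rates, balancer probabilities and bypass scales, which the training code schedules as a function of batch_count, and the Whitening records compare a measured whitening metric against its scheduled limit.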
2023-10-12 09:57:00,676 INFO [train.py:1064] (3/4) Maximum memory allocated so far is 16891MB
2023-10-12 09:57:22,287 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1019671.3333333334, ans=0.125
2023-10-12 09:57:23,230 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1019671.3333333334, ans=0.025
2023-10-12 09:57:23,974 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=1019671.3333333334, ans=0.5
2023-10-12 09:57:24,016 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1019671.3333333334, ans=0.125
2023-10-12 09:57:24,883 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1019671.3333333334, ans=0.0
2023-10-12 09:57:42,788 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1019718.0, ans=0.125
2023-10-12 09:58:04,275 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.354e+02 1.684e+02 1.856e+02 2.049e+02 2.950e+02, threshold=3.712e+02, percent-clipped=0.0
2023-10-12 09:58:12,289 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1019858.0, ans=0.125
2023-10-12 09:58:13,531 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-12 09:58:17,124 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1019858.0, ans=0.0
2023-10-12 09:58:24,441 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1019904.6666666666, ans=0.0
2023-10-12 09:58:29,489 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1019951.3333333334, ans=0.2
2023-10-12 09:58:38,436 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1019951.3333333334, ans=0.1
2023-10-12 09:58:45,910 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1019998.0, ans=0.0
2023-10-12 09:58:47,592 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1019998.0, ans=0.1
2023-10-12 09:59:06,473 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1020091.3333333334, ans=0.125
2023-10-12 09:59:48,128 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1020278.0, ans=0.2
2023-10-12 09:59:54,653 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.324e+02 1.661e+02 1.802e+02 2.031e+02 3.220e+02, threshold=3.604e+02, percent-clipped=0.0
2023-10-12 10:00:14,371 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1030371.3333333334, ans=0.1
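
Note on the train.py loss records: the reported loss decomposes as loss = 0.5 * simple_loss + pruned_loss. For the Epoch 17 validation above, 0.5 x 0.3028 + 0.06418 = 0.2156 to display precision, and for the Epoch 16, batch 13500 summary, 0.5 x 0.2848 + 0.05158 rounds to the logged 0.194. This matches a pruned-transducer recipe that mixes a simple (linear) joiner loss into the pruned loss; the 0.5 weight is inferred here from the logged numbers themselves:

    def combined_loss(simple_loss: float, pruned_loss: float,
                      simple_loss_scale: float = 0.5) -> float:
        """Sketch of the reported combination, with the scale inferred above."""
        return simple_loss_scale * simple_loss + pruned_loss

    print(combined_loss(0.3028, 0.06418))  # 0.21558 -> logged validation loss=0.2156
    print(combined_loss(0.2848, 0.05158))  # 0.19398 -> logged tot_loss loss=0.194
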
2023-10-12 10:00:16,237 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.20 vs. limit=22.5
2023-10-12 10:00:19,074 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1020418.0, ans=0.0
2023-10-12 10:00:34,857 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1020464.6666666666, ans=0.1
2023-10-12 10:00:34,884 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1020464.6666666666, ans=0.1
2023-10-12 10:00:40,335 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1020511.3333333334, ans=0.2
2023-10-12 10:00:57,870 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1020558.0, ans=0.125
2023-10-12 10:01:03,326 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1020604.6666666666, ans=0.125
2023-10-12 10:01:07,209 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1020604.6666666666, ans=0.125
2023-10-12 10:01:22,471 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1020698.0, ans=0.1
2023-10-12 10:01:38,117 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=10.92 vs. limit=22.5
2023-10-12 10:01:44,834 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1020744.6666666666, ans=0.125
2023-10-12 10:01:45,778 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=1020744.6666666666, ans=10.0
2023-10-12 10:01:46,383 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.484e+02 1.723e+02 1.860e+02 2.090e+02 2.958e+02, threshold=3.720e+02, percent-clipped=0.0
2023-10-12 10:02:10,356 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1020884.6666666666, ans=0.05
2023-10-12 10:02:23,392 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.49 vs. limit=15.0
2023-10-12 10:02:24,332 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1020931.3333333334, ans=0.0
2023-10-12 10:02:35,858 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.00 vs. limit=10.0
2023-10-12 10:02:44,746 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.19 vs. limit=10.0
2023-10-12 10:02:57,600 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1021071.3333333334, ans=0.125
2023-10-12 10:03:03,847 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=11.98 vs. limit=22.5
2023-10-12 10:03:10,971 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1021118.0, ans=0.035
2023-10-12 10:03:32,911 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.418e+02 1.748e+02 1.959e+02 2.174e+02 2.956e+02, threshold=3.918e+02, percent-clipped=0.0
2023-10-12 10:03:47,749 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1021304.6666666666, ans=0.0
2023-10-12 10:04:18,538 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1021444.6666666666, ans=0.07
2023-10-12 10:04:28,969 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1021491.3333333334, ans=0.0
2023-10-12 10:04:34,515 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=10.02 vs. limit=15.0
2023-10-12 10:04:37,314 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1021491.3333333334, ans=0.1
2023-10-12 10:04:37,414 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1021491.3333333334, ans=0.125
2023-10-12 10:04:41,153 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1021538.0, ans=0.1
2023-10-12 10:04:54,653 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.79 vs. limit=10.0
2023-10-12 10:04:55,362 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1021584.6666666666, ans=0.125
2023-10-12 10:05:09,824 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1021631.3333333334, ans=0.0
2023-10-12 10:05:17,636 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1021678.0, ans=0.125
2023-10-12 10:05:23,536 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.421e+02 1.702e+02 1.870e+02 2.098e+02 3.003e+02, threshold=3.739e+02, percent-clipped=0.0
2023-10-12 10:05:34,045 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1021724.6666666666, ans=0.0
2023-10-12 10:05:34,939 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1021771.3333333334, ans=0.125
2023-10-12 10:05:51,752 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2023-10-12 10:06:09,893 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.74 vs. limit=15.0
2023-10-12 10:06:10,319 INFO [train.py:1031] (3/4) Epoch 17, batch 500, loss[loss=0.1682, simple_loss=0.2651, pruned_loss=0.03564, over 16963.00 frames. ], tot_loss[loss=0.1942, simple_loss=0.2846, pruned_loss=0.05196, over 7278881.64 frames. ], batch size: 117, lr: 2.08e-03, grad_scale: 32.0
2023-10-12 10:06:22,934 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1021958.0, ans=0.1
2023-10-12 10:06:30,235 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1021958.0, ans=0.0
2023-10-12 10:06:53,447 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1022098.0, ans=0.0
2023-10-12 10:07:09,797 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1022144.6666666666, ans=0.0
2023-10-12 10:07:12,746 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.336e+02 1.788e+02 2.025e+02 2.302e+02 2.968e+02, threshold=4.049e+02, percent-clipped=0.0
2023-10-12 10:07:15,023 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1022191.3333333334, ans=0.125
2023-10-12 10:07:21,046 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1022191.3333333334, ans=0.125
2023-10-12 10:07:23,887 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.64 vs. limit=15.0
2023-10-12 10:07:34,231 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.51 vs. limit=12.0
2023-10-12 10:07:36,718 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.48 vs. limit=15.0
2023-10-12 10:07:39,006 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1022284.6666666666, ans=0.2
2023-10-12 10:07:45,430 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2023-10-12 10:08:14,315 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.18 vs. limit=15.0
2023-10-12 10:08:20,488 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1022471.3333333334, ans=0.2
2023-10-12 10:08:21,238 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer_ff3.min_abs, batch_count=1022471.3333333334, ans=0.2
2023-10-12 10:08:22,288 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1022471.3333333334, ans=0.2
2023-10-12 10:08:35,583 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1022518.0, ans=0.0
2023-10-12 10:08:52,353 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-12 10:09:01,900 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.482e+02 1.776e+02 1.916e+02 2.180e+02 3.143e+02, threshold=3.833e+02, percent-clipped=0.0
2023-10-12 10:09:07,463 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1022658.0, ans=0.125
2023-10-12 10:09:34,877 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1022798.0, ans=0.125
2023-10-12 10:09:41,409 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=10.17 vs. limit=15.0
2023-10-12 10:09:51,829 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1022844.6666666666, ans=0.125
2023-10-12 10:09:54,523 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1022844.6666666666, ans=0.0
2023-10-12 10:09:54,897 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=3.78 vs. limit=15.0
2023-10-12 10:09:59,146 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1022891.3333333334, ans=0.1
2023-10-12 10:10:02,233 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.27 vs. limit=15.0
2023-10-12 10:10:09,455 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1022938.0, ans=0.2
2023-10-12 10:10:15,558 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1022938.0, ans=0.1
2023-10-12 10:10:21,625 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.47 vs. limit=15.0
2023-10-12 10:10:30,498 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.86 vs. limit=10.0
2023-10-12 10:10:47,675 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.27 vs. limit=6.0
2023-10-12 10:10:53,208 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.442e+02 1.787e+02 1.944e+02 2.153e+02 3.291e+02, threshold=3.888e+02, percent-clipped=0.0
2023-10-12 10:11:00,153 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.02 vs. limit=15.0
2023-10-12 10:11:15,393 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1023218.0, ans=0.125
2023-10-12 10:11:22,370 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.27 vs. limit=8.0
2023-10-12 10:11:29,541 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.52 vs. limit=10.0
2023-10-12 10:11:40,035 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.55 vs. limit=22.5
2023-10-12 10:11:49,179 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.40 vs. limit=22.5
2023-10-12 10:11:56,886 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1023404.6666666666, ans=0.0
2023-10-12 10:11:59,049 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1023404.6666666666, ans=0.1
2023-10-12 10:12:00,097 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1023404.6666666666, ans=0.1
2023-10-12 10:12:29,009 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.80 vs. limit=15.0
2023-10-12 10:12:33,478 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1023498.0, ans=0.125
2023-10-12 10:12:47,021 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.387e+02 1.718e+02 1.856e+02 2.043e+02 4.318e+02, threshold=3.711e+02, percent-clipped=1.0
2023-10-12 10:12:47,205 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1023591.3333333334, ans=0.125
2023-10-12 10:12:50,671 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1023591.3333333334, ans=0.0
2023-10-12 10:13:02,229 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.05 vs. limit=15.0
2023-10-12 10:13:20,303 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1023731.3333333334, ans=0.125
2023-10-12 10:13:27,832 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1023731.3333333334, ans=0.0
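
Note on the [scaling.py:979] Whitening entries: each compares a whiteness statistic of a module's activations against a limit (the limit itself can be scheduled; a whiten.whitening_limit appears as a ScheduledFloat elsewhere in this log), with num_groups splitting the channels into groups before the statistic is taken. The logged metrics always lie between 1 and num_channels, consistent with a statistic that is 1.0 for a perfectly white (isotropic) covariance and approaches num_channels when the energy sits in one direction. One statistic with exactly those properties, as a sketch only (the exact formula lives in icefall's scaling.py and may differ):

    import torch

    def whitening_metric(x: torch.Tensor) -> torch.Tensor:
        """x: (num_frames, num_channels). Returns 1.0 for a white covariance,
        up to num_channels when one direction dominates."""
        x = x - x.mean(dim=0)
        cov = (x.t() @ x) / x.shape[0]
        # For symmetric cov, (cov ** 2).sum() equals the sum of squared eigenvalues.
        return x.shape[1] * (cov ** 2).sum() / cov.trace() ** 2

    white = torch.randn(10000, 128)
    print(float(whitening_metric(white)))  # close to 1, plus sampling noise
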
2023-10-12 10:13:38,634 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.74 vs. limit=15.0
2023-10-12 10:13:40,287 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1023824.6666666666, ans=0.025
2023-10-12 10:13:55,125 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1023871.3333333334, ans=0.0
2023-10-12 10:14:00,196 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1023871.3333333334, ans=0.2
2023-10-12 10:14:24,759 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1023964.6666666666, ans=0.125
2023-10-12 10:14:25,484 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1024011.3333333334, ans=0.125
2023-10-12 10:14:32,246 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1024011.3333333334, ans=0.125
2023-10-12 10:14:36,521 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.409e+02 1.700e+02 1.887e+02 2.175e+02 2.907e+02, threshold=3.773e+02, percent-clipped=0.0
2023-10-12 10:14:40,848 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1024058.0, ans=0.2
2023-10-12 10:14:42,654 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1024058.0, ans=0.0
2023-10-12 10:15:10,130 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1024198.0, ans=0.2
2023-10-12 10:15:19,572 INFO [train.py:1031] (3/4) Epoch 17, batch 1000, loss[loss=0.1919, simple_loss=0.2903, pruned_loss=0.04679, over 16852.00 frames. ], tot_loss[loss=0.1956, simple_loss=0.2861, pruned_loss=0.05257, over 12920344.63 frames. ], batch size: 188, lr: 2.08e-03, grad_scale: 16.0
2023-10-12 10:15:27,398 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1024244.6666666666, ans=0.125
2023-10-12 10:15:48,308 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1024338.0, ans=0.125
2023-10-12 10:15:54,060 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1024384.6666666666, ans=0.125
2023-10-12 10:15:57,663 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1024384.6666666666, ans=0.125
2023-10-12 10:16:09,996 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.14 vs. limit=15.0
2023-10-12 10:16:18,203 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1024478.0, ans=0.0
2023-10-12 10:16:18,530 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.21 vs. limit=15.0
2023-10-12 10:16:20,662 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.435e+02 1.725e+02 1.898e+02 2.216e+02 3.881e+02, threshold=3.796e+02, percent-clipped=1.0
2023-10-12 10:16:22,534 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1024524.6666666666, ans=0.0
2023-10-12 10:16:22,541 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1024524.6666666666, ans=0.125
2023-10-12 10:16:27,142 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1024524.6666666666, ans=0.2
2023-10-12 10:16:51,771 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1024664.6666666666, ans=0.125
2023-10-12 10:17:02,161 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.04 vs. limit=6.0
2023-10-12 10:17:12,337 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1024758.0, ans=0.125
2023-10-12 10:17:37,820 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1024804.6666666666, ans=0.125
2023-10-12 10:17:49,368 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1024851.3333333334, ans=0.125
2023-10-12 10:17:51,438 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1024898.0, ans=0.125
2023-10-12 10:17:56,604 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.98 vs. limit=15.0
2023-10-12 10:17:56,826 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.60 vs. limit=6.0
2023-10-12 10:18:06,598 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1024944.6666666666, ans=0.125
2023-10-12 10:18:13,404 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.495e+02 1.752e+02 1.936e+02 2.247e+02 3.675e+02, threshold=3.872e+02, percent-clipped=0.0
2023-10-12 10:18:21,830 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1024991.3333333334, ans=0.125
2023-10-12 10:18:23,645 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1024991.3333333334, ans=0.0
2023-10-12 10:18:36,114 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1025038.0, ans=0.1
2023-10-12 10:18:44,868 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1025084.6666666666, ans=0.125
2023-10-12 10:18:46,854 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1025084.6666666666, ans=0.125
2023-10-12 10:18:46,863 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1025084.6666666666, ans=0.125
2023-10-12 10:18:50,299 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1025084.6666666666, ans=0.125
2023-10-12 10:18:56,839 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.91 vs. limit=15.0
2023-10-12 10:19:13,066 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.98 vs. limit=22.5
2023-10-12 10:19:13,827 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1025224.6666666666, ans=0.0
2023-10-12 10:19:16,263 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-12 10:19:20,328 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1025224.6666666666, ans=0.125
2023-10-12 10:19:36,021 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1025271.3333333334, ans=0.0
2023-10-12 10:19:46,833 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1025318.0, ans=0.125
2023-10-12 10:20:03,115 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1025411.3333333334, ans=0.2
2023-10-12 10:20:09,941 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.426e+02 1.782e+02 1.935e+02 2.155e+02 3.106e+02, threshold=3.870e+02, percent-clipped=0.0
2023-10-12 10:20:11,345 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=3.99 vs. limit=12.0
2023-10-12 10:20:18,753 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1025458.0, ans=0.125
2023-10-12 10:21:14,585 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1025738.0, ans=0.0
2023-10-12 10:21:14,732 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1025738.0, ans=0.125
2023-10-12 10:21:29,064 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1025784.6666666666, ans=0.125
2023-10-12 10:21:38,858 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-12 10:21:51,387 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1025878.0, ans=0.125
2023-10-12 10:21:56,179 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.365e+02 1.631e+02 1.861e+02 2.094e+02 2.940e+02, threshold=3.723e+02, percent-clipped=0.0
2023-10-12 10:21:56,581 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1025924.6666666666, ans=0.2
2023-10-12 10:22:30,481 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.38 vs. limit=15.0
2023-10-12 10:22:40,007 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1026111.3333333334, ans=0.125
2023-10-12 10:22:56,002 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1026158.0, ans=0.0
2023-10-12 10:22:59,826 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1026158.0, ans=0.1
2023-10-12 10:23:11,196 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.60 vs. limit=15.0
2023-10-12 10:23:42,775 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1026344.6666666666, ans=0.0
2023-10-12 10:23:49,731 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.410e+02 1.744e+02 1.898e+02 2.138e+02 2.962e+02, threshold=3.795e+02, percent-clipped=0.0
2023-10-12 10:23:50,275 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1026391.3333333334, ans=0.1
2023-10-12 10:24:07,116 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1026438.0, ans=0.125
2023-10-12 10:24:09,469 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.74 vs. limit=10.0
2023-10-12 10:24:34,179 INFO [train.py:1031] (3/4) Epoch 17, batch 1500, loss[loss=0.1783, simple_loss=0.2778, pruned_loss=0.03944, over 16936.00 frames. ], tot_loss[loss=0.1938, simple_loss=0.2842, pruned_loss=0.05168, over 17311727.81 frames. ], batch size: 138, lr: 2.07e-03, grad_scale: 16.0
2023-10-12 10:24:46,396 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1026624.6666666666, ans=0.0
2023-10-12 10:24:50,150 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1026624.6666666666, ans=0.125
2023-10-12 10:25:00,881 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1026671.3333333334, ans=0.0
2023-10-12 10:25:29,119 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.43 vs. limit=12.0
2023-10-12 10:25:40,915 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1026858.0, ans=0.0
2023-10-12 10:25:40,925 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1026858.0, ans=0.025
2023-10-12 10:25:41,609 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.307e+02 1.758e+02 1.908e+02 2.078e+02 3.026e+02, threshold=3.817e+02, percent-clipped=0.0
2023-10-12 10:25:44,055 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1026858.0, ans=0.1
2023-10-12 10:25:49,604 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1026858.0, ans=0.125
2023-10-12 10:25:49,907 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.13 vs. limit=12.0
2023-10-12 10:26:10,526 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.37 vs. limit=15.0
2023-10-12 10:26:31,579 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.70 vs. limit=15.0
2023-10-12 10:26:55,466 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1027138.0, ans=0.2
2023-10-12 10:27:01,582 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.52 vs. limit=15.0
2023-10-12 10:27:17,075 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1027231.3333333334, ans=0.2
2023-10-12 10:27:23,844 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1027231.3333333334, ans=0.0
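
Note on the tot_loss records: the "over N frames" counts are not a plain running total. Between the Epoch 17 summaries at batch 500, 1000 and 1500 the frame count grows by about 5.64e+06, 4.39e+06 and 3.44e+06, a ratio of roughly 0.78 per 500 batches, i.e. about (1 - 1/2000) per batch. So tot_loss appears to be a frame-weighted average whose accumulated statistics decay by roughly 1 - 1/2000 each batch, an exponentially weighted view of the last couple of thousand batches. A sketch with that inferred decay:

    class RunningLossTracker:
        """Frame-weighted loss average with per-batch decay (inferred ~1 - 1/2000)."""
        def __init__(self, decay: float = 1.0 - 1.0 / 2000):
            self.decay = decay
            self.loss_sum = 0.0
            self.frames = 0.0

        def update(self, batch_loss: float, batch_frames: float) -> float:
            self.loss_sum = self.loss_sum * self.decay + batch_loss * batch_frames
            self.frames = self.frames * self.decay + batch_frames
            return self.loss_sum / self.frames  # the value reported as tot_loss

    tracker = RunningLossTracker()
    for _ in range(500):  # ~16k frames per batch, as in the log
        tracker.update(0.19, 16000.0)
    print(tracker.frames)  # ~7.1e+06, near the logged 7278881.64 at batch 500
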
2023-10-12 10:27:33,043 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=11.10 vs. limit=12.0
2023-10-12 10:27:39,725 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.498e+02 1.720e+02 1.918e+02 2.202e+02 3.123e+02, threshold=3.836e+02, percent-clipped=0.0
2023-10-12 10:27:43,219 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1027324.6666666666, ans=0.07
2023-10-12 10:27:51,047 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1027371.3333333334, ans=0.125
2023-10-12 10:28:00,425 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1027418.0, ans=0.2
2023-10-12 10:28:05,170 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1027418.0, ans=0.0
2023-10-12 10:28:08,684 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1027418.0, ans=0.0
2023-10-12 10:28:10,047 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=6.01 vs. limit=15.0
2023-10-12 10:28:35,558 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.37 vs. limit=10.0
2023-10-12 10:28:49,687 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1027558.0, ans=0.125
2023-10-12 10:29:14,223 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1027698.0, ans=0.1
2023-10-12 10:29:37,066 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.454e+02 1.817e+02 1.981e+02 2.302e+02 3.285e+02, threshold=3.961e+02, percent-clipped=0.0
2023-10-12 10:29:46,602 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1027838.0, ans=0.0
2023-10-12 10:30:15,584 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.07 vs. limit=15.0
2023-10-12 10:30:19,349 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1027931.3333333334, ans=0.2
2023-10-12 10:30:27,346 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1027978.0, ans=0.125
2023-10-12 10:30:31,981 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1027978.0, ans=0.07
2023-10-12 10:30:35,885 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1028024.6666666666, ans=0.0
2023-10-12 10:30:42,552 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1028024.6666666666, ans=0.125
2023-10-12 10:30:48,380 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1028071.3333333334, ans=0.125
2023-10-12 10:31:00,521 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1028118.0, ans=0.125
2023-10-12 10:31:11,980 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1028164.6666666666, ans=0.1
2023-10-12 10:31:28,805 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1028258.0, ans=0.1
2023-10-12 10:31:32,154 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.460e+02 1.695e+02 1.850e+02 2.045e+02 3.272e+02, threshold=3.701e+02, percent-clipped=0.0
2023-10-12 10:31:34,789 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1028258.0, ans=0.125
2023-10-12 10:31:35,951 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1028258.0, ans=0.125
2023-10-12 10:31:45,982 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1028304.6666666666, ans=0.0
2023-10-12 10:31:46,119 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1028304.6666666666, ans=0.125
2023-10-12 10:31:46,139 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1028304.6666666666, ans=0.125
2023-10-12 10:32:09,609 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.12 vs. limit=12.0
2023-10-12 10:32:16,817 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.10 vs. limit=15.0
2023-10-12 10:32:55,610 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1028584.6666666666, ans=0.025
2023-10-12 10:33:00,922 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1028631.3333333334, ans=0.125
2023-10-12 10:33:15,205 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1028678.0, ans=0.125
2023-10-12 10:33:29,761 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.481e+02 1.725e+02 1.881e+02 2.102e+02 2.935e+02, threshold=3.762e+02, percent-clipped=0.0
2023-10-12 10:33:34,342 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1028724.6666666666, ans=0.125
2023-10-12 10:33:42,195 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1028771.3333333334, ans=0.125
2023-10-12 10:33:50,062 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1028771.3333333334, ans=0.0
2023-10-12 10:33:55,489 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=5.09 vs. limit=12.0
2023-10-12 10:34:17,147 INFO [train.py:1031] (3/4) Epoch 17, batch 2000, loss[loss=0.1984, simple_loss=0.2895, pruned_loss=0.0537, over 16057.00 frames. ], tot_loss[loss=0.1941, simple_loss=0.2847, pruned_loss=0.05173, over 20750233.42 frames. ], batch size: 43, lr: 2.07e-03, grad_scale: 32.0
2023-10-12 10:34:23,143 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1028911.3333333334, ans=0.1
2023-10-12 10:34:23,350 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1028911.3333333334, ans=0.1
2023-10-12 10:34:33,047 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1028958.0, ans=0.0
2023-10-12 10:34:35,299 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1028958.0, ans=0.0
2023-10-12 10:35:04,935 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1029051.3333333334, ans=0.0
2023-10-12 10:35:38,366 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.349e+02 1.718e+02 1.871e+02 2.076e+02 2.686e+02, threshold=3.742e+02, percent-clipped=0.0
2023-10-12 10:35:53,150 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1029238.0, ans=0.125
2023-10-12 10:35:56,019 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1029238.0, ans=0.0
2023-10-12 10:36:11,717 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1029331.3333333334, ans=0.125
2023-10-12 10:36:34,092 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1029378.0, ans=0.07
2023-10-12 10:36:35,612 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1029378.0, ans=0.2
2023-10-12 10:36:38,494 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.36 vs. limit=15.0
2023-10-12 10:37:05,273 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1029471.3333333334, ans=10.0
2023-10-12 10:37:13,419 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1029471.3333333334, ans=0.125
2023-10-12 10:37:21,021 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1029518.0, ans=0.2
2023-10-12 10:37:21,729 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1029518.0, ans=0.125
2023-10-12 10:37:33,948 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1029564.6666666666, ans=0.0
2023-10-12 10:37:37,228 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.40 vs. limit=22.5
2023-10-12 10:37:50,534 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1029611.3333333334, ans=0.0
2023-10-12 10:37:51,411 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1029611.3333333334, ans=0.2
2023-10-12 10:37:53,205 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1029611.3333333334, ans=0.1
2023-10-12 10:37:56,236 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1029611.3333333334, ans=0.125
2023-10-12 10:37:56,288 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=1029611.3333333334, ans=0.07
2023-10-12 10:38:01,644 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.431e+02 1.793e+02 1.992e+02 2.278e+02 3.646e+02, threshold=3.985e+02, percent-clipped=0.0
2023-10-12 10:38:15,617 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1029704.6666666666, ans=0.125
2023-10-12 10:38:28,786 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1029751.3333333334, ans=0.125
2023-10-12 10:38:34,065 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1029798.0, ans=0.125
2023-10-12 10:38:40,051 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.98 vs. limit=15.0
2023-10-12 10:38:50,974 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.21 vs. limit=12.0
2023-10-12 10:38:58,658 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.61 vs. limit=10.0
2023-10-12 10:39:07,105 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1029938.0, ans=0.0
2023-10-12 10:39:11,760 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1029938.0, ans=0.09899494936611666
2023-10-12 10:39:15,801 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1029984.6666666666, ans=0.125
2023-10-12 10:39:19,092 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1029984.6666666666, ans=0.125
2023-10-12 10:39:52,975 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.384e+02 1.745e+02 1.889e+02 2.148e+02 3.098e+02, threshold=3.777e+02, percent-clipped=0.0
2023-10-12 10:40:08,741 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.83 vs. limit=10.0
2023-10-12 10:40:41,799 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.86 vs. limit=15.0
2023-10-12 10:40:47,324 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_na.min_abs, batch_count=1030358.0, ans=0.02
2023-10-12 10:41:12,096 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.37 vs. limit=15.0
2023-10-12 10:41:24,073 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.11 vs. limit=22.5
2023-10-12 10:41:26,544 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1030544.6666666666, ans=0.0
2023-10-12 10:41:29,360 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.22 vs. limit=22.5
2023-10-12 10:41:37,346 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1030591.3333333334, ans=0.125
2023-10-12 10:41:41,121 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.381e+02 1.768e+02 1.933e+02 2.176e+02 3.096e+02, threshold=3.866e+02, percent-clipped=0.0
2023-10-12 10:41:57,474 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1030638.0, ans=0.0
2023-10-12 10:42:05,655 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.89 vs. limit=15.0
2023-10-12 10:42:09,530 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1030731.3333333334, ans=0.125
2023-10-12 10:42:11,082 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1030731.3333333334, ans=0.125
2023-10-12 10:42:26,070 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1030778.0, ans=0.125
2023-10-12 10:42:29,348 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1030778.0, ans=0.0
2023-10-12 10:42:40,599 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1030824.6666666666, ans=0.125
2023-10-12 10:42:51,472 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.70 vs. limit=15.0
2023-10-12 10:42:51,776 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.67 vs. limit=8.0
2023-10-12 10:42:53,902 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1030918.0, ans=0.1
2023-10-12 10:42:54,953 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1030918.0, ans=0.0
2023-10-12 10:43:16,268 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1031011.3333333334, ans=0.125
2023-10-12 10:43:17,337 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1031011.3333333334, ans=0.125
2023-10-12 10:43:20,291 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1031011.3333333334, ans=0.0
2023-10-12 10:43:30,434 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.511e+02 1.772e+02 1.905e+02 2.104e+02 2.979e+02, threshold=3.809e+02, percent-clipped=0.0
2023-10-12 10:43:34,298 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1031058.0, ans=0.125
2023-10-12 10:43:35,100 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1031058.0, ans=0.0
2023-10-12 10:43:36,976 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1031104.6666666666, ans=0.125
2023-10-12 10:43:38,968 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.43 vs. limit=15.0
2023-10-12 10:43:41,721 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.68 vs. limit=10.0
2023-10-12 10:44:09,693 INFO [train.py:1031] (3/4) Epoch 17, batch 2500, loss[loss=0.1983, simple_loss=0.2833, pruned_loss=0.05666, over 16567.00 frames. ], tot_loss[loss=0.1945, simple_loss=0.2851, pruned_loss=0.052, over 23416538.71 frames. ], batch size: 66, lr: 2.07e-03, grad_scale: 32.0
2023-10-12 10:44:18,402 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.58 vs. limit=15.0
2023-10-12 10:44:39,211 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1031338.0, ans=0.0
2023-10-12 10:44:40,487 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.23 vs. limit=15.0
2023-10-12 10:44:42,745 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1031384.6666666666, ans=0.0
2023-10-12 10:44:42,790 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1031384.6666666666, ans=0.125
2023-10-12 10:45:04,513 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1031478.0, ans=0.2
2023-10-12 10:45:09,652 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1031478.0, ans=0.2
2023-10-12 10:45:12,367 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1031524.6666666666, ans=0.125
2023-10-12 10:45:13,148 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1031524.6666666666, ans=0.2
2023-10-12 10:45:14,876 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.508e+02 1.690e+02 1.830e+02 2.025e+02 2.731e+02, threshold=3.660e+02, percent-clipped=0.0
2023-10-12 10:45:21,089 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.85 vs. limit=15.0
2023-10-12 10:45:30,400 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1031571.3333333334, ans=0.1
2023-10-12 10:45:34,971 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1031618.0, ans=0.125
2023-10-12 10:45:50,222 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1031664.6666666666, ans=0.035
2023-10-12 10:46:00,908 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1031711.3333333334, ans=0.1
2023-10-12 10:46:20,747 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.31 vs. limit=22.5
2023-10-12 10:46:34,833 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1031851.3333333334, ans=0.1
2023-10-12 10:46:43,697 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1031898.0, ans=0.2
2023-10-12 10:46:49,896 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1031898.0, ans=0.125
2023-10-12 10:47:00,222 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1031944.6666666666, ans=0.0
2023-10-12 10:47:05,017 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.51 vs. limit=6.0
2023-10-12 10:47:05,362 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.459e+02 1.714e+02 1.937e+02 2.131e+02 2.812e+02, threshold=3.874e+02, percent-clipped=0.0
2023-10-12 10:47:08,755 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1031991.3333333334, ans=0.125
2023-10-12 10:47:11,651 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1032038.0, ans=0.1
2023-10-12 10:47:32,518 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1032084.6666666666, ans=0.125
2023-10-12 10:47:38,697 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1032131.3333333334, ans=0.125
2023-10-12 10:47:42,358 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1032131.3333333334, ans=0.1
2023-10-12 10:47:55,287 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1032178.0, ans=0.04949747468305833
2023-10-12 10:48:12,086 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.96 vs. limit=10.0
2023-10-12 10:48:16,023 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1032271.3333333334, ans=0.0
limit=15.0 2023-10-12 10:48:42,557 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.whiten.whitening_limit, batch_count=1032364.6666666666, ans=12.0 2023-10-12 10:48:46,486 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1032411.3333333334, ans=0.125 2023-10-12 10:49:01,675 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1032458.0, ans=0.125 2023-10-12 10:49:02,894 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1032458.0, ans=0.0 2023-10-12 10:49:03,576 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.409e+02 1.766e+02 1.884e+02 2.185e+02 3.320e+02, threshold=3.767e+02, percent-clipped=0.0 2023-10-12 10:49:09,448 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1032458.0, ans=0.0 2023-10-12 10:49:12,883 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1032504.6666666666, ans=0.1 2023-10-12 10:49:17,710 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1032504.6666666666, ans=0.125 2023-10-12 10:49:19,028 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1032504.6666666666, ans=0.125 2023-10-12 10:49:23,350 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.68 vs. limit=15.0 2023-10-12 10:49:32,931 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1032598.0, ans=0.2 2023-10-12 10:49:57,055 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1032691.3333333334, ans=0.2 2023-10-12 10:49:58,630 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1032691.3333333334, ans=0.1 2023-10-12 10:50:46,920 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1032878.0, ans=0.1 2023-10-12 10:50:57,260 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1032924.6666666666, ans=0.07 2023-10-12 10:50:59,970 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.391e+02 1.704e+02 1.886e+02 2.109e+02 2.690e+02, threshold=3.773e+02, percent-clipped=0.0 2023-10-12 10:51:19,120 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1032971.3333333334, ans=0.1 2023-10-12 10:51:25,712 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1033018.0, ans=0.125 2023-10-12 10:51:27,406 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1033018.0, ans=0.1 2023-10-12 10:51:48,889 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1033111.3333333334, ans=0.2 2023-10-12 10:52:01,649 INFO [scaling.py:199] (3/4) 
ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1033158.0, ans=0.0 2023-10-12 10:52:11,015 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1033158.0, ans=0.1 2023-10-12 10:52:11,988 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1033204.6666666666, ans=0.125 2023-10-12 10:52:15,608 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.91 vs. limit=6.0 2023-10-12 10:52:35,351 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1033298.0, ans=0.125 2023-10-12 10:52:46,327 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.19 vs. limit=15.0 2023-10-12 10:52:55,811 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.35 vs. limit=10.0 2023-10-12 10:52:59,521 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.83 vs. limit=15.0 2023-10-12 10:52:59,748 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.404e+02 1.761e+02 1.991e+02 2.289e+02 3.056e+02, threshold=3.982e+02, percent-clipped=0.0 2023-10-12 10:53:40,309 INFO [train.py:1031] (3/4) Epoch 17, batch 3000, loss[loss=0.1725, simple_loss=0.2694, pruned_loss=0.03781, over 16841.00 frames. ], tot_loss[loss=0.1942, simple_loss=0.2844, pruned_loss=0.05204, over 25486777.77 frames. ], batch size: 146, lr: 2.07e-03, grad_scale: 32.0 2023-10-12 10:53:40,789 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.22 vs. 
limit=15.0 2023-10-12 10:53:50,284 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1033624.6666666666, ans=0.2 2023-10-12 10:53:53,024 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1033624.6666666666, ans=0.2 2023-10-12 10:53:55,019 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1033624.6666666666, ans=0.125 2023-10-12 10:54:10,363 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1033671.3333333334, ans=0.125 2023-10-12 10:54:12,657 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1033718.0, ans=0.0 2023-10-12 10:54:15,394 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1033718.0, ans=0.0 2023-10-12 10:54:20,030 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff2.min_abs, batch_count=1033718.0, ans=0.1 2023-10-12 10:54:21,746 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1033718.0, ans=0.0 2023-10-12 10:54:31,126 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1033764.6666666666, ans=0.0 2023-10-12 10:54:32,076 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1033764.6666666666, ans=0.2 2023-10-12 10:54:43,681 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1033811.3333333334, ans=0.125 2023-10-12 10:54:53,750 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.420e+02 1.691e+02 1.872e+02 2.145e+02 3.353e+02, threshold=3.744e+02, percent-clipped=0.0 2023-10-12 10:55:05,428 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1033904.6666666666, ans=0.125 2023-10-12 10:55:12,851 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1033951.3333333334, ans=0.2 2023-10-12 10:55:19,248 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1033951.3333333334, ans=0.2 2023-10-12 10:55:22,827 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.78 vs. 
limit=15.0 2023-10-12 10:55:37,433 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1034044.6666666666, ans=0.1 2023-10-12 10:56:02,088 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1034138.0, ans=0.1 2023-10-12 10:56:12,600 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1034184.6666666666, ans=0.0 2023-10-12 10:56:14,883 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1034184.6666666666, ans=0.125 2023-10-12 10:56:36,156 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1034278.0, ans=0.2 2023-10-12 10:56:49,377 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.498e+02 1.763e+02 1.878e+02 2.177e+02 2.913e+02, threshold=3.756e+02, percent-clipped=0.0 2023-10-12 10:56:57,730 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1034371.3333333334, ans=0.1 2023-10-12 10:57:34,700 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1034511.3333333334, ans=0.0 2023-10-12 10:57:37,495 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 10:57:45,212 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1034558.0, ans=0.125 2023-10-12 10:58:16,039 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.25 vs. limit=10.0 2023-10-12 10:58:24,232 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=1034698.0, ans=15.0 2023-10-12 10:58:25,126 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1034698.0, ans=0.125 2023-10-12 10:58:26,726 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.69 vs. 
limit=15.0 2023-10-12 10:58:28,120 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1034698.0, ans=0.0 2023-10-12 10:58:29,012 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1034698.0, ans=0.0 2023-10-12 10:58:33,108 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1034744.6666666666, ans=0.0 2023-10-12 10:58:37,700 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1034744.6666666666, ans=0.125 2023-10-12 10:58:54,002 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.493e+02 1.784e+02 2.025e+02 2.309e+02 3.954e+02, threshold=4.051e+02, percent-clipped=1.0 2023-10-12 10:59:00,150 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1034838.0, ans=0.125 2023-10-12 10:59:04,653 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1034838.0, ans=0.0 2023-10-12 10:59:29,309 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1034931.3333333334, ans=0.125 2023-10-12 10:59:49,302 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1035024.6666666666, ans=0.125 2023-10-12 10:59:54,333 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1035071.3333333334, ans=0.125 2023-10-12 11:00:01,615 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1035071.3333333334, ans=0.0 2023-10-12 11:00:31,650 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1035211.3333333334, ans=0.0 2023-10-12 11:00:50,057 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.510e+02 1.759e+02 1.884e+02 2.050e+02 3.041e+02, threshold=3.769e+02, percent-clipped=0.0 2023-10-12 11:00:58,354 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1035304.6666666666, ans=0.125 2023-10-12 11:01:02,047 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1035304.6666666666, ans=0.0 2023-10-12 11:01:37,339 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1035444.6666666666, ans=0.125 2023-10-12 11:01:56,787 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1035538.0, ans=0.1 2023-10-12 11:02:23,424 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 11:02:27,538 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.75 vs. limit=15.0 2023-10-12 11:02:43,244 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.70 vs. 
limit=15.0 2023-10-12 11:02:43,838 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.390e+02 1.708e+02 1.895e+02 2.107e+02 2.728e+02, threshold=3.790e+02, percent-clipped=0.0 2023-10-12 11:02:47,726 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1035724.6666666666, ans=10.0 2023-10-12 11:03:14,748 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.73 vs. limit=10.0 2023-10-12 11:03:21,932 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1035864.6666666666, ans=0.125 2023-10-12 11:03:24,158 INFO [train.py:1031] (3/4) Epoch 17, batch 3500, loss[loss=0.2006, simple_loss=0.2932, pruned_loss=0.05403, over 16940.00 frames. ], tot_loss[loss=0.1941, simple_loss=0.2843, pruned_loss=0.05196, over 27124349.36 frames. ], batch size: 93, lr: 2.07e-03, grad_scale: 16.0 2023-10-12 11:03:34,778 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1035958.0, ans=0.0 2023-10-12 11:03:35,107 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.15 vs. limit=15.0 2023-10-12 11:03:47,513 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1036004.6666666666, ans=0.1 2023-10-12 11:03:49,083 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1036004.6666666666, ans=0.125 2023-10-12 11:04:01,829 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1036051.3333333334, ans=0.125 2023-10-12 11:04:09,718 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1036098.0, ans=0.0 2023-10-12 11:04:25,655 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1036144.6666666666, ans=0.1 2023-10-12 11:04:34,035 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1036191.3333333334, ans=10.0 2023-10-12 11:04:34,630 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.461e+02 1.764e+02 2.002e+02 2.247e+02 3.370e+02, threshold=4.004e+02, percent-clipped=0.0 2023-10-12 11:04:35,637 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.95 vs. limit=15.0 2023-10-12 11:04:46,674 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1036238.0, ans=0.0 2023-10-12 11:04:46,763 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1036238.0, ans=0.1 2023-10-12 11:04:53,260 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1036238.0, ans=0.125 2023-10-12 11:05:01,669 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.00 vs. 
limit=15.0 2023-10-12 11:05:30,490 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1036424.6666666666, ans=0.04949747468305833 2023-10-12 11:05:30,811 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=1036424.6666666666, ans=15.0 2023-10-12 11:05:36,662 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1036424.6666666666, ans=0.1 2023-10-12 11:05:52,497 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1036471.3333333334, ans=0.125 2023-10-12 11:05:53,783 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1036518.0, ans=0.125 2023-10-12 11:06:14,467 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.68 vs. limit=10.0 2023-10-12 11:06:38,954 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.58 vs. limit=22.5 2023-10-12 11:06:44,873 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1036658.0, ans=0.1 2023-10-12 11:06:48,322 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.503e+02 1.762e+02 1.985e+02 2.223e+02 3.035e+02, threshold=3.970e+02, percent-clipped=0.0 2023-10-12 11:07:24,935 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.60 vs. limit=22.5 2023-10-12 11:07:39,872 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1036891.3333333334, ans=0.1 2023-10-12 11:07:46,748 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1036891.3333333334, ans=0.2 2023-10-12 11:08:06,218 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1036938.0, ans=0.1 2023-10-12 11:08:09,910 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1036984.6666666666, ans=0.0 2023-10-12 11:08:21,434 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1037031.3333333334, ans=0.0 2023-10-12 11:08:21,640 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.80 vs. 
limit=15.0 2023-10-12 11:08:23,089 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1037031.3333333334, ans=0.1 2023-10-12 11:08:23,863 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1037031.3333333334, ans=0.125 2023-10-12 11:08:31,783 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1037078.0, ans=0.125 2023-10-12 11:08:35,036 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.44 vs. limit=6.0 2023-10-12 11:08:48,809 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.408e+02 1.673e+02 1.803e+02 1.953e+02 3.169e+02, threshold=3.606e+02, percent-clipped=0.0 2023-10-12 11:08:56,290 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1037171.3333333334, ans=0.0 2023-10-12 11:09:38,238 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.51 vs. limit=12.0 2023-10-12 11:09:40,487 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1037358.0, ans=0.04949747468305833 2023-10-12 11:09:43,008 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.89 vs. limit=15.0 2023-10-12 11:10:19,093 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 11:10:43,106 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.381e+02 1.656e+02 1.859e+02 2.018e+02 3.077e+02, threshold=3.719e+02, percent-clipped=0.0 2023-10-12 11:10:55,362 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=14.54 vs. limit=22.5 2023-10-12 11:10:58,200 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.13 vs. limit=22.5 2023-10-12 11:11:08,234 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1037731.3333333334, ans=0.125 2023-10-12 11:12:23,133 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.18 vs. limit=15.0 2023-10-12 11:12:31,146 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.502e+02 1.656e+02 1.834e+02 2.040e+02 2.876e+02, threshold=3.667e+02, percent-clipped=0.0 2023-10-12 11:12:49,937 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1038151.3333333334, ans=0.1 2023-10-12 11:13:07,658 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1038198.0, ans=0.125 2023-10-12 11:13:09,828 INFO [train.py:1031] (3/4) Epoch 17, batch 4000, loss[loss=0.1972, simple_loss=0.2901, pruned_loss=0.05216, over 16926.00 frames. ], tot_loss[loss=0.1942, simple_loss=0.2842, pruned_loss=0.05214, over 28390018.85 frames. 
], batch size: 82, lr: 2.06e-03, grad_scale: 32.0 2023-10-12 11:13:11,390 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1038244.6666666666, ans=0.125 2023-10-12 11:13:11,477 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1038244.6666666666, ans=0.1 2023-10-12 11:13:14,045 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1038244.6666666666, ans=0.1 2023-10-12 11:13:16,561 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1038244.6666666666, ans=0.125 2023-10-12 11:13:35,855 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.50 vs. limit=10.0 2023-10-12 11:13:59,864 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.00 vs. limit=15.0 2023-10-12 11:14:05,453 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.23 vs. limit=15.0 2023-10-12 11:14:26,120 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.506e+02 1.827e+02 2.015e+02 2.319e+02 3.129e+02, threshold=4.030e+02, percent-clipped=0.0 2023-10-12 11:14:34,678 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1038571.3333333334, ans=0.125 2023-10-12 11:14:48,773 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1038618.0, ans=0.0 2023-10-12 11:15:10,797 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.29 vs. limit=22.5 2023-10-12 11:15:15,386 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1038758.0, ans=0.0 2023-10-12 11:15:32,625 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1038804.6666666666, ans=0.1 2023-10-12 11:15:53,770 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1038898.0, ans=0.125 2023-10-12 11:16:14,033 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1038991.3333333334, ans=0.0 2023-10-12 11:16:14,117 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1038991.3333333334, ans=0.1 2023-10-12 11:16:16,013 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 11:16:23,722 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.420e+02 1.747e+02 1.921e+02 2.112e+02 3.895e+02, threshold=3.842e+02, percent-clipped=0.0 2023-10-12 11:16:38,625 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.10 vs. limit=15.0 2023-10-12 11:17:22,329 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.57 vs. 
limit=5.0 2023-10-12 11:17:37,943 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1039271.3333333334, ans=0.07 2023-10-12 11:18:03,596 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1039364.6666666666, ans=0.125 2023-10-12 11:18:24,118 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.420e+02 1.699e+02 1.826e+02 2.016e+02 3.213e+02, threshold=3.653e+02, percent-clipped=0.0 2023-10-12 11:18:25,381 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 11:18:35,731 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.37 vs. limit=15.0 2023-10-12 11:18:40,123 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1039551.3333333334, ans=0.0 2023-10-12 11:18:40,503 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.52 vs. limit=22.5 2023-10-12 11:18:58,746 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1039644.6666666666, ans=0.125 2023-10-12 11:19:23,954 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1039738.0, ans=0.1 2023-10-12 11:19:29,755 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1039738.0, ans=0.125 2023-10-12 11:19:30,952 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.35 vs. limit=15.0 2023-10-12 11:19:39,341 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1039784.6666666666, ans=0.125 2023-10-12 11:19:43,525 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1039831.3333333334, ans=0.125 2023-10-12 11:20:08,637 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.84 vs. 
limit=15.0 2023-10-12 11:20:11,813 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.518e+02 1.905e+02 2.123e+02 2.459e+02 3.376e+02, threshold=4.245e+02, percent-clipped=0.0 2023-10-12 11:20:12,380 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1039924.6666666666, ans=0.0 2023-10-12 11:20:17,954 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1039971.3333333334, ans=0.0 2023-10-12 11:20:18,049 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1039971.3333333334, ans=0.125 2023-10-12 11:20:29,927 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1040018.0, ans=0.0 2023-10-12 11:20:45,375 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1040064.6666666666, ans=0.0 2023-10-12 11:20:47,911 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1040064.6666666666, ans=0.1 2023-10-12 11:20:49,994 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1040111.3333333334, ans=0.125 2023-10-12 11:21:00,495 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1040111.3333333334, ans=0.2 2023-10-12 11:21:08,565 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1040158.0, ans=0.0 2023-10-12 11:21:19,716 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1040204.6666666666, ans=0.125 2023-10-12 11:21:25,960 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.53 vs. limit=22.5 2023-10-12 11:21:33,323 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1040251.3333333334, ans=0.04949747468305833 2023-10-12 11:21:46,537 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.54 vs. 
limit=12.0 2023-10-12 11:21:54,476 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1040344.6666666666, ans=0.2 2023-10-12 11:22:10,490 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1040391.3333333334, ans=0.125 2023-10-12 11:22:11,613 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1040391.3333333334, ans=0.5 2023-10-12 11:22:16,434 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.458e+02 1.889e+02 2.051e+02 2.231e+02 3.377e+02, threshold=4.102e+02, percent-clipped=0.0 2023-10-12 11:22:33,705 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1040484.6666666666, ans=0.125 2023-10-12 11:22:34,534 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1040484.6666666666, ans=0.125 2023-10-12 11:22:44,636 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.98 vs. limit=15.0 2023-10-12 11:22:45,811 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.21 vs. limit=15.0 2023-10-12 11:22:51,377 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.18 vs. limit=15.0 2023-10-12 11:22:51,858 INFO [train.py:1031] (3/4) Epoch 17, batch 4500, loss[loss=0.196, simple_loss=0.2892, pruned_loss=0.05136, over 16901.00 frames. ], tot_loss[loss=0.194, simple_loss=0.2843, pruned_loss=0.05181, over 29362012.51 frames. 
], batch size: 72, lr: 2.06e-03, grad_scale: 32.0 2023-10-12 11:22:52,224 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1040578.0, ans=0.0 2023-10-12 11:22:58,271 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1040578.0, ans=0.0 2023-10-12 11:23:11,465 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1040624.6666666666, ans=0.0 2023-10-12 11:23:26,088 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1040718.0, ans=0.1 2023-10-12 11:23:27,835 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1040718.0, ans=0.2 2023-10-12 11:23:36,061 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1040764.6666666666, ans=0.0 2023-10-12 11:23:50,389 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1040811.3333333334, ans=0.0 2023-10-12 11:24:01,988 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.397e+02 1.698e+02 1.862e+02 2.005e+02 3.191e+02, threshold=3.725e+02, percent-clipped=0.0 2023-10-12 11:24:08,377 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1040904.6666666666, ans=0.2 2023-10-12 11:24:19,626 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1040951.3333333334, ans=0.125 2023-10-12 11:24:30,699 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1040998.0, ans=0.0 2023-10-12 11:24:59,608 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1041138.0, ans=0.125 2023-10-12 11:25:08,424 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1041138.0, ans=0.1 2023-10-12 11:25:15,050 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1041184.6666666666, ans=0.0 2023-10-12 11:25:16,769 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1041184.6666666666, ans=0.1 2023-10-12 11:25:17,611 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1041184.6666666666, ans=0.125 2023-10-12 11:25:21,291 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.33 vs. 
limit=6.0 2023-10-12 11:25:32,428 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1041278.0, ans=0.125 2023-10-12 11:25:34,388 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1041278.0, ans=0.125 2023-10-12 11:25:48,565 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.335e+02 1.771e+02 1.978e+02 2.267e+02 3.246e+02, threshold=3.956e+02, percent-clipped=0.0 2023-10-12 11:26:01,812 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1041371.3333333334, ans=0.1 2023-10-12 11:26:01,991 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.57 vs. limit=10.0 2023-10-12 11:26:03,455 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1041418.0, ans=0.125 2023-10-12 11:26:09,516 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.68 vs. limit=10.0 2023-10-12 11:26:34,996 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1041558.0, ans=0.125 2023-10-12 11:26:38,556 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.33 vs. limit=22.5 2023-10-12 11:26:55,964 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1041651.3333333334, ans=0.125 2023-10-12 11:27:01,573 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.78 vs. limit=10.0 2023-10-12 11:27:13,170 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1041698.0, ans=0.0 2023-10-12 11:27:15,373 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.74 vs. limit=15.0 2023-10-12 11:27:20,921 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1041744.6666666666, ans=0.2 2023-10-12 11:27:30,979 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1041791.3333333334, ans=0.125 2023-10-12 11:27:31,830 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1041791.3333333334, ans=0.125 2023-10-12 11:27:36,034 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.478e+02 1.770e+02 1.956e+02 2.280e+02 3.716e+02, threshold=3.912e+02, percent-clipped=0.0 2023-10-12 11:27:37,645 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.54 vs. limit=10.0 2023-10-12 11:27:50,334 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=5.94 vs. 
limit=15.0 2023-10-12 11:27:55,774 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1041884.6666666666, ans=0.0 2023-10-12 11:28:10,703 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1041978.0, ans=0.125 2023-10-12 11:28:17,319 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1042024.6666666666, ans=0.125 2023-10-12 11:28:17,343 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1042024.6666666666, ans=0.125 2023-10-12 11:28:29,290 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1042024.6666666666, ans=0.125 2023-10-12 11:28:41,932 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1042071.3333333334, ans=0.2 2023-10-12 11:28:42,155 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.65 vs. limit=15.0 2023-10-12 11:29:17,166 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1042211.3333333334, ans=0.1 2023-10-12 11:29:21,900 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1042258.0, ans=0.0 2023-10-12 11:29:29,489 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.488e+02 1.687e+02 1.853e+02 2.066e+02 3.511e+02, threshold=3.706e+02, percent-clipped=0.0 2023-10-12 11:29:45,452 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.13 vs. limit=22.5 2023-10-12 11:29:46,821 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1042351.3333333334, ans=0.125 2023-10-12 11:30:09,591 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.08 vs. 
limit=15.0 2023-10-12 11:30:13,978 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 11:30:28,427 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1042538.0, ans=0.0 2023-10-12 11:30:41,368 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1042584.6666666666, ans=0.0 2023-10-12 11:31:10,664 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1042678.0, ans=0.125 2023-10-12 11:31:22,473 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.423e+02 1.710e+02 1.844e+02 2.083e+02 2.836e+02, threshold=3.689e+02, percent-clipped=0.0 2023-10-12 11:31:30,103 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1042771.3333333334, ans=0.025 2023-10-12 11:31:34,951 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=1.266e-01 2023-10-12 11:31:40,712 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1042818.0, ans=0.0 2023-10-12 11:31:58,893 INFO [train.py:1031] (3/4) Epoch 17, batch 5000, loss[loss=0.2402, simple_loss=0.316, pruned_loss=0.08219, over 16057.00 frames. ], tot_loss[loss=0.194, simple_loss=0.2842, pruned_loss=0.05188, over 30148131.36 frames. ], batch size: 296, lr: 2.06e-03, grad_scale: 32.0 2023-10-12 11:32:27,324 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.36 vs. limit=15.0 2023-10-12 11:32:29,905 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1043004.6666666666, ans=0.125 2023-10-12 11:32:53,976 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1043144.6666666666, ans=0.0 2023-10-12 11:32:59,700 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.21 vs. limit=15.0 2023-10-12 11:33:13,670 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.491e+02 1.790e+02 1.935e+02 2.178e+02 3.079e+02, threshold=3.869e+02, percent-clipped=0.0 2023-10-12 11:33:31,475 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1043284.6666666666, ans=0.0 2023-10-12 11:33:47,354 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1043331.3333333334, ans=0.125 2023-10-12 11:33:47,648 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.50 vs. limit=15.0 2023-10-12 11:34:22,193 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1043471.3333333334, ans=0.0 2023-10-12 11:34:27,407 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1043518.0, ans=0.0 2023-10-12 11:34:27,638 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.77 vs. 
limit=15.0 2023-10-12 11:34:30,791 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1043518.0, ans=0.125 2023-10-12 11:34:31,893 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1043518.0, ans=0.04949747468305833 2023-10-12 11:34:55,039 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1043611.3333333334, ans=0.1 2023-10-12 11:34:57,444 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1043611.3333333334, ans=0.125 2023-10-12 11:35:10,087 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=1043658.0, ans=0.05 2023-10-12 11:35:11,311 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.481e+02 1.704e+02 1.934e+02 2.176e+02 3.621e+02, threshold=3.868e+02, percent-clipped=0.0 2023-10-12 11:35:27,727 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 11:35:33,956 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1043798.0, ans=0.125 2023-10-12 11:35:45,218 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1043844.6666666666, ans=0.2 2023-10-12 11:35:51,634 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1043844.6666666666, ans=0.1 2023-10-12 11:35:51,658 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1043844.6666666666, ans=0.07 2023-10-12 11:36:02,037 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1043891.3333333334, ans=0.09899494936611666 2023-10-12 11:36:35,936 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1044078.0, ans=0.125 2023-10-12 11:36:37,468 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1044078.0, ans=0.1 2023-10-12 11:36:39,176 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1044078.0, ans=0.0 2023-10-12 11:36:47,206 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1044078.0, ans=0.2 2023-10-12 11:36:57,459 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.84 vs. 
limit=15.0 2023-10-12 11:37:01,443 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.457e+02 1.727e+02 1.868e+02 2.066e+02 2.891e+02, threshold=3.736e+02, percent-clipped=0.0 2023-10-12 11:37:13,216 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1044218.0, ans=0.0 2023-10-12 11:37:34,346 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1044264.6666666666, ans=0.0 2023-10-12 11:38:33,822 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1044544.6666666666, ans=0.125 2023-10-12 11:38:50,566 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1044591.3333333334, ans=0.2 2023-10-12 11:38:54,925 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1044591.3333333334, ans=0.0 2023-10-12 11:38:56,160 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.399e+02 1.661e+02 1.857e+02 2.055e+02 2.845e+02, threshold=3.715e+02, percent-clipped=0.0 2023-10-12 11:38:56,433 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1044638.0, ans=0.125 2023-10-12 11:39:09,232 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1044684.6666666666, ans=0.0 2023-10-12 11:39:10,864 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1044684.6666666666, ans=0.0 2023-10-12 11:39:13,986 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1044684.6666666666, ans=0.2 2023-10-12 11:39:15,816 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1044731.3333333334, ans=0.2 2023-10-12 11:39:15,924 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1044731.3333333334, ans=0.1 2023-10-12 11:39:33,135 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1044778.0, ans=0.125 2023-10-12 11:39:37,237 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1044824.6666666666, ans=0.1 2023-10-12 11:39:53,453 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.23 vs. limit=15.0 2023-10-12 11:40:42,484 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.383e+02 1.700e+02 1.905e+02 2.254e+02 3.297e+02, threshold=3.810e+02, percent-clipped=0.0 2023-10-12 11:40:44,016 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 11:40:47,917 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1045104.6666666666, ans=0.1 2023-10-12 11:40:52,536 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 11:41:17,202 INFO [train.py:1031] (3/4) Epoch 17, batch 5500, loss[loss=0.1699, simple_loss=0.2672, pruned_loss=0.0363, over 16950.00 frames. 
], tot_loss[loss=0.1937, simple_loss=0.284, pruned_loss=0.05174, over 30728582.83 frames. ], batch size: 82, lr: 2.06e-03, grad_scale: 16.0 2023-10-12 11:41:20,887 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1045244.6666666666, ans=0.125 2023-10-12 11:41:34,964 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.78 vs. limit=15.0 2023-10-12 11:41:45,554 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1045338.0, ans=0.1 2023-10-12 11:41:50,304 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.12 vs. limit=10.0 2023-10-12 11:41:53,329 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.50 vs. limit=12.0 2023-10-12 11:41:54,929 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1045384.6666666666, ans=0.2 2023-10-12 11:41:59,189 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.74 vs. limit=15.0 2023-10-12 11:42:00,667 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1045431.3333333334, ans=0.125 2023-10-12 11:42:01,609 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1045431.3333333334, ans=0.04949747468305833 2023-10-12 11:42:10,104 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.10 vs. limit=6.0 2023-10-12 11:42:30,081 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.60 vs. 
limit=15.0 2023-10-12 11:42:34,497 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.425e+02 1.668e+02 1.785e+02 1.937e+02 2.713e+02, threshold=3.570e+02, percent-clipped=0.0 2023-10-12 11:42:34,839 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1045571.3333333334, ans=0.0 2023-10-12 11:42:35,796 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1045571.3333333334, ans=0.125 2023-10-12 11:42:57,289 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1045664.6666666666, ans=0.1 2023-10-12 11:43:15,926 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1045758.0, ans=0.125 2023-10-12 11:43:24,247 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1045758.0, ans=0.0 2023-10-12 11:43:28,313 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1045804.6666666666, ans=0.0 2023-10-12 11:43:28,399 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1045804.6666666666, ans=0.0 2023-10-12 11:43:29,413 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1045804.6666666666, ans=0.125 2023-10-12 11:43:41,440 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1045851.3333333334, ans=0.125 2023-10-12 11:43:42,461 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1045851.3333333334, ans=0.125 2023-10-12 11:43:42,643 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=12.08 vs. limit=22.5 2023-10-12 11:43:43,348 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 11:44:03,689 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.81 vs. limit=15.0 2023-10-12 11:44:10,392 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1045991.3333333334, ans=0.2 2023-10-12 11:44:22,608 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.484e+02 1.722e+02 1.901e+02 2.122e+02 2.617e+02, threshold=3.802e+02, percent-clipped=0.0 2023-10-12 11:44:24,406 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1046038.0, ans=0.125 2023-10-12 11:44:25,555 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1046038.0, ans=0.1 2023-10-12 11:44:36,555 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1046084.6666666666, ans=0.0 2023-10-12 11:44:41,782 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.59 vs. 
limit=22.5 2023-10-12 11:44:42,468 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 11:44:45,565 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1046131.3333333334, ans=0.04949747468305833 2023-10-12 11:44:59,427 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.57 vs. limit=10.0 2023-10-12 11:45:15,609 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1046224.6666666666, ans=0.2 2023-10-12 11:45:27,894 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1046271.3333333334, ans=0.125 2023-10-12 11:45:28,717 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1046271.3333333334, ans=0.0 2023-10-12 11:45:28,828 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1046271.3333333334, ans=0.0 2023-10-12 11:45:56,161 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1046411.3333333334, ans=0.0 2023-10-12 11:45:57,385 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1046411.3333333334, ans=0.2 2023-10-12 11:46:01,461 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1046458.0, ans=0.125 2023-10-12 11:46:05,077 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1046458.0, ans=0.1 2023-10-12 11:46:14,649 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.467e+02 1.768e+02 1.984e+02 2.257e+02 2.986e+02, threshold=3.967e+02, percent-clipped=0.0 2023-10-12 11:46:24,641 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1046551.3333333334, ans=0.0 2023-10-12 11:46:32,988 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1046551.3333333334, ans=0.1 2023-10-12 11:46:43,066 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1046598.0, ans=0.0 2023-10-12 11:46:43,890 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1046598.0, ans=0.125 2023-10-12 11:47:02,774 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1046691.3333333334, ans=0.125 2023-10-12 11:47:08,392 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.65 vs. 
limit=22.5 2023-10-12 11:47:23,450 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1046784.6666666666, ans=0.125 2023-10-12 11:47:28,474 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1046784.6666666666, ans=0.2 2023-10-12 11:47:30,171 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1046784.6666666666, ans=0.95 2023-10-12 11:47:45,136 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1046878.0, ans=0.125 2023-10-12 11:47:51,601 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1046878.0, ans=0.125 2023-10-12 11:48:08,833 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.488e+02 1.810e+02 2.041e+02 2.357e+02 3.009e+02, threshold=4.081e+02, percent-clipped=0.0 2023-10-12 11:48:15,299 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1046971.3333333334, ans=0.0 2023-10-12 11:48:20,542 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1047018.0, ans=0.125 2023-10-12 11:48:21,467 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1047018.0, ans=0.125 2023-10-12 11:48:51,745 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1047111.3333333334, ans=0.1 2023-10-12 11:48:52,689 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1047111.3333333334, ans=0.2 2023-10-12 11:48:56,618 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.94 vs. limit=15.0 2023-10-12 11:49:07,686 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.08 vs. limit=15.0 2023-10-12 11:49:26,636 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1047251.3333333334, ans=0.1 2023-10-12 11:49:27,371 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1047251.3333333334, ans=0.125 2023-10-12 11:49:30,003 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1047251.3333333334, ans=0.1 2023-10-12 11:49:46,090 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.53 vs. limit=15.0 2023-10-12 11:49:57,730 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=3.64 vs. 
limit=15.0 2023-10-12 11:50:03,725 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1047391.3333333334, ans=0.125 2023-10-12 11:50:07,128 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.407e+02 1.789e+02 1.936e+02 2.151e+02 3.140e+02, threshold=3.871e+02, percent-clipped=0.0 2023-10-12 11:50:09,866 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1047438.0, ans=0.125 2023-10-12 11:50:13,956 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=5.73 vs. limit=15.0 2023-10-12 11:50:14,013 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.37 vs. limit=15.0 2023-10-12 11:50:26,655 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 11:50:39,690 INFO [train.py:1031] (3/4) Epoch 17, batch 6000, loss[loss=0.1869, simple_loss=0.2887, pruned_loss=0.04262, over 16823.00 frames. ], tot_loss[loss=0.1941, simple_loss=0.2843, pruned_loss=0.05196, over 31193913.52 frames. ], batch size: 98, lr: 2.05e-03, grad_scale: 32.0 2023-10-12 11:50:39,903 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1047578.0, ans=0.0 2023-10-12 11:50:54,553 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1047624.6666666666, ans=0.2 2023-10-12 11:50:56,916 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.37 vs. limit=6.0 2023-10-12 11:51:12,946 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1047718.0, ans=0.125 2023-10-12 11:51:17,871 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1047718.0, ans=0.0 2023-10-12 11:51:19,810 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1047718.0, ans=0.0 2023-10-12 11:51:22,092 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1047718.0, ans=0.125 2023-10-12 11:51:31,781 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.52 vs. limit=15.0 2023-10-12 11:51:40,839 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.79 vs. 
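Note on the scaling.py:199 entries: each tracks a ScheduledFloat, a hyperparameter (a dropout probability, a skip rate, a balancer bound) whose current value `ans` is a function of batch_count. By this point in training (batch_count above one million) every schedule has settled at its final value, e.g. ans=0.1 for the out_proj.dropout_p entries. A piecewise-linear sketch of such a schedule; the class and the breakpoints (0.0, 0.3) and (20000.0, 0.1) are reconstructions for illustration, not the exact scaling.py code:

    class PiecewiseLinearSchedule:
        # Sketch: value(batch_count) interpolates linearly between
        # (batch_count, value) breakpoints and is clamped outside of them.
        def __init__(self, *points):
            self.points = sorted(points)

        def value(self, batch_count: float) -> float:
            pts = self.points
            if batch_count <= pts[0][0]:
                return pts[0][1]
            if batch_count >= pts[-1][0]:
                return pts[-1][1]
            for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
                if x0 <= batch_count <= x1:
                    return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

    dropout_p = PiecewiseLinearSchedule((0.0, 0.3), (20000.0, 0.1))
    assert dropout_p.value(1045338.0) == 0.1  # matches the logged ans=0.1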
limit=15.0 2023-10-12 11:51:47,413 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1047811.3333333334, ans=0.125 2023-10-12 11:51:51,918 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1047858.0, ans=0.0 2023-10-12 11:51:53,959 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1047858.0, ans=0.125 2023-10-12 11:51:56,614 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.48 vs. limit=15.0 2023-10-12 11:52:00,709 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.445e+02 1.730e+02 1.930e+02 2.198e+02 2.972e+02, threshold=3.861e+02, percent-clipped=0.0 2023-10-12 11:52:23,220 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1047998.0, ans=0.0 2023-10-12 11:52:41,611 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1048044.6666666666, ans=0.125 2023-10-12 11:52:56,319 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1048138.0, ans=0.0 2023-10-12 11:53:27,269 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1048278.0, ans=0.125 2023-10-12 11:53:28,562 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.89 vs. limit=15.0 2023-10-12 11:53:31,746 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 11:53:36,671 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1048324.6666666666, ans=0.125 2023-10-12 11:53:48,756 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.394e+02 1.745e+02 1.902e+02 2.152e+02 3.327e+02, threshold=3.805e+02, percent-clipped=0.0 2023-10-12 11:53:53,666 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1048371.3333333334, ans=0.05 2023-10-12 11:54:04,819 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.09 vs. limit=10.0 2023-10-12 11:54:15,733 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1048464.6666666666, ans=0.125 2023-10-12 11:54:20,373 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1048511.3333333334, ans=0.1 2023-10-12 11:54:27,090 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1048511.3333333334, ans=0.125 2023-10-12 11:54:58,110 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=10.21 vs. 
limit=22.5 2023-10-12 11:55:11,532 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1048698.0, ans=0.125 2023-10-12 11:55:35,739 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1048791.3333333333, ans=0.0 2023-10-12 11:55:35,746 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1048791.3333333333, ans=0.125 2023-10-12 11:55:36,654 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1048791.3333333333, ans=0.1 2023-10-12 11:55:40,217 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.512e+02 1.786e+02 1.934e+02 2.131e+02 2.678e+02, threshold=3.869e+02, percent-clipped=0.0 2023-10-12 11:55:45,134 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1048838.0, ans=0.125 2023-10-12 11:55:53,901 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.98 vs. limit=12.0 2023-10-12 11:55:56,379 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1048884.6666666667, ans=0.0 2023-10-12 11:56:03,868 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1048931.3333333333, ans=0.125 2023-10-12 11:56:31,570 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1049024.6666666667, ans=0.0 2023-10-12 11:56:50,535 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1049118.0, ans=0.125 2023-10-12 11:56:58,753 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1049118.0, ans=0.07 2023-10-12 11:57:13,636 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1049211.3333333333, ans=0.125 2023-10-12 11:57:25,499 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1049258.0, ans=0.1 2023-10-12 11:57:42,143 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.475e+02 1.763e+02 1.986e+02 2.211e+02 3.011e+02, threshold=3.972e+02, percent-clipped=0.0 2023-10-12 11:57:45,174 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1049304.6666666667, ans=0.1 2023-10-12 11:57:45,202 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1049304.6666666667, ans=0.125 2023-10-12 11:57:56,436 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1049351.3333333333, ans=0.125 2023-10-12 11:57:59,570 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.75 vs. 
limit=15.0 2023-10-12 11:58:19,048 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1049444.6666666667, ans=0.0 2023-10-12 11:58:23,900 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1049444.6666666667, ans=0.0 2023-10-12 11:58:37,466 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1049538.0, ans=0.125 2023-10-12 11:58:39,765 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1049538.0, ans=0.2 2023-10-12 11:59:07,943 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1049678.0, ans=0.125 2023-10-12 11:59:10,924 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.35 vs. limit=6.0 2023-10-12 11:59:30,777 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.371e+02 1.662e+02 1.895e+02 2.171e+02 3.662e+02, threshold=3.791e+02, percent-clipped=0.0 2023-10-12 11:59:31,175 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1049771.3333333333, ans=0.125 2023-10-12 11:59:34,009 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.78 vs. limit=15.0 2023-10-12 11:59:34,670 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1049771.3333333333, ans=0.0 2023-10-12 11:59:34,813 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1049771.3333333333, ans=0.04949747468305833 2023-10-12 11:59:51,230 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1049818.0, ans=0.125 2023-10-12 11:59:53,944 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.50 vs. limit=10.0 2023-10-12 11:59:55,766 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1049864.6666666667, ans=0.125 2023-10-12 12:00:02,999 INFO [train.py:1031] (3/4) Epoch 17, batch 6500, loss[loss=0.2016, simple_loss=0.2581, pruned_loss=0.07258, over 12248.00 frames. ], tot_loss[loss=0.1943, simple_loss=0.2846, pruned_loss=0.05201, over 31550369.88 frames. 
], batch size: 440, lr: 2.05e-03, grad_scale: 16.0 2023-10-12 12:00:14,304 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1049911.3333333333, ans=0.0 2023-10-12 12:00:14,399 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1049911.3333333333, ans=0.0 2023-10-12 12:00:18,786 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1049958.0, ans=0.0 2023-10-12 12:00:20,652 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1049958.0, ans=0.1 2023-10-12 12:00:20,693 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1049958.0, ans=0.05 2023-10-12 12:00:25,306 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1049958.0, ans=0.1 2023-10-12 12:00:29,542 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1050004.6666666667, ans=0.0 2023-10-12 12:01:02,255 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.75 vs. limit=8.0 2023-10-12 12:01:20,921 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1050144.6666666667, ans=0.125 2023-10-12 12:01:21,945 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1050144.6666666667, ans=0.125 2023-10-12 12:01:36,615 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.487e+02 1.751e+02 1.889e+02 2.086e+02 2.685e+02, threshold=3.778e+02, percent-clipped=0.0 2023-10-12 12:01:38,004 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1050238.0, ans=0.0 2023-10-12 12:01:48,142 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-12 12:02:00,773 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.27 vs. 
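Note on the train.py:1031 entries: each reports a per-batch loss and a running tot_loss, broken into simple_loss and pruned_loss. The logged numbers are consistent with the combination loss = 0.5 * simple_loss + pruned_loss (e.g. 0.5 * 0.2846 + 0.05201 = 0.1943 at batch 6500), and the non-integer tot_loss frame counts ("over 31550369.88 frames") suggest a decayed, frame-weighted running sum. A sketch under those assumptions; the decay constant is a guess, not a value taken from the code:

    def combined_loss(simple_loss, pruned_loss, simple_loss_scale=0.5):
        # Matches the logged relationship, e.g. 0.5 * 0.2846 + 0.05201 = 0.1943.
        return simple_loss_scale * simple_loss + pruned_loss

    class RunningLoss:
        # Frame-weighted running average with exponential forgetting; the
        # fractional frame counts in tot_loss hint at this kind of decay.
        def __init__(self, decay: float = 0.999):
            self.decay, self.loss_sum, self.frames = decay, 0.0, 0.0

        def update(self, loss: float, num_frames: float) -> None:
            self.loss_sum = self.decay * self.loss_sum + loss * num_frames
            self.frames = self.decay * self.frames + num_frames

        @property
        def value(self) -> float:
            return self.loss_sum / max(self.frames, 1.0)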
limit=10.0 2023-10-12 12:02:09,606 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 12:02:13,288 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1050378.0, ans=0.125 2023-10-12 12:02:16,073 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1050378.0, ans=0.2 2023-10-12 12:02:17,129 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1050424.6666666667, ans=0.1 2023-10-12 12:02:22,559 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1050424.6666666667, ans=0.0 2023-10-12 12:02:30,712 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1050471.3333333333, ans=0.2 2023-10-12 12:02:35,721 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1050471.3333333333, ans=0.1 2023-10-12 12:02:37,723 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1050471.3333333333, ans=0.0 2023-10-12 12:02:43,097 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1050518.0, ans=0.1 2023-10-12 12:02:55,381 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1050564.6666666667, ans=0.125 2023-10-12 12:03:19,115 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1050658.0, ans=0.0 2023-10-12 12:03:21,440 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1050658.0, ans=0.0 2023-10-12 12:03:21,524 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1050658.0, ans=0.125 2023-10-12 12:03:25,699 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.465e+02 1.782e+02 1.944e+02 2.207e+02 3.492e+02, threshold=3.888e+02, percent-clipped=0.0 2023-10-12 12:03:29,546 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.85 vs. limit=15.0 2023-10-12 12:03:31,825 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1050704.6666666667, ans=0.125 2023-10-12 12:03:42,058 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.63 vs. limit=22.5 2023-10-12 12:03:51,178 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1050798.0, ans=0.125 2023-10-12 12:04:17,995 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.86 vs. 
limit=22.5 2023-10-12 12:04:45,994 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff3.min_abs, batch_count=1051031.3333333333, ans=0.2 2023-10-12 12:04:52,825 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1051078.0, ans=0.125 2023-10-12 12:04:56,489 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1051078.0, ans=0.125 2023-10-12 12:04:58,902 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.28 vs. limit=10.0 2023-10-12 12:04:59,973 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1051078.0, ans=0.1 2023-10-12 12:05:17,932 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.409e+02 1.618e+02 1.873e+02 2.123e+02 3.198e+02, threshold=3.747e+02, percent-clipped=0.0 2023-10-12 12:05:18,240 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1051171.3333333333, ans=0.2 2023-10-12 12:05:30,046 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1051218.0, ans=0.0 2023-10-12 12:05:31,187 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1051218.0, ans=0.125 2023-10-12 12:05:56,101 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.06 vs. limit=15.0 2023-10-12 12:06:44,143 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 12:06:50,500 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1051498.0, ans=0.1 2023-10-12 12:07:14,374 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-12 12:07:18,620 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1051591.3333333333, ans=0.1 2023-10-12 12:07:25,760 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1051591.3333333333, ans=0.125 2023-10-12 12:07:36,176 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.476e+02 1.673e+02 1.829e+02 2.080e+02 3.657e+02, threshold=3.658e+02, percent-clipped=0.0 2023-10-12 12:07:46,652 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.95 vs. 
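Note on the balancer entries (balancer1.prob, balancer2.min_positive, balancer_ff3.min_abs, and similar): these constrain per-channel activation statistics. The min_positive/max_positive bounds apply to the fraction of positive values in a channel, min_abs/max_abs to its mean absolute value, and prob is the scheduled chance of applying a correction at all (the ans=0.125, 0.05, 0.95, 0.2, 10.0 values seen above). A sketch of the statistics being constrained; the actual module nudges offending channels with corrective gradients in the backward pass, which this diagnostic version does not do:

    import torch

    def balancer_stats(x: torch.Tensor):
        # x: (num_frames, num_channels). Per-channel statistics that the
        # balancer bounds: fraction of positive values and mean absolute value.
        return (x > 0).float().mean(dim=0), x.abs().mean(dim=0)

    def out_of_bounds(x, min_positive=0.05, max_positive=0.95,
                      min_abs=0.2, max_abs=10.0):
        # Default bounds mirror values that appear in the log; they vary per module.
        pos, mean_abs = balancer_stats(x)
        bad = (pos < min_positive) | (pos > max_positive) | \
              (mean_abs < min_abs) | (mean_abs > max_abs)
        return bad  # channels the real module would correct, with probability `prob`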
limit=15.0 2023-10-12 12:07:49,526 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1051684.6666666667, ans=0.125 2023-10-12 12:08:10,914 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1051778.0, ans=0.125 2023-10-12 12:08:11,924 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1051778.0, ans=0.05 2023-10-12 12:08:16,997 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1051778.0, ans=0.09899494936611666 2023-10-12 12:08:30,684 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.40 vs. limit=22.5 2023-10-12 12:08:32,162 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1051871.3333333333, ans=0.0 2023-10-12 12:08:38,269 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.42 vs. limit=22.5 2023-10-12 12:08:45,558 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1051918.0, ans=0.0 2023-10-12 12:08:48,789 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1051918.0, ans=0.125 2023-10-12 12:08:58,642 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1051964.6666666667, ans=0.125 2023-10-12 12:09:22,949 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1052058.0, ans=0.125 2023-10-12 12:09:28,672 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.528e+02 1.872e+02 2.247e+02 2.696e+02 3.848e+02, threshold=4.494e+02, percent-clipped=1.0 2023-10-12 12:09:44,218 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1052151.3333333333, ans=0.0 2023-10-12 12:09:49,280 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1052198.0, ans=0.125 2023-10-12 12:09:54,812 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1052198.0, ans=0.125 2023-10-12 12:09:56,584 INFO [train.py:1031] (3/4) Epoch 17, batch 7000, loss[loss=0.1935, simple_loss=0.2896, pruned_loss=0.04872, over 16591.00 frames. ], tot_loss[loss=0.1945, simple_loss=0.2853, pruned_loss=0.05189, over 31850117.55 frames. ], batch size: 66, lr: 2.05e-03, grad_scale: 16.0 2023-10-12 12:10:01,262 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.50 vs. limit=15.0 2023-10-12 12:10:04,717 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.84 vs. limit=15.0 2023-10-12 12:10:20,415 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.36 vs. 
limit=12.0 2023-10-12 12:10:44,404 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1052431.3333333333, ans=0.0 2023-10-12 12:10:57,542 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1052478.0, ans=0.0 2023-10-12 12:11:08,090 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1052524.6666666667, ans=0.1 2023-10-12 12:11:17,185 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1052571.3333333333, ans=0.125 2023-10-12 12:11:19,371 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.435e+02 1.704e+02 1.901e+02 2.069e+02 3.367e+02, threshold=3.801e+02, percent-clipped=0.0 2023-10-12 12:11:25,237 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1052571.3333333333, ans=0.0 2023-10-12 12:11:46,629 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1052664.6666666667, ans=0.0 2023-10-12 12:11:54,647 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 12:11:56,830 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten.whitening_limit, batch_count=1052711.3333333333, ans=15.0 2023-10-12 12:12:14,652 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1052804.6666666667, ans=0.125 2023-10-12 12:12:23,364 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.37 vs. limit=15.0 2023-10-12 12:12:29,368 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1052851.3333333333, ans=0.125 2023-10-12 12:12:36,640 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=14.41 vs. limit=22.5 2023-10-12 12:12:55,692 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=1052991.3333333333, ans=10.0 2023-10-12 12:13:07,197 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1053038.0, ans=0.125 2023-10-12 12:13:09,138 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1053038.0, ans=0.125 2023-10-12 12:13:10,802 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.447e+02 1.833e+02 1.987e+02 2.292e+02 3.128e+02, threshold=3.974e+02, percent-clipped=0.0 2023-10-12 12:13:12,095 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1053038.0, ans=0.1 2023-10-12 12:13:18,136 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.70 vs. 
limit=22.5 2023-10-12 12:13:18,978 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1053084.6666666667, ans=0.125 2023-10-12 12:13:20,228 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.79 vs. limit=15.0 2023-10-12 12:13:28,527 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1053131.3333333333, ans=0.125 2023-10-12 12:13:44,420 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1053178.0, ans=0.035 2023-10-12 12:13:56,789 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1053224.6666666667, ans=0.125 2023-10-12 12:13:59,983 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.13 vs. limit=15.0 2023-10-12 12:14:08,316 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1053224.6666666667, ans=0.125 2023-10-12 12:14:12,192 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1053271.3333333333, ans=0.0 2023-10-12 12:14:34,144 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1053318.0, ans=0.1 2023-10-12 12:14:45,511 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1053364.6666666667, ans=0.125 2023-10-12 12:14:49,044 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1053364.6666666667, ans=0.125 2023-10-12 12:15:08,407 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1053458.0, ans=0.2 2023-10-12 12:15:10,240 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1053458.0, ans=0.125 2023-10-12 12:15:17,972 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.378e+02 1.704e+02 1.817e+02 1.965e+02 2.679e+02, threshold=3.633e+02, percent-clipped=0.0 2023-10-12 12:15:35,447 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1053598.0, ans=0.125 2023-10-12 12:15:48,628 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1053644.6666666667, ans=0.0 2023-10-12 12:15:54,900 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 12:15:57,030 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1053691.3333333333, ans=0.2 2023-10-12 12:15:57,849 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1053691.3333333333, ans=0.125 2023-10-12 12:16:04,085 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1053691.3333333333, ans=0.0 2023-10-12 12:16:25,221 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, 
num_channels=192, metric=4.46 vs. limit=15.0 2023-10-12 12:16:39,088 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1053831.3333333333, ans=0.1 2023-10-12 12:16:41,718 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 12:16:44,133 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1053831.3333333333, ans=0.5 2023-10-12 12:16:58,502 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1053924.6666666667, ans=0.125 2023-10-12 12:17:12,018 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.429e+02 1.756e+02 1.893e+02 2.102e+02 2.919e+02, threshold=3.787e+02, percent-clipped=0.0 2023-10-12 12:17:20,154 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1054018.0, ans=0.0 2023-10-12 12:17:29,315 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1054018.0, ans=0.125 2023-10-12 12:17:44,741 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.61 vs. limit=6.0 2023-10-12 12:18:07,560 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.79 vs. limit=15.0 2023-10-12 12:18:10,477 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1054204.6666666667, ans=0.125 2023-10-12 12:18:12,471 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1054251.3333333333, ans=0.125 2023-10-12 12:18:21,029 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.41 vs. limit=10.0 2023-10-12 12:18:27,462 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1054298.0, ans=0.0 2023-10-12 12:18:30,518 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.28 vs. limit=15.0 2023-10-12 12:18:33,836 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 12:18:39,827 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1054344.6666666667, ans=0.1 2023-10-12 12:18:40,924 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1054344.6666666667, ans=0.125 2023-10-12 12:19:01,229 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.495e+02 1.762e+02 1.907e+02 2.113e+02 3.239e+02, threshold=3.814e+02, percent-clipped=0.0 2023-10-12 12:19:02,649 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.62 vs. 
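Note on the scaling.py:979 Whitening entries: each compares a per-module statistic against a limit, and the penalty that keeps activations decorrelated only engages when the metric exceeds that limit, so entries such as metric=4.46 vs. limit=15.0 are benign. A stand-in metric with the same flavor, equal to 1.0 for a perfectly white channel covariance and growing with cross-channel correlation; this is not icefall's exact formula:

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> torch.Tensor:
        # x: (num_frames, num_channels). For cov = c * I the metric is exactly
        # 1.0; off-diagonal covariance inflates it.
        frames, channels = x.shape
        g = channels // num_groups
        xg = x.reshape(frames, num_groups, g).permute(1, 0, 2)  # (G, frames, g)
        xg = xg - xg.mean(dim=1, keepdim=True)
        cov = xg.transpose(1, 2) @ xg / frames                  # (G, g, g)
        num = g * cov.pow(2).sum(dim=(1, 2))
        den = cov.diagonal(dim1=1, dim2=2).sum(dim=1).pow(2)
        return (num / den).mean()

    def whitening_penalty(x, num_groups, limit, grad_scale=0.01):
        # Only the excess over the logged limit contributes to the loss.
        return grad_scale * torch.relu(whitening_metric(x, num_groups) - limit)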
limit=15.0 2023-10-12 12:19:30,854 INFO [train.py:1031] (3/4) Epoch 17, batch 7500, loss[loss=0.1885, simple_loss=0.256, pruned_loss=0.06051, over 12345.00 frames. ], tot_loss[loss=0.1942, simple_loss=0.2849, pruned_loss=0.05173, over 32057395.20 frames. ], batch size: 440, lr: 2.05e-03, grad_scale: 16.0 2023-10-12 12:19:42,677 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1054624.6666666667, ans=0.125 2023-10-12 12:19:58,803 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1054671.3333333333, ans=0.0 2023-10-12 12:20:00,039 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.58 vs. limit=15.0 2023-10-12 12:20:06,220 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1054718.0, ans=0.125 2023-10-12 12:20:10,536 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1054718.0, ans=0.0 2023-10-12 12:20:55,314 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.450e+02 1.720e+02 1.911e+02 2.056e+02 2.961e+02, threshold=3.822e+02, percent-clipped=0.0 2023-10-12 12:20:59,734 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1054904.6666666667, ans=0.0 2023-10-12 12:21:03,133 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1054951.3333333333, ans=0.2 2023-10-12 12:21:13,320 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1054998.0, ans=0.125 2023-10-12 12:21:18,451 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1054998.0, ans=0.125 2023-10-12 12:21:28,139 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.28 vs. limit=15.0 2023-10-12 12:21:50,404 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1055138.0, ans=0.125 2023-10-12 12:21:59,529 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1055138.0, ans=10.0 2023-10-12 12:22:06,343 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1055184.6666666667, ans=0.125 2023-10-12 12:22:31,497 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.06 vs. limit=22.5 2023-10-12 12:22:45,045 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.45 vs. limit=10.0 2023-10-12 12:22:55,116 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.78 vs. 
limit=15.0 2023-10-12 12:22:56,870 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.361e+02 1.746e+02 1.908e+02 2.213e+02 2.822e+02, threshold=3.815e+02, percent-clipped=0.0 2023-10-12 12:23:06,663 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1055418.0, ans=0.2 2023-10-12 12:23:30,076 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1055511.3333333333, ans=0.0 2023-10-12 12:23:32,530 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.44 vs. limit=15.0 2023-10-12 12:23:35,399 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1055511.3333333333, ans=0.0 2023-10-12 12:23:42,049 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1055558.0, ans=0.125 2023-10-12 12:23:59,443 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.48 vs. limit=15.0 2023-10-12 12:24:13,188 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1055698.0, ans=0.0 2023-10-12 12:24:33,012 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1055791.3333333333, ans=0.2 2023-10-12 12:24:34,639 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1055791.3333333333, ans=0.0 2023-10-12 12:24:48,454 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.439e+02 1.649e+02 1.833e+02 2.010e+02 2.977e+02, threshold=3.666e+02, percent-clipped=0.0 2023-10-12 12:24:49,062 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=4.92 vs. limit=15.0 2023-10-12 12:25:26,197 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1056024.6666666667, ans=0.125 2023-10-12 12:25:29,389 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.36 vs. limit=6.0 2023-10-12 12:25:41,409 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1056071.3333333333, ans=0.09899494936611666 2023-10-12 12:25:50,907 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.30 vs. limit=22.5 2023-10-12 12:25:52,610 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1056118.0, ans=0.0 2023-10-12 12:26:03,247 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1056118.0, ans=0.0 2023-10-12 12:26:17,395 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=6.17 vs. limit=15.0 2023-10-12 12:26:24,888 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.86 vs. 
limit=15.0 2023-10-12 12:26:32,447 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1056258.0, ans=0.0 2023-10-12 12:26:33,244 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1056258.0, ans=0.125 2023-10-12 12:26:45,408 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.480e+02 1.766e+02 1.924e+02 2.128e+02 2.966e+02, threshold=3.848e+02, percent-clipped=0.0 2023-10-12 12:27:08,263 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1056398.0, ans=0.2 2023-10-12 12:27:12,399 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1056444.6666666667, ans=0.0 2023-10-12 12:27:42,598 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1056538.0, ans=0.125 2023-10-12 12:27:46,264 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1056538.0, ans=0.0 2023-10-12 12:28:08,264 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.06 vs. limit=15.0 2023-10-12 12:28:12,302 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1056678.0, ans=0.125 2023-10-12 12:28:18,212 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.43 vs. limit=5.0 2023-10-12 12:28:18,880 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=13.38 vs. limit=15.0 2023-10-12 12:28:20,237 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1056678.0, ans=0.125 2023-10-12 12:28:21,920 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1056678.0, ans=0.0 2023-10-12 12:28:41,010 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1056771.3333333333, ans=0.1 2023-10-12 12:28:43,602 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.338e+02 1.618e+02 1.746e+02 1.903e+02 2.676e+02, threshold=3.492e+02, percent-clipped=0.0 2023-10-12 12:28:48,463 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1056818.0, ans=0.025 2023-10-12 12:28:50,521 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.09 vs. limit=22.5 2023-10-12 12:28:53,440 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1056818.0, ans=0.125 2023-10-12 12:28:55,119 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.64 vs. 
limit=6.0 2023-10-12 12:29:09,596 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1056911.3333333333, ans=0.125 2023-10-12 12:29:11,147 INFO [train.py:1031] (3/4) Epoch 17, batch 8000, loss[loss=0.1961, simple_loss=0.2894, pruned_loss=0.05147, over 16962.00 frames. ], tot_loss[loss=0.1934, simple_loss=0.2843, pruned_loss=0.05122, over 32229047.22 frames. ], batch size: 138, lr: 2.04e-03, grad_scale: 32.0 2023-10-12 12:30:10,869 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1057144.6666666667, ans=0.125 2023-10-12 12:30:30,030 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1057238.0, ans=0.1 2023-10-12 12:30:32,527 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.366e+02 1.608e+02 1.770e+02 1.954e+02 2.497e+02, threshold=3.541e+02, percent-clipped=0.0 2023-10-12 12:30:43,175 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1057284.6666666667, ans=0.0 2023-10-12 12:30:57,300 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1057331.3333333333, ans=0.0 2023-10-12 12:31:15,501 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-12 12:31:27,361 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.61 vs. limit=15.0 2023-10-12 12:31:28,045 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1057471.3333333333, ans=0.2 2023-10-12 12:31:38,145 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1057518.0, ans=0.125 2023-10-12 12:31:43,761 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1057564.6666666667, ans=0.0 2023-10-12 12:31:43,925 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.89 vs. limit=15.0 2023-10-12 12:31:52,664 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1057564.6666666667, ans=0.125 2023-10-12 12:32:14,788 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1057658.0, ans=0.125 2023-10-12 12:32:34,163 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1057704.6666666667, ans=0.1 2023-10-12 12:32:36,708 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.479e+02 1.811e+02 1.950e+02 2.285e+02 3.064e+02, threshold=3.899e+02, percent-clipped=0.0 2023-10-12 12:32:59,083 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.98 vs. limit=22.5 2023-10-12 12:32:59,959 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.45 vs. 
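Note on the scaling.py:1069 WithLoss entries: these attach an auxiliary penalty to the attention weights and periodically report its accumulated value, so loss-sum=0.000e+00 means the penalty never activated over the reporting interval. One way to inject such a penalty without altering the forward activations is a custom autograd function that adds the penalty's gradient during the backward pass; the following is a generic reconstruction of that pattern, not the exact scaling.py code:

    import torch

    class AttachAuxLoss(torch.autograd.Function):
        # Forward is the identity; backward adds the gradient of penalty_fn(x),
        # so the penalty shapes training while leaving activations untouched.
        @staticmethod
        def forward(ctx, x, penalty_fn):
            ctx.penalty_fn = penalty_fn
            ctx.save_for_backward(x)
            return x

        @staticmethod
        def backward(ctx, grad_out):
            (x,) = ctx.saved_tensors
            with torch.enable_grad():
                xd = x.detach().requires_grad_(True)
                penalty = ctx.penalty_fn(xd)  # must return a scalar
                (g,) = torch.autograd.grad(penalty, xd)
            return grad_out + g, None

    # Hypothetical usage on attention weights:
    # attn = AttachAuxLoss.apply(attn, lambda w: 0.01 * torch.relu(w - 0.9).sum())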
limit=22.5 2023-10-12 12:33:33,985 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1057938.0, ans=0.1 2023-10-12 12:33:39,056 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1057938.0, ans=0.125 2023-10-12 12:33:44,843 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1057984.6666666667, ans=0.0 2023-10-12 12:33:45,727 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1057984.6666666667, ans=0.0 2023-10-12 12:34:33,251 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.450e+02 1.767e+02 1.935e+02 2.120e+02 2.620e+02, threshold=3.870e+02, percent-clipped=0.0 2023-10-12 12:34:46,854 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1058218.0, ans=0.125 2023-10-12 12:34:57,917 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1058264.6666666667, ans=0.2 2023-10-12 12:35:02,969 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1058311.3333333333, ans=0.1 2023-10-12 12:35:05,527 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.93 vs. limit=6.0 2023-10-12 12:35:09,976 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.22 vs. limit=15.0 2023-10-12 12:35:17,369 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1058358.0, ans=0.2 2023-10-12 12:35:18,280 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1058358.0, ans=0.2 2023-10-12 12:35:29,141 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1058404.6666666667, ans=0.125 2023-10-12 12:35:41,127 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1058451.3333333333, ans=0.2 2023-10-12 12:35:42,246 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.52 vs. limit=22.5 2023-10-12 12:35:45,586 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1058498.0, ans=0.1 2023-10-12 12:36:01,773 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.54 vs. 
limit=22.5 2023-10-12 12:36:06,067 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1058544.6666666667, ans=0.1 2023-10-12 12:36:19,414 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1058638.0, ans=0.125 2023-10-12 12:36:21,373 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1058638.0, ans=0.125 2023-10-12 12:36:23,792 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1058638.0, ans=0.125 2023-10-12 12:36:27,768 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.444e+02 1.755e+02 1.937e+02 2.086e+02 2.976e+02, threshold=3.874e+02, percent-clipped=0.0 2023-10-12 12:36:32,433 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1058684.6666666667, ans=0.0 2023-10-12 12:36:36,671 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=1058684.6666666667, ans=0.05 2023-10-12 12:36:45,254 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1058731.3333333333, ans=0.0 2023-10-12 12:36:47,849 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1058731.3333333333, ans=0.0 2023-10-12 12:36:56,365 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1058778.0, ans=0.0 2023-10-12 12:37:32,765 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1058918.0, ans=0.0 2023-10-12 12:37:45,077 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1058964.6666666667, ans=0.2 2023-10-12 12:37:54,258 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1059011.3333333333, ans=10.0 2023-10-12 12:37:58,175 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1059011.3333333333, ans=0.0 2023-10-12 12:38:05,079 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1059058.0, ans=0.125 2023-10-12 12:38:18,302 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=1059104.6666666667, ans=15.0 2023-10-12 12:38:20,151 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1059104.6666666667, ans=0.125 2023-10-12 12:38:24,142 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.393e+02 1.742e+02 1.967e+02 2.236e+02 2.982e+02, threshold=3.934e+02, percent-clipped=0.0 2023-10-12 12:38:32,137 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1059151.3333333333, ans=0.125 2023-10-12 12:38:36,799 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1059151.3333333333, ans=0.125 2023-10-12 12:38:40,528 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, 
batch_count=1059151.3333333333, ans=0.0 2023-10-12 12:38:47,659 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1059198.0, ans=0.125 2023-10-12 12:38:54,351 INFO [train.py:1031] (3/4) Epoch 17, batch 8500, loss[loss=0.1934, simple_loss=0.2745, pruned_loss=0.05617, over 15346.00 frames. ], tot_loss[loss=0.1938, simple_loss=0.2848, pruned_loss=0.05136, over 32366939.92 frames. ], batch size: 35, lr: 2.04e-03, grad_scale: 16.0 2023-10-12 12:38:57,510 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1059244.6666666667, ans=0.125 2023-10-12 12:39:06,385 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1059291.3333333333, ans=0.125 2023-10-12 12:39:06,414 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1059291.3333333333, ans=0.2 2023-10-12 12:39:07,779 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1059291.3333333333, ans=0.125 2023-10-12 12:39:29,719 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1059384.6666666667, ans=0.0 2023-10-12 12:39:30,195 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.40 vs. limit=15.0 2023-10-12 12:39:45,637 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1059431.3333333333, ans=0.125 2023-10-12 12:39:50,344 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=6.40 vs. limit=15.0 2023-10-12 12:40:12,370 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=1059524.6666666667, ans=0.07 2023-10-12 12:40:17,379 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1059571.3333333333, ans=0.125 2023-10-12 12:40:23,036 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.392e+02 1.834e+02 2.054e+02 2.316e+02 3.239e+02, threshold=4.109e+02, percent-clipped=0.0 2023-10-12 12:40:31,998 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.21 vs. limit=15.0 2023-10-12 12:40:49,855 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.41 vs. limit=15.0 2023-10-12 12:40:59,851 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1059711.3333333333, ans=0.0 2023-10-12 12:41:12,529 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1059758.0, ans=0.0 2023-10-12 12:41:22,461 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.86 vs. 
limit=15.0 2023-10-12 12:41:26,188 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1059804.6666666667, ans=0.1 2023-10-12 12:41:47,999 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1059898.0, ans=0.125 2023-10-12 12:41:58,368 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1059944.6666666667, ans=0.0 2023-10-12 12:42:26,487 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1060038.0, ans=0.125 2023-10-12 12:42:28,318 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.351e+02 1.669e+02 1.819e+02 1.997e+02 2.629e+02, threshold=3.638e+02, percent-clipped=0.0 2023-10-12 12:42:46,141 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.31 vs. limit=6.0 2023-10-12 12:43:20,651 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1060271.3333333333, ans=0.1 2023-10-12 12:43:22,522 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.04 vs. limit=15.0 2023-10-12 12:43:56,267 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 12:44:03,609 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1060458.0, ans=0.0 2023-10-12 12:44:07,825 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1060458.0, ans=0.125 2023-10-12 12:44:26,934 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.412e+02 1.628e+02 1.744e+02 1.938e+02 2.517e+02, threshold=3.487e+02, percent-clipped=0.0 2023-10-12 12:44:44,075 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1060598.0, ans=0.0 2023-10-12 12:45:03,882 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1060691.3333333333, ans=0.125 2023-10-12 12:45:04,739 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1060691.3333333333, ans=0.1 2023-10-12 12:45:13,534 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 12:45:28,780 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1060784.6666666667, ans=0.125 2023-10-12 12:45:38,125 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1060831.3333333333, ans=0.0 2023-10-12 12:45:43,098 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1060878.0, ans=0.1 2023-10-12 12:45:43,987 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1060878.0, ans=0.125 2023-10-12 12:46:04,898 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1060971.3333333333, ans=0.125 2023-10-12 12:46:13,484 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.50 vs. limit=10.0 2023-10-12 12:46:14,337 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.250e+02 1.695e+02 1.807e+02 1.988e+02 3.231e+02, threshold=3.613e+02, percent-clipped=0.0 2023-10-12 12:46:27,398 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1061064.6666666667, ans=0.125 2023-10-12 12:46:52,296 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1061158.0, ans=0.2 2023-10-12 12:46:55,137 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1061158.0, ans=0.0 2023-10-12 12:47:06,999 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.23 vs. limit=10.0 2023-10-12 12:47:16,963 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1061251.3333333333, ans=0.125 2023-10-12 12:47:24,205 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1061298.0, ans=0.0 2023-10-12 12:47:31,176 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1061344.6666666667, ans=0.125 2023-10-12 12:47:47,425 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1061391.3333333333, ans=0.125 2023-10-12 12:47:53,445 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1061438.0, ans=0.125 2023-10-12 12:48:00,941 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1061438.0, ans=0.125 2023-10-12 12:48:04,806 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.437e+02 1.721e+02 1.907e+02 2.136e+02 3.255e+02, threshold=3.815e+02, percent-clipped=0.0 2023-10-12 12:48:05,172 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1061484.6666666667, ans=0.125 2023-10-12 12:48:18,066 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.86 vs. limit=6.0 2023-10-12 12:48:20,273 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1061531.3333333333, ans=0.125 2023-10-12 12:48:25,158 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1061531.3333333333, ans=0.125 2023-10-12 12:48:26,731 INFO [train.py:1031] (3/4) Epoch 17, batch 9000, loss[loss=0.1886, simple_loss=0.2769, pruned_loss=0.05017, over 15394.00 frames. ], tot_loss[loss=0.1932, simple_loss=0.2842, pruned_loss=0.05109, over 32475403.40 frames. 
], batch size: 35, lr: 2.04e-03, grad_scale: 8.0 2023-10-12 12:48:39,404 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1061624.6666666667, ans=0.07 2023-10-12 12:48:40,448 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1061624.6666666667, ans=0.125 2023-10-12 12:48:50,416 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1061671.3333333333, ans=0.125 2023-10-12 12:49:00,120 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1061718.0, ans=0.1 2023-10-12 12:49:07,462 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.03 vs. limit=15.0 2023-10-12 12:49:29,307 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1061858.0, ans=0.0 2023-10-12 12:49:40,049 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1061904.6666666667, ans=0.125 2023-10-12 12:49:45,679 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 12:49:47,208 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.385e+02 1.676e+02 1.876e+02 2.157e+02 3.505e+02, threshold=3.753e+02, percent-clipped=0.0 2023-10-12 12:50:08,968 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1062044.6666666667, ans=0.125 2023-10-12 12:50:43,461 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=10.46 vs. limit=22.5 2023-10-12 12:50:52,242 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.42 vs. limit=22.5 2023-10-12 12:50:54,882 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.13 vs. limit=15.0 2023-10-12 12:51:00,360 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.18 vs. limit=22.5 2023-10-12 12:51:20,096 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1062324.6666666667, ans=0.125 2023-10-12 12:51:23,539 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1062324.6666666667, ans=0.125 2023-10-12 12:51:37,793 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.475e+02 1.774e+02 1.870e+02 2.130e+02 3.113e+02, threshold=3.739e+02, percent-clipped=0.0 2023-10-12 12:51:41,029 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.96 vs. 
limit=6.0 2023-10-12 12:51:41,701 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1062418.0, ans=0.09899494936611666 2023-10-12 12:51:47,747 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1062418.0, ans=0.125 2023-10-12 12:52:06,742 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.90 vs. limit=6.0 2023-10-12 12:52:33,115 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1062651.3333333333, ans=0.125 2023-10-12 12:52:38,089 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1062651.3333333333, ans=0.1 2023-10-12 12:52:38,096 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1062651.3333333333, ans=0.125 2023-10-12 12:52:50,835 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.16 vs. limit=22.5 2023-10-12 12:52:53,570 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.53 vs. limit=15.0 2023-10-12 12:52:54,213 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1062744.6666666667, ans=0.125 2023-10-12 12:53:14,970 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1062838.0, ans=0.1 2023-10-12 12:53:15,155 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.71 vs. limit=6.0 2023-10-12 12:53:16,982 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1062838.0, ans=0.125 2023-10-12 12:53:22,489 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.486e+02 1.715e+02 1.910e+02 2.133e+02 3.286e+02, threshold=3.820e+02, percent-clipped=0.0 2023-10-12 12:53:30,423 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.14 vs. 
limit=6.0 2023-10-12 12:53:57,295 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1063024.6666666667, ans=0.0 2023-10-12 12:54:35,096 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1063164.6666666667, ans=0.125 2023-10-12 12:54:45,882 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1063211.3333333333, ans=0.09899494936611666 2023-10-12 12:54:48,745 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1063211.3333333333, ans=0.1 2023-10-12 12:54:57,586 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1063258.0, ans=0.125 2023-10-12 12:55:10,389 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1063258.0, ans=0.125 2023-10-12 12:55:11,748 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1063304.6666666667, ans=0.0 2023-10-12 12:55:12,737 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1063304.6666666667, ans=0.0 2023-10-12 12:55:23,148 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.490e+02 1.816e+02 1.959e+02 2.108e+02 2.932e+02, threshold=3.919e+02, percent-clipped=0.0 2023-10-12 12:55:35,441 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1063398.0, ans=0.1 2023-10-12 12:55:47,627 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.72 vs. limit=15.0 2023-10-12 12:56:01,815 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1063491.3333333333, ans=0.125 2023-10-12 12:56:08,203 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1063538.0, ans=0.1 2023-10-12 12:56:35,713 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.34 vs. limit=15.0 2023-10-12 12:56:35,805 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=14.40 vs. limit=22.5 2023-10-12 12:56:36,486 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1063631.3333333333, ans=0.125 2023-10-12 12:56:44,816 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.98 vs. limit=22.5 2023-10-12 12:56:51,209 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.25 vs. 
limit=6.0 2023-10-12 12:56:54,762 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1063724.6666666667, ans=0.1 2023-10-12 12:57:19,133 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.530e+02 1.839e+02 2.048e+02 2.293e+02 3.164e+02, threshold=4.096e+02, percent-clipped=0.0 2023-10-12 12:57:29,855 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1063818.0, ans=0.1 2023-10-12 12:57:33,644 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1063864.6666666667, ans=0.125 2023-10-12 12:57:35,103 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1063864.6666666667, ans=0.1 2023-10-12 12:57:36,244 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.69 vs. limit=15.0 2023-10-12 12:57:37,135 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.42 vs. limit=10.0 2023-10-12 12:57:40,026 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.50 vs. limit=6.0 2023-10-12 12:57:41,878 INFO [train.py:1031] (3/4) Epoch 17, batch 9500, loss[loss=0.1961, simple_loss=0.284, pruned_loss=0.05412, over 16914.00 frames. ], tot_loss[loss=0.1938, simple_loss=0.2849, pruned_loss=0.05141, over 32533113.42 frames. ], batch size: 72, lr: 2.04e-03, grad_scale: 8.0 2023-10-12 12:57:48,726 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1063911.3333333333, ans=0.0 2023-10-12 12:57:50,767 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1063911.3333333333, ans=0.0 2023-10-12 12:58:32,914 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1064098.0, ans=0.2 2023-10-12 12:58:34,686 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1064098.0, ans=0.125 2023-10-12 12:58:35,738 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1064098.0, ans=0.2 2023-10-12 12:58:47,813 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.98 vs. limit=15.0 2023-10-12 12:58:53,060 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1064191.3333333333, ans=0.0 2023-10-12 12:59:01,375 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1064238.0, ans=0.2 2023-10-12 12:59:10,509 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.356e+02 1.704e+02 1.873e+02 2.130e+02 2.900e+02, threshold=3.745e+02, percent-clipped=0.0 2023-10-12 12:59:53,252 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.91 vs. 
limit=15.0 2023-10-12 13:00:03,530 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1064471.3333333333, ans=0.125 2023-10-12 13:00:03,621 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 13:00:09,992 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=1064518.0, ans=0.025 2023-10-12 13:00:56,232 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1064704.6666666667, ans=0.1 2023-10-12 13:00:56,348 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1064704.6666666667, ans=0.2 2023-10-12 13:01:01,465 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1064704.6666666667, ans=0.125 2023-10-12 13:01:06,477 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.371e+02 1.691e+02 1.938e+02 2.154e+02 3.488e+02, threshold=3.875e+02, percent-clipped=0.0 2023-10-12 13:01:24,956 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.28 vs. limit=15.0 2023-10-12 13:01:38,194 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1064891.3333333333, ans=0.1 2023-10-12 13:01:46,738 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1064891.3333333333, ans=0.125 2023-10-12 13:01:46,951 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.81 vs. limit=10.0 2023-10-12 13:02:12,840 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1065031.3333333333, ans=0.1 2023-10-12 13:02:25,983 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1065078.0, ans=0.2 2023-10-12 13:02:58,367 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.447e+02 1.804e+02 1.966e+02 2.153e+02 2.862e+02, threshold=3.932e+02, percent-clipped=0.0 2023-10-12 13:03:04,266 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1065218.0, ans=0.2 2023-10-12 13:03:07,646 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1065218.0, ans=0.0 2023-10-12 13:03:16,660 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.00 vs. limit=15.0 2023-10-12 13:03:26,150 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.51 vs. limit=12.0 2023-10-12 13:03:33,784 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1065358.0, ans=0.125 2023-10-12 13:03:39,323 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=3.92 vs. 
limit=10.0 2023-10-12 13:03:57,297 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.59 vs. limit=15.0 2023-10-12 13:04:07,596 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1065498.0, ans=0.125 2023-10-12 13:04:38,821 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1065638.0, ans=0.125 2023-10-12 13:04:47,684 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.458e+02 1.679e+02 1.810e+02 2.151e+02 3.239e+02, threshold=3.621e+02, percent-clipped=0.0 2023-10-12 13:05:26,041 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.29 vs. limit=15.0 2023-10-12 13:05:30,143 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1065871.3333333333, ans=0.125 2023-10-12 13:05:36,076 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1065871.3333333333, ans=0.125 2023-10-12 13:05:42,402 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1065918.0, ans=0.125 2023-10-12 13:05:52,927 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1065964.6666666667, ans=0.125 2023-10-12 13:05:54,740 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1065964.6666666667, ans=0.125 2023-10-12 13:05:57,156 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1065964.6666666667, ans=0.0 2023-10-12 13:05:58,114 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1065964.6666666667, ans=0.1 2023-10-12 13:06:16,016 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.83 vs. limit=22.5 2023-10-12 13:06:19,968 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1066058.0, ans=0.2 2023-10-12 13:06:25,210 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.97 vs. limit=15.0 2023-10-12 13:06:27,094 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=1066104.6666666667, ans=22.5 2023-10-12 13:06:28,843 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.90 vs. 
limit=22.5 2023-10-12 13:06:33,829 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.442e+02 1.705e+02 1.860e+02 2.121e+02 3.213e+02, threshold=3.721e+02, percent-clipped=0.0 2023-10-12 13:06:42,397 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1066198.0, ans=0.1 2023-10-12 13:06:47,797 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.whiten.whitening_limit, batch_count=1066198.0, ans=12.0 2023-10-12 13:06:53,062 INFO [train.py:1031] (3/4) Epoch 17, batch 10000, loss[loss=0.1994, simple_loss=0.2734, pruned_loss=0.06266, over 16000.00 frames. ], tot_loss[loss=0.1932, simple_loss=0.2841, pruned_loss=0.05111, over 32603460.42 frames. ], batch size: 296, lr: 2.04e-03, grad_scale: 16.0 2023-10-12 13:06:57,825 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.13 vs. limit=12.0 2023-10-12 13:07:02,028 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1066244.6666666667, ans=0.125 2023-10-12 13:07:10,110 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1066291.3333333333, ans=0.05 2023-10-12 13:07:46,923 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1066478.0, ans=0.125 2023-10-12 13:08:10,762 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1066571.3333333333, ans=0.0 2023-10-12 13:08:14,347 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 13:08:22,015 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.390e+02 1.734e+02 1.873e+02 2.050e+02 3.057e+02, threshold=3.746e+02, percent-clipped=0.0 2023-10-12 13:08:35,132 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.73 vs. limit=15.0 2023-10-12 13:08:48,085 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1066711.3333333333, ans=0.125 2023-10-12 13:09:23,032 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1066851.3333333333, ans=0.125 2023-10-12 13:09:30,644 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 13:10:09,241 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=1067038.0, ans=0.07 2023-10-12 13:10:11,390 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.whiten.whitening_limit, batch_count=1067084.6666666667, ans=12.0 2023-10-12 13:10:11,879 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.495e+02 1.747e+02 1.915e+02 2.150e+02 3.032e+02, threshold=3.830e+02, percent-clipped=0.0 2023-10-12 13:10:22,391 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.55 vs. 
limit=15.0 2023-10-12 13:10:47,805 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1067224.6666666667, ans=0.0 2023-10-12 13:11:17,750 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.74 vs. limit=6.0 2023-10-12 13:11:29,789 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.27 vs. limit=15.0 2023-10-12 13:11:32,454 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1067411.3333333333, ans=0.125 2023-10-12 13:11:33,924 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=1067411.3333333333, ans=0.05 2023-10-12 13:11:43,313 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1067458.0, ans=0.035 2023-10-12 13:12:07,449 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1067551.3333333333, ans=0.2 2023-10-12 13:12:09,254 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.329e+02 1.779e+02 1.929e+02 2.141e+02 2.754e+02, threshold=3.858e+02, percent-clipped=0.0 2023-10-12 13:12:54,519 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1067738.0, ans=0.125 2023-10-12 13:12:54,545 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1067738.0, ans=0.125 2023-10-12 13:13:14,965 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1067831.3333333333, ans=0.2 2023-10-12 13:13:18,679 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1067831.3333333333, ans=0.0 2023-10-12 13:13:21,854 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.61 vs. limit=15.0 2023-10-12 13:14:02,365 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.397e+02 1.713e+02 1.858e+02 2.086e+02 2.743e+02, threshold=3.716e+02, percent-clipped=0.0 2023-10-12 13:14:26,527 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1068111.3333333333, ans=0.125 2023-10-12 13:14:34,408 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1068111.3333333333, ans=10.0 2023-10-12 13:14:42,269 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.65 vs. limit=15.0 2023-10-12 13:15:10,757 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.37 vs. limit=15.0 2023-10-12 13:15:14,579 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.95 vs. 
limit=15.0 2023-10-12 13:15:22,941 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.11 vs. limit=15.0 2023-10-12 13:15:24,994 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.89 vs. limit=15.0 2023-10-12 13:15:27,456 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1068344.6666666667, ans=0.0 2023-10-12 13:15:30,407 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.44 vs. limit=15.0 2023-10-12 13:15:39,970 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1068391.3333333333, ans=0.1 2023-10-12 13:15:53,871 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.470e+02 1.763e+02 2.004e+02 2.248e+02 3.077e+02, threshold=4.008e+02, percent-clipped=0.0 2023-10-12 13:16:06,841 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 13:16:09,579 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1068531.3333333333, ans=0.0 2023-10-12 13:16:13,580 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1068578.0, ans=0.2 2023-10-12 13:16:14,297 INFO [train.py:1031] (3/4) Epoch 17, batch 10500, loss[loss=0.1944, simple_loss=0.2896, pruned_loss=0.04956, over 16891.00 frames. ], tot_loss[loss=0.1935, simple_loss=0.2846, pruned_loss=0.05127, over 32639937.81 frames. ], batch size: 87, lr: 2.03e-03, grad_scale: 32.0 2023-10-12 13:16:37,325 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1068671.3333333333, ans=0.125 2023-10-12 13:16:43,913 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1068718.0, ans=0.2 2023-10-12 13:16:45,352 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1068718.0, ans=0.125 2023-10-12 13:16:55,344 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.36 vs. limit=15.0 2023-10-12 13:16:59,802 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1068764.6666666667, ans=0.125 2023-10-12 13:17:03,714 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1068764.6666666667, ans=0.0 2023-10-12 13:17:03,978 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.97 vs. 
limit=15.0 2023-10-12 13:17:08,732 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1068811.3333333333, ans=0.0 2023-10-12 13:17:18,123 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1068858.0, ans=0.2 2023-10-12 13:17:48,684 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.431e+02 1.757e+02 1.856e+02 2.039e+02 2.751e+02, threshold=3.712e+02, percent-clipped=0.0 2023-10-12 13:17:55,825 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1068951.3333333333, ans=0.1 2023-10-12 13:17:55,833 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1068951.3333333333, ans=0.2 2023-10-12 13:18:25,500 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1069091.3333333333, ans=0.0 2023-10-12 13:18:27,991 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1069091.3333333333, ans=0.2 2023-10-12 13:18:59,199 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1069231.3333333333, ans=0.125 2023-10-12 13:18:59,285 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1069231.3333333333, ans=0.125 2023-10-12 13:19:00,186 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1069231.3333333333, ans=0.1 2023-10-12 13:19:02,206 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff2.min_abs, batch_count=1069231.3333333333, ans=0.1 2023-10-12 13:19:06,914 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1069278.0, ans=0.125 2023-10-12 13:19:07,104 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=16.45 vs. 
limit=22.5 2023-10-12 13:19:13,865 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1069278.0, ans=0.2 2023-10-12 13:19:26,380 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1069371.3333333333, ans=0.0 2023-10-12 13:19:35,294 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1069371.3333333333, ans=0.125 2023-10-12 13:19:36,951 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1069371.3333333333, ans=0.0 2023-10-12 13:19:36,998 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1069371.3333333333, ans=0.125 2023-10-12 13:19:42,701 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.462e+02 1.765e+02 1.930e+02 2.122e+02 3.086e+02, threshold=3.859e+02, percent-clipped=0.0 2023-10-12 13:19:48,989 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1069418.0, ans=0.125 2023-10-12 13:19:50,112 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 13:20:15,116 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.38 vs. limit=6.0 2023-10-12 13:20:53,889 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1069698.0, ans=0.125 2023-10-12 13:21:06,269 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1069744.6666666667, ans=0.0 2023-10-12 13:21:26,600 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1069838.0, ans=0.0 2023-10-12 13:21:32,996 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.497e+02 1.875e+02 2.089e+02 2.329e+02 3.653e+02, threshold=4.177e+02, percent-clipped=0.0 2023-10-12 13:21:38,673 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1069884.6666666667, ans=0.125 2023-10-12 13:21:48,157 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=10.58 vs. limit=12.0 2023-10-12 13:21:51,549 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=1069931.3333333333, ans=0.025 2023-10-12 13:21:52,648 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=10.47 vs. limit=22.5 2023-10-12 13:22:04,872 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1070024.6666666667, ans=0.0 2023-10-12 13:22:09,629 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=1070024.6666666667, ans=0.5 2023-10-12 13:22:45,804 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.30 vs. 
limit=22.5 2023-10-12 13:23:03,145 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1070258.0, ans=0.125 2023-10-12 13:23:07,354 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.57 vs. limit=15.0 2023-10-12 13:23:09,831 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.47 vs. limit=22.5 2023-10-12 13:23:11,732 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1070304.6666666667, ans=0.125 2023-10-12 13:23:13,052 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1070304.6666666667, ans=0.015 2023-10-12 13:23:19,231 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1070304.6666666667, ans=0.0 2023-10-12 13:23:23,971 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.415e+02 1.702e+02 1.853e+02 2.050e+02 3.321e+02, threshold=3.706e+02, percent-clipped=0.0 2023-10-12 13:23:37,243 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1070398.0, ans=0.015 2023-10-12 13:23:47,056 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1070444.6666666667, ans=0.125 2023-10-12 13:23:48,853 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1070444.6666666667, ans=0.0 2023-10-12 13:23:51,798 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1070444.6666666667, ans=0.04949747468305833 2023-10-12 13:24:04,929 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1070538.0, ans=0.0 2023-10-12 13:24:07,317 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1070538.0, ans=0.125 2023-10-12 13:24:15,509 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1070584.6666666667, ans=0.1 2023-10-12 13:24:24,084 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1070584.6666666667, ans=0.07 2023-10-12 13:24:26,226 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1070631.3333333333, ans=0.2 2023-10-12 13:24:28,968 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1070631.3333333333, ans=0.0 2023-10-12 13:24:29,324 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.79 vs. limit=22.5 2023-10-12 13:24:36,891 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.10 vs. limit=22.5 2023-10-12 13:24:48,043 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.41 vs. 
limit=15.0 2023-10-12 13:25:14,080 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.526e+02 1.790e+02 1.989e+02 2.238e+02 3.576e+02, threshold=3.979e+02, percent-clipped=0.0 2023-10-12 13:25:24,257 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1070864.6666666667, ans=0.125 2023-10-12 13:25:25,047 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1070864.6666666667, ans=0.0 2023-10-12 13:25:32,116 INFO [train.py:1031] (3/4) Epoch 17, batch 11000, loss[loss=0.1828, simple_loss=0.2875, pruned_loss=0.0391, over 16780.00 frames. ], tot_loss[loss=0.1935, simple_loss=0.2845, pruned_loss=0.0513, over 32652292.05 frames. ], batch size: 188, lr: 2.03e-03, grad_scale: 16.0 2023-10-12 13:25:40,081 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1070911.3333333333, ans=0.125 2023-10-12 13:25:42,770 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=7.559e-02 2023-10-12 13:25:55,099 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1071004.6666666667, ans=0.1 2023-10-12 13:26:14,846 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.67 vs. limit=15.0 2023-10-12 13:26:17,284 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1071098.0, ans=0.125 2023-10-12 13:26:51,766 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1071238.0, ans=0.0 2023-10-12 13:26:56,213 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1071238.0, ans=0.125 2023-10-12 13:27:03,912 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.400e+02 1.735e+02 1.926e+02 2.173e+02 3.469e+02, threshold=3.852e+02, percent-clipped=0.0 2023-10-12 13:27:28,707 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1071378.0, ans=0.125 2023-10-12 13:27:38,017 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1071424.6666666667, ans=0.125 2023-10-12 13:27:46,673 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1071424.6666666667, ans=0.0 2023-10-12 13:27:54,657 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1071471.3333333333, ans=0.09899494936611666 2023-10-12 13:28:35,162 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=15.75 vs. 
limit=22.5 2023-10-12 13:28:47,698 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1071658.0, ans=0.125 2023-10-12 13:28:53,344 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1071704.6666666667, ans=0.0 2023-10-12 13:28:53,624 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.58 vs. limit=22.5 2023-10-12 13:29:03,152 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.340e+02 1.618e+02 1.812e+02 1.995e+02 3.412e+02, threshold=3.624e+02, percent-clipped=0.0 2023-10-12 13:29:10,846 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1071798.0, ans=0.0 2023-10-12 13:29:11,872 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1071798.0, ans=0.125 2023-10-12 13:29:15,315 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1071798.0, ans=0.1 2023-10-12 13:29:19,121 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1071798.0, ans=0.125 2023-10-12 13:29:21,882 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1071844.6666666667, ans=0.0 2023-10-12 13:29:32,430 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 13:29:33,467 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1071891.3333333333, ans=0.035 2023-10-12 13:29:40,352 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1071891.3333333333, ans=0.125 2023-10-12 13:29:47,800 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1071938.0, ans=0.0 2023-10-12 13:30:02,036 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1071984.6666666667, ans=0.0 2023-10-12 13:30:04,833 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1072031.3333333333, ans=0.125 2023-10-12 13:30:08,369 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1072031.3333333333, ans=0.0 2023-10-12 13:30:11,210 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1072031.3333333333, ans=0.1 2023-10-12 13:30:24,716 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=10.44 vs. 
limit=22.5 2023-10-12 13:30:39,099 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1072171.3333333333, ans=0.125 2023-10-12 13:30:44,216 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1072171.3333333333, ans=0.04949747468305833 2023-10-12 13:30:53,609 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1072218.0, ans=0.0 2023-10-12 13:30:55,149 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.468e+02 1.707e+02 1.882e+02 2.064e+02 2.663e+02, threshold=3.763e+02, percent-clipped=0.0 2023-10-12 13:30:58,589 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=3.90 vs. limit=10.0 2023-10-12 13:31:12,683 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.59 vs. limit=15.0 2023-10-12 13:31:16,559 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1072311.3333333333, ans=0.125 2023-10-12 13:31:17,495 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1072311.3333333333, ans=0.125 2023-10-12 13:31:34,691 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1072358.0, ans=0.125 2023-10-12 13:31:37,891 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1072404.6666666667, ans=0.125 2023-10-12 13:31:41,172 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1072404.6666666667, ans=0.1 2023-10-12 13:31:42,251 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1072404.6666666667, ans=0.1 2023-10-12 13:31:50,976 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1072451.3333333333, ans=0.0 2023-10-12 13:32:13,785 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.12 vs. limit=15.0 2023-10-12 13:32:37,707 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1072638.0, ans=0.1 2023-10-12 13:32:48,008 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.527e+02 1.792e+02 1.979e+02 2.254e+02 3.040e+02, threshold=3.957e+02, percent-clipped=0.0 2023-10-12 13:32:52,707 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.91 vs. limit=15.0 2023-10-12 13:32:53,423 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.22 vs. 
limit=15.0 2023-10-12 13:32:53,958 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1072684.6666666667, ans=0.0 2023-10-12 13:33:00,449 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=1072731.3333333333, ans=0.025 2023-10-12 13:33:10,746 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1072778.0, ans=0.125 2023-10-12 13:33:10,781 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1072778.0, ans=0.1 2023-10-12 13:33:16,832 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1072824.6666666667, ans=0.0 2023-10-12 13:33:19,152 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1072824.6666666667, ans=0.0 2023-10-12 13:33:40,583 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1072918.0, ans=0.125 2023-10-12 13:33:42,193 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.92 vs. limit=6.0 2023-10-12 13:33:43,472 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1072918.0, ans=0.125 2023-10-12 13:33:58,435 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.78 vs. limit=6.0 2023-10-12 13:34:45,900 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.541e+02 1.824e+02 2.018e+02 2.300e+02 3.205e+02, threshold=4.037e+02, percent-clipped=0.0 2023-10-12 13:35:06,606 INFO [train.py:1031] (3/4) Epoch 17, batch 11500, loss[loss=0.1969, simple_loss=0.2932, pruned_loss=0.05031, over 16607.00 frames. ], tot_loss[loss=0.1933, simple_loss=0.2841, pruned_loss=0.05125, over 32664338.00 frames. ], batch size: 241, lr: 2.03e-03, grad_scale: 32.0 2023-10-12 13:35:10,780 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1073244.6666666667, ans=0.125 2023-10-12 13:35:12,822 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1073244.6666666667, ans=0.1 2023-10-12 13:35:26,138 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1073291.3333333333, ans=0.125 2023-10-12 13:35:30,016 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1073338.0, ans=0.125 2023-10-12 13:35:48,189 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1073431.3333333333, ans=0.2 2023-10-12 13:36:12,833 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.67 vs. limit=15.0 2023-10-12 13:36:29,487 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.42 vs. 
limit=12.0 2023-10-12 13:36:32,953 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1073571.3333333333, ans=0.125 2023-10-12 13:36:36,270 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1073571.3333333333, ans=0.0 2023-10-12 13:36:36,287 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1073571.3333333333, ans=0.125 2023-10-12 13:36:42,651 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.412e+02 1.697e+02 1.844e+02 2.086e+02 2.731e+02, threshold=3.689e+02, percent-clipped=0.0 2023-10-12 13:36:50,544 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1073618.0, ans=0.125 2023-10-12 13:37:05,766 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1073711.3333333333, ans=0.125 2023-10-12 13:37:51,974 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1073851.3333333333, ans=0.125 2023-10-12 13:37:53,830 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1073898.0, ans=0.125 2023-10-12 13:38:01,827 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1073898.0, ans=0.0 2023-10-12 13:38:01,875 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=1073898.0, ans=0.05 2023-10-12 13:38:03,639 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1073898.0, ans=0.125 2023-10-12 13:38:04,002 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.25 vs. 
limit=15.0 2023-10-12 13:38:24,222 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1073991.3333333333, ans=0.0 2023-10-12 13:38:36,837 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1074038.0, ans=0.125 2023-10-12 13:38:43,321 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.420e+02 1.642e+02 1.835e+02 2.085e+02 2.917e+02, threshold=3.670e+02, percent-clipped=0.0 2023-10-12 13:38:45,960 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1074084.6666666667, ans=0.125 2023-10-12 13:38:52,980 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 13:39:10,612 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1074178.0, ans=0.2 2023-10-12 13:39:14,378 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1074224.6666666667, ans=0.2 2023-10-12 13:39:14,392 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1074224.6666666667, ans=0.125 2023-10-12 13:39:19,908 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1074224.6666666667, ans=0.0 2023-10-12 13:39:27,293 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1074271.3333333333, ans=0.1 2023-10-12 13:39:29,819 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1074271.3333333333, ans=0.2 2023-10-12 13:39:40,538 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1074318.0, ans=0.125 2023-10-12 13:39:42,209 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=1074318.0, ans=0.95 2023-10-12 13:39:44,125 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 13:40:02,814 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.27 vs. limit=10.0 2023-10-12 13:40:04,058 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1074411.3333333333, ans=0.125 2023-10-12 13:40:04,252 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.13 vs. limit=15.0 2023-10-12 13:40:11,020 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=3.79 vs. 
limit=12.0 2023-10-12 13:40:44,971 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1074551.3333333333, ans=0.125 2023-10-12 13:40:48,951 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1074551.3333333333, ans=0.125 2023-10-12 13:40:49,605 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.410e+02 1.746e+02 1.944e+02 2.115e+02 3.203e+02, threshold=3.887e+02, percent-clipped=0.0 2023-10-12 13:40:55,857 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1074598.0, ans=0.0 2023-10-12 13:41:11,795 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.07 vs. limit=22.5 2023-10-12 13:41:14,303 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1074644.6666666667, ans=0.1 2023-10-12 13:41:26,595 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1074691.3333333333, ans=0.1 2023-10-12 13:41:47,081 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.91 vs. limit=15.0 2023-10-12 13:41:49,986 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1074784.6666666667, ans=0.0 2023-10-12 13:42:41,538 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1075018.0, ans=0.015 2023-10-12 13:42:47,958 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.406e+02 1.715e+02 1.909e+02 2.108e+02 2.697e+02, threshold=3.817e+02, percent-clipped=0.0 2023-10-12 13:43:09,147 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1075111.3333333333, ans=0.125 2023-10-12 13:43:40,252 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1075251.3333333333, ans=0.2 2023-10-12 13:44:06,194 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1075344.6666666667, ans=0.125 2023-10-12 13:44:25,321 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.72 vs. limit=15.0 2023-10-12 13:44:32,638 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1075438.0, ans=0.125 2023-10-12 13:44:37,534 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1075484.6666666667, ans=0.0 2023-10-12 13:44:38,591 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.82 vs. 
limit=15.0 2023-10-12 13:44:40,168 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1075484.6666666667, ans=0.1 2023-10-12 13:44:42,745 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.393e+02 1.733e+02 1.913e+02 2.164e+02 3.183e+02, threshold=3.826e+02, percent-clipped=0.0 2023-10-12 13:44:47,902 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1075531.3333333333, ans=0.125 2023-10-12 13:44:51,340 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1075531.3333333333, ans=0.125 2023-10-12 13:44:56,094 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1075531.3333333333, ans=0.07 2023-10-12 13:44:57,585 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.67 vs. limit=10.0 2023-10-12 13:44:59,348 INFO [train.py:1031] (3/4) Epoch 17, batch 12000, loss[loss=0.199, simple_loss=0.2963, pruned_loss=0.05085, over 16883.00 frames. ], tot_loss[loss=0.1932, simple_loss=0.2842, pruned_loss=0.05106, over 32698674.32 frames. ], batch size: 116, lr: 2.03e-03, grad_scale: 32.0 2023-10-12 13:45:29,127 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1075671.3333333333, ans=0.125 2023-10-12 13:45:54,752 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1075764.6666666667, ans=0.0 2023-10-12 13:45:57,582 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1075811.3333333333, ans=0.0 2023-10-12 13:45:58,538 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1075811.3333333333, ans=0.125 2023-10-12 13:46:08,362 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1075811.3333333333, ans=0.1 2023-10-12 13:46:09,693 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1075858.0, ans=0.0 2023-10-12 13:46:32,735 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 13:46:40,003 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.504e+02 1.667e+02 1.885e+02 2.136e+02 3.356e+02, threshold=3.771e+02, percent-clipped=0.0 2023-10-12 13:46:43,634 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1075998.0, ans=0.125 2023-10-12 13:46:58,678 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1076044.6666666667, ans=0.1 2023-10-12 13:47:26,056 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1076138.0, ans=0.0 2023-10-12 13:47:47,724 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1076231.3333333333, ans=0.125 2023-10-12 13:48:06,697 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1076324.6666666667, ans=0.125 2023-10-12 13:48:19,221 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1076371.3333333333, ans=0.1 2023-10-12 13:48:32,125 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.477e+02 1.726e+02 1.841e+02 2.196e+02 3.997e+02, threshold=3.683e+02, percent-clipped=1.0 2023-10-12 13:48:33,379 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1076418.0, ans=0.125 2023-10-12 13:48:50,653 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1076511.3333333333, ans=0.1 2023-10-12 13:48:50,657 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1076511.3333333333, ans=0.125 2023-10-12 13:48:54,853 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1076558.0, ans=0.125 2023-10-12 13:49:04,354 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1076558.0, ans=0.1 2023-10-12 13:49:14,849 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.71 vs. limit=6.0 2023-10-12 13:49:16,064 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1076604.6666666667, ans=0.125 2023-10-12 13:49:20,838 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.56 vs. limit=6.0 2023-10-12 13:49:23,289 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1076651.3333333333, ans=0.125 2023-10-12 13:49:57,946 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1076791.3333333333, ans=0.2 2023-10-12 13:50:00,420 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 13:50:00,684 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.50 vs. limit=15.0 2023-10-12 13:50:08,317 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.44 vs. limit=12.0 2023-10-12 13:50:19,108 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1076884.6666666667, ans=0.2 2023-10-12 13:50:22,134 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.415e+02 1.697e+02 1.860e+02 2.117e+02 2.878e+02, threshold=3.719e+02, percent-clipped=0.0 2023-10-12 13:50:57,120 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.38 vs. limit=22.5 2023-10-12 13:50:57,230 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.47 vs. 
limit=22.5 2023-10-12 13:51:22,920 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1077118.0, ans=0.125 2023-10-12 13:51:55,683 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.05 vs. limit=15.0 2023-10-12 13:52:25,799 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.486e+02 1.778e+02 1.970e+02 2.151e+02 3.385e+02, threshold=3.939e+02, percent-clipped=0.0 2023-10-12 13:52:31,609 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 13:52:39,354 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.06 vs. limit=6.0 2023-10-12 13:52:53,374 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.91 vs. limit=12.0 2023-10-12 13:52:55,569 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1077491.3333333333, ans=0.07 2023-10-12 13:53:06,612 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1077538.0, ans=0.125 2023-10-12 13:53:41,878 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1077678.0, ans=0.0 2023-10-12 13:53:44,856 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1077678.0, ans=0.0 2023-10-12 13:53:55,467 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.32 vs. limit=22.5 2023-10-12 13:54:04,232 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1077771.3333333333, ans=0.125 2023-10-12 13:54:22,780 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.418e+02 1.732e+02 1.891e+02 2.114e+02 4.029e+02, threshold=3.783e+02, percent-clipped=1.0 2023-10-12 13:54:38,235 INFO [train.py:1031] (3/4) Epoch 17, batch 12500, loss[loss=0.2018, simple_loss=0.3002, pruned_loss=0.05173, over 16824.00 frames. ], tot_loss[loss=0.193, simple_loss=0.2839, pruned_loss=0.05106, over 32710672.42 frames. ], batch size: 146, lr: 2.02e-03, grad_scale: 16.0 2023-10-12 13:55:38,398 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1078144.6666666667, ans=0.0 2023-10-12 13:56:12,256 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1078284.6666666667, ans=0.125 2023-10-12 13:56:15,551 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.371e+02 1.795e+02 2.000e+02 2.282e+02 3.214e+02, threshold=4.001e+02, percent-clipped=0.0 2023-10-12 13:56:18,878 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.81 vs. 
limit=15.0 2023-10-12 13:56:25,643 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1078331.3333333333, ans=0.125 2023-10-12 13:56:43,508 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1078424.6666666667, ans=0.125 2023-10-12 13:56:58,539 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1078471.3333333333, ans=0.0 2023-10-12 13:56:59,925 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1078471.3333333333, ans=0.0 2023-10-12 13:57:11,517 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.28 vs. limit=22.5 2023-10-12 13:57:24,885 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1078564.6666666667, ans=0.0 2023-10-12 13:57:50,723 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1078658.0, ans=0.0 2023-10-12 13:58:04,751 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1078751.3333333333, ans=0.125 2023-10-12 13:58:05,772 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1078751.3333333333, ans=0.1 2023-10-12 13:58:11,358 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.491e+02 1.744e+02 1.905e+02 2.200e+02 3.736e+02, threshold=3.809e+02, percent-clipped=0.0 2023-10-12 13:58:15,441 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1078798.0, ans=0.05 2023-10-12 13:58:17,114 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1078798.0, ans=0.2 2023-10-12 13:58:22,497 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.18 vs. limit=10.0 2023-10-12 13:58:23,896 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1078798.0, ans=0.0 2023-10-12 13:58:25,197 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.31 vs. limit=15.0 2023-10-12 13:58:26,355 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1078844.6666666667, ans=0.2 2023-10-12 13:58:30,698 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1078844.6666666667, ans=0.07 2023-10-12 13:58:31,950 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.13 vs. limit=15.0 2023-10-12 13:58:33,543 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1078844.6666666667, ans=0.125 2023-10-12 13:58:33,943 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.05 vs. 
limit=15.0 2023-10-12 13:58:49,442 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1078938.0, ans=0.0 2023-10-12 13:58:56,955 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1078984.6666666667, ans=0.125 2023-10-12 13:58:57,060 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1078984.6666666667, ans=0.0 2023-10-12 13:59:42,158 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1079171.3333333333, ans=0.125 2023-10-12 13:59:46,163 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.26 vs. limit=10.0 2023-10-12 13:59:47,252 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.52 vs. limit=15.0 2023-10-12 13:59:50,604 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1079171.3333333333, ans=0.0 2023-10-12 13:59:51,083 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.13 vs. limit=22.5 2023-10-12 13:59:58,554 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1079218.0, ans=0.2 2023-10-12 14:00:01,651 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.417e+02 1.703e+02 1.890e+02 2.203e+02 3.691e+02, threshold=3.781e+02, percent-clipped=0.0 2023-10-12 14:00:06,912 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1079264.6666666667, ans=0.125 2023-10-12 14:00:07,886 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1079264.6666666667, ans=0.2 2023-10-12 14:00:23,103 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1079311.3333333333, ans=0.125 2023-10-12 14:00:24,598 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.42 vs. limit=6.0 2023-10-12 14:00:36,718 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1079404.6666666667, ans=0.0 2023-10-12 14:00:46,714 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.77 vs. 
limit=10.0 2023-10-12 14:01:14,382 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1079544.6666666667, ans=0.0 2023-10-12 14:01:16,088 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1079544.6666666667, ans=0.1 2023-10-12 14:01:19,643 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1079544.6666666667, ans=0.025 2023-10-12 14:01:35,395 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1079591.3333333333, ans=0.125 2023-10-12 14:01:41,028 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1079638.0, ans=0.0 2023-10-12 14:01:59,129 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1079684.6666666667, ans=0.0 2023-10-12 14:01:59,702 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.430e+02 1.722e+02 1.835e+02 2.053e+02 4.548e+02, threshold=3.669e+02, percent-clipped=1.0 2023-10-12 14:02:00,826 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1079684.6666666667, ans=0.125 2023-10-12 14:02:03,139 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1079731.3333333333, ans=0.0 2023-10-12 14:02:11,147 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1079731.3333333333, ans=0.125 2023-10-12 14:02:28,499 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1079824.6666666667, ans=0.0 2023-10-12 14:02:30,462 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1079824.6666666667, ans=0.0 2023-10-12 14:02:33,492 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.57 vs. limit=15.0 2023-10-12 14:02:42,986 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1079871.3333333333, ans=0.05 2023-10-12 14:02:43,816 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1079871.3333333333, ans=0.1 2023-10-12 14:03:09,059 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1079964.6666666667, ans=0.1 2023-10-12 14:03:44,055 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1080104.6666666667, ans=0.0 2023-10-12 14:03:53,593 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.522e+02 1.726e+02 1.851e+02 2.039e+02 2.782e+02, threshold=3.701e+02, percent-clipped=0.0 2023-10-12 14:03:55,989 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1080198.0, ans=0.125 2023-10-12 14:04:07,871 INFO [train.py:1031] (3/4) Epoch 17, batch 13000, loss[loss=0.2045, simple_loss=0.2956, pruned_loss=0.05667, over 16985.00 frames. ], tot_loss[loss=0.1934, simple_loss=0.2845, pruned_loss=0.05117, over 32699765.96 frames. 
], batch size: 123, lr: 2.02e-03, grad_scale: 16.0 2023-10-12 14:04:11,253 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1080244.6666666667, ans=0.125 2023-10-12 14:04:12,451 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.41 vs. limit=12.0 2023-10-12 14:04:33,862 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.16 vs. limit=15.0 2023-10-12 14:04:40,045 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1080338.0, ans=0.1 2023-10-12 14:04:49,927 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.48 vs. limit=12.0 2023-10-12 14:05:01,977 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1080431.3333333333, ans=0.125 2023-10-12 14:05:31,176 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1080524.6666666667, ans=0.125 2023-10-12 14:05:34,346 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-12 14:05:36,637 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.06 vs. limit=15.0 2023-10-12 14:05:48,057 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1080618.0, ans=10.0 2023-10-12 14:05:57,529 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.492e+02 1.784e+02 1.973e+02 2.241e+02 3.500e+02, threshold=3.946e+02, percent-clipped=0.0 2023-10-12 14:06:04,350 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1080664.6666666667, ans=0.1 2023-10-12 14:06:16,389 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.03 vs. 
limit=15.0 2023-10-12 14:06:46,335 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1080804.6666666667, ans=0.125 2023-10-12 14:06:56,311 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1080851.3333333333, ans=0.0 2023-10-12 14:07:24,114 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1080991.3333333333, ans=0.125 2023-10-12 14:07:57,205 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.378e+02 1.638e+02 1.827e+02 2.008e+02 2.937e+02, threshold=3.654e+02, percent-clipped=0.0 2023-10-12 14:08:08,637 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1081131.3333333333, ans=0.125 2023-10-12 14:08:09,686 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1081131.3333333333, ans=0.0 2023-10-12 14:08:37,948 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1081271.3333333333, ans=0.0 2023-10-12 14:09:00,737 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.02 vs. limit=15.0 2023-10-12 14:09:13,893 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1081411.3333333333, ans=0.125 2023-10-12 14:09:25,492 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1081458.0, ans=0.0 2023-10-12 14:09:47,706 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.29 vs. limit=22.5 2023-10-12 14:09:52,438 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.509e+02 1.744e+02 1.858e+02 2.053e+02 2.712e+02, threshold=3.717e+02, percent-clipped=0.0 2023-10-12 14:10:15,813 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1081691.3333333333, ans=0.0 2023-10-12 14:11:12,231 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1081878.0, ans=0.0 2023-10-12 14:11:20,568 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1081924.6666666667, ans=0.2 2023-10-12 14:11:21,495 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 14:11:47,581 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.280e+02 1.777e+02 1.922e+02 2.099e+02 2.846e+02, threshold=3.843e+02, percent-clipped=0.0 2023-10-12 14:11:48,841 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1082064.6666666667, ans=0.125 2023-10-12 14:11:52,396 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.54 vs. 
limit=22.5 2023-10-12 14:11:56,289 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1082064.6666666667, ans=0.125 2023-10-12 14:12:01,683 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1082111.3333333333, ans=0.125 2023-10-12 14:12:07,349 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1082111.3333333333, ans=0.125 2023-10-12 14:12:23,775 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1082204.6666666667, ans=0.0 2023-10-12 14:12:27,386 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1082204.6666666667, ans=0.125 2023-10-12 14:12:37,177 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.52 vs. limit=15.0 2023-10-12 14:12:40,236 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1082251.3333333333, ans=0.125 2023-10-12 14:13:10,481 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.24 vs. limit=10.0 2023-10-12 14:13:16,263 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1082391.3333333333, ans=0.125 2023-10-12 14:13:32,014 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1082484.6666666667, ans=0.0 2023-10-12 14:13:40,260 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.430e+02 1.709e+02 1.975e+02 2.183e+02 3.152e+02, threshold=3.950e+02, percent-clipped=0.0 2023-10-12 14:13:40,687 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1082531.3333333333, ans=0.0 2023-10-12 14:13:55,941 INFO [train.py:1031] (3/4) Epoch 17, batch 13500, loss[loss=0.1888, simple_loss=0.274, pruned_loss=0.05176, over 16013.00 frames. ], tot_loss[loss=0.1929, simple_loss=0.2837, pruned_loss=0.051, over 32726375.54 frames. ], batch size: 43, lr: 2.02e-03, grad_scale: 16.0 2023-10-12 14:14:13,361 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1082624.6666666667, ans=0.1 2023-10-12 14:14:18,215 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1082671.3333333333, ans=0.0 2023-10-12 14:14:24,147 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.30 vs. 
limit=6.0 2023-10-12 14:14:34,086 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1082718.0, ans=0.0 2023-10-12 14:14:35,200 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1082718.0, ans=0.5 2023-10-12 14:14:39,169 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1082718.0, ans=0.125 2023-10-12 14:14:43,844 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.28 vs. limit=15.0 2023-10-12 14:15:07,906 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1082858.0, ans=0.0 2023-10-12 14:15:08,867 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1082858.0, ans=0.1 2023-10-12 14:15:12,402 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1082858.0, ans=0.1 2023-10-12 14:15:32,847 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=1082951.3333333333, ans=0.5 2023-10-12 14:15:38,100 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.549e+02 1.756e+02 2.008e+02 2.458e+02 4.260e+02, threshold=4.016e+02, percent-clipped=1.0 2023-10-12 14:16:08,868 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1083091.3333333333, ans=0.07 2023-10-12 14:16:13,160 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=1083138.0, ans=0.05 2023-10-12 14:16:35,642 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1083231.3333333333, ans=0.0 2023-10-12 14:17:22,101 INFO [train.py:1031] (3/4) Epoch 18, batch 0, loss[loss=0.1529, simple_loss=0.2447, pruned_loss=0.03055, over 16881.00 frames. ], tot_loss[loss=0.1529, simple_loss=0.2447, pruned_loss=0.03055, over 16881.00 frames. ], batch size: 165, lr: 1.96e-03, grad_scale: 32.0 2023-10-12 14:17:22,102 INFO [train.py:1054] (3/4) Computing validation loss 2023-10-12 14:17:29,950 INFO [train.py:1063] (3/4) Epoch 18, validation: loss=0.2151, simple_loss=0.3024, pruned_loss=0.06384, over 1020973.00 frames. 
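Note on the loss fields in the train.py:1031/1063 entries above: the printed values are consistent with the per-batch loss being a fixed weighted sum of the two transducer terms, loss = 0.5 * simple_loss + pruned_loss. For the epoch 18 numbers just above, 0.5 * 0.2447 + 0.03055 = 0.1529 and 0.5 * 0.3024 + 0.06384 = 0.2150, matching the logged 0.1529 and 0.2151 to display precision; the running tot_loss fields obey the same relation. A minimal sketch of that bookkeeping, with the 0.5 weight inferred from these numbers (the helper name is hypothetical, not a function in this recipe):

```python
# Hypothetical helper: reproduces how the logged "loss" appears to be
# assembled from "simple_loss" and "pruned_loss" in the train.py entries.
def combined_loss(simple_loss: float, pruned_loss: float,
                  simple_loss_scale: float = 0.5) -> float:
    # Weighted sum inferred from the printed values; the 0.5 weight is an
    # assumption read off the numbers, not out of the training code.
    return simple_loss_scale * simple_loss + pruned_loss

# Epoch 18, batch 0 above: 0.5 * 0.2447 + 0.03055 = 0.1529
assert abs(combined_loss(0.2447, 0.03055) - 0.1529) < 5e-4
# Epoch 18 validation above: 0.5 * 0.3024 + 0.06384 = 0.2150 (logged 0.2151)
assert abs(combined_loss(0.3024, 0.06384) - 0.2151) < 5e-4
```

The same identity holds earlier in the section, e.g. at epoch 17, batch 11000: 0.5 * 0.2845 + 0.0513 = 0.19355 against the logged tot_loss of 0.1935.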
2023-10-12 14:17:29,950 INFO [train.py:1064] (3/4) Maximum memory allocated so far is 16891MB
2023-10-12 14:17:54,402 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1083394.6666666667, ans=0.035
2023-10-12 14:18:11,915 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.392e+02 1.679e+02 1.904e+02 2.260e+02 3.526e+02, threshold=3.807e+02, percent-clipped=0.0
2023-10-12 14:18:49,505 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1083581.3333333333, ans=0.125
2023-10-12 14:19:01,536 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1083628.0, ans=0.125
2023-10-12 14:19:07,580 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1083674.6666666667, ans=0.04949747468305833
2023-10-12 14:19:22,673 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1083721.3333333333, ans=0.2
2023-10-12 14:19:23,679 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-12 14:19:29,879 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=6.00 vs. limit=15.0
2023-10-12 14:19:32,116 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1083768.0, ans=0.1
2023-10-12 14:19:39,414 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1083814.6666666667, ans=0.125
2023-10-12 14:19:47,712 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.86 vs. limit=15.0
2023-10-12 14:19:57,771 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1083861.3333333333, ans=0.1
2023-10-12 14:20:06,569 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.339e+02 1.651e+02 1.848e+02 2.026e+02 2.543e+02, threshold=3.697e+02, percent-clipped=0.0
2023-10-12 14:20:24,343 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.51 vs. limit=15.0
2023-10-12 14:20:28,356 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1084001.3333333333, ans=0.1
2023-10-12 14:20:43,976 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1084048.0, ans=0.0
2023-10-12 14:20:50,006 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-12 14:21:00,449 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1084094.6666666667, ans=0.125
2023-10-12 14:21:03,308 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1084141.3333333333, ans=0.125
2023-10-12 14:21:04,819 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1084141.3333333333, ans=0.1
2023-10-12 14:21:16,386 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1084188.0, ans=0.125
2023-10-12 14:21:58,324 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1084374.6666666667, ans=0.05
2023-10-12 14:22:01,901 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.471e+02 1.745e+02 1.941e+02 2.216e+02 3.290e+02, threshold=3.881e+02, percent-clipped=0.0
2023-10-12 14:22:15,090 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1084421.3333333333, ans=0.2
2023-10-12 14:22:15,153 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1084421.3333333333, ans=0.125
2023-10-12 14:22:16,465 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1084421.3333333333, ans=0.125
2023-10-12 14:22:46,993 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1084561.3333333333, ans=0.2
2023-10-12 14:22:59,525 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2023-10-12 14:23:04,949 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1084608.0, ans=0.1
2023-10-12 14:23:07,384 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.38 vs. limit=15.0
2023-10-12 14:23:16,749 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-12 14:23:32,414 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.70 vs. limit=15.0
2023-10-12 14:23:58,014 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.383e+02 1.717e+02 1.944e+02 2.111e+02 2.641e+02, threshold=3.888e+02, percent-clipped=0.0
2023-10-12 14:24:11,973 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1084888.0, ans=0.125
2023-10-12 14:24:23,997 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.73 vs. limit=15.0
2023-10-12 14:24:25,898 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.65 vs. limit=15.0
2023-10-12 14:24:53,436 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.55 vs. limit=6.0
2023-10-12 14:25:10,316 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1085121.3333333333, ans=0.125
2023-10-12 14:25:13,169 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.60 vs. limit=22.5
2023-10-12 14:25:14,761 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-12 14:25:18,821 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1085168.0, ans=0.125
2023-10-12 14:25:22,554 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1085214.6666666667, ans=0.125
2023-10-12 14:25:38,498 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1085261.3333333333, ans=0.1
2023-10-12 14:25:44,760 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.75 vs. limit=6.0
2023-10-12 14:25:50,522 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.380e+02 1.816e+02 1.984e+02 2.210e+02 3.172e+02, threshold=3.969e+02, percent-clipped=0.0
2023-10-12 14:26:15,306 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1085401.3333333333, ans=0.0
2023-10-12 14:26:29,311 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1085448.0, ans=0.125
2023-10-12 14:26:29,540 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.50 vs. limit=22.5
2023-10-12 14:26:30,284 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1085448.0, ans=0.125
2023-10-12 14:26:55,726 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1085541.3333333333, ans=0.2
2023-10-12 14:27:08,133 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1085588.0, ans=0.2
2023-10-12 14:27:10,586 INFO [train.py:1031] (3/4) Epoch 18, batch 500, loss[loss=0.1769, simple_loss=0.2743, pruned_loss=0.03971, over 16955.00 frames. ], tot_loss[loss=0.1922, simple_loss=0.2833, pruned_loss=0.05053, over 7295608.47 frames. ], batch size: 93, lr: 1.96e-03, grad_scale: 32.0
2023-10-12 14:27:11,456 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1085634.6666666667, ans=0.1
2023-10-12 14:27:12,234 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1085634.6666666667, ans=0.2
2023-10-12 14:27:25,309 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1085681.3333333333, ans=0.125
2023-10-12 14:27:31,173 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1085681.3333333333, ans=0.125
2023-10-12 14:27:37,816 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1085728.0, ans=0.0
2023-10-12 14:27:40,814 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=10.11 vs. limit=22.5
2023-10-12 14:27:45,385 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1085774.6666666667, ans=0.0
2023-10-12 14:27:48,622 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.425e+02 1.763e+02 1.930e+02 2.167e+02 2.962e+02, threshold=3.859e+02, percent-clipped=0.0
2023-10-12 14:27:49,979 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1085774.6666666667, ans=0.2
2023-10-12 14:28:03,641 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-12 14:28:04,744 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1085821.3333333333, ans=0.125
2023-10-12 14:28:17,847 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1085914.6666666667, ans=0.0
2023-10-12 14:28:21,471 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1085914.6666666667, ans=0.2
2023-10-12 14:29:00,363 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1086054.6666666667, ans=0.2
2023-10-12 14:29:18,799 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1086148.0, ans=0.125
2023-10-12 14:29:23,711 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1086194.6666666667, ans=0.0
2023-10-12 14:29:29,420 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1086194.6666666667, ans=0.125
2023-10-12 14:29:40,208 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.481e+02 1.805e+02 1.989e+02 2.229e+02 3.113e+02, threshold=3.977e+02, percent-clipped=0.0
2023-10-12 14:29:42,676 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1086241.3333333333, ans=0.125
2023-10-12 14:29:52,335 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1086288.0, ans=0.1
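Note on the recurring optim.py:471 entries, such as the 14:29:40,208 line above: the five "grad-norm quartiles" values appear to be the minimum, 25th percentile, median, 75th percentile and maximum of recently observed whole-model gradient norms, and the printed threshold tracks Clipping_scale times the median (2.0 * 1.989e+02 = 3.978e+02 against threshold=3.977e+02 here). A small, self-contained sketch of this style of median-based clipping; the sliding-window bookkeeping and all names are illustrative assumptions, not the optimizer's actual code, and the one relation taken from the log is threshold = scale * median:

```python
# Illustrative sketch of median-based gradient clipping that would yield
# lines like "Clipping_scale=2.0, grad-norm quartiles ... threshold=...".
# Assumed: a fixed window of recent norms; window size chosen arbitrarily.
from typing import List
import torch

def clip_gradients(params, norm_history: List[float],
                   clipping_scale: float = 2.0, window: int = 128) -> bool:
    grads = [p.grad for p in params if p.grad is not None]
    total_norm = torch.norm(torch.stack([g.detach().norm() for g in grads]))
    norm_history.append(total_norm.item())
    recent = torch.tensor(norm_history[-window:])
    # Five summary points: min, 25%, median, 75%, max, as printed in the log.
    q = torch.quantile(recent, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    threshold = clipping_scale * q[2].item()  # scale * median
    clipped = total_norm.item() > threshold
    if clipped:
        # Rescale all gradients so the global norm drops to the threshold.
        for g in grads:
            g.mul_(threshold / total_norm.item())
    return clipped
```

Aggregated over a reporting interval, the share of batches for which such a routine returns True would correspond to the percent-clipped field, which is 0.0 on most reports in this section and 1.0 only on the occasional report whose maximum norm exceeds the threshold (e.g. 4.548e+02 against 3.669e+02 at 14:01:59 above).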
2023-10-12 14:29:52,419 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1086288.0, ans=0.0 2023-10-12 14:29:54,605 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.99 vs. limit=6.0 2023-10-12 14:30:15,309 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.63 vs. limit=15.0 2023-10-12 14:30:22,742 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.02 vs. limit=22.5 2023-10-12 14:30:40,428 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1086521.3333333333, ans=0.125 2023-10-12 14:30:54,852 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1086568.0, ans=0.125 2023-10-12 14:30:57,982 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1086568.0, ans=0.125 2023-10-12 14:31:22,363 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1086708.0, ans=0.125 2023-10-12 14:31:25,934 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1086708.0, ans=0.125 2023-10-12 14:31:26,592 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.492e+02 1.752e+02 1.928e+02 2.178e+02 3.659e+02, threshold=3.855e+02, percent-clipped=0.0 2023-10-12 14:31:27,092 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1086708.0, ans=0.2 2023-10-12 14:31:51,280 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1086801.3333333333, ans=0.0 2023-10-12 14:32:17,155 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1086894.6666666667, ans=0.2 2023-10-12 14:32:24,693 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1086941.3333333333, ans=0.0 2023-10-12 14:32:26,790 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.92 vs. limit=22.5 2023-10-12 14:32:27,905 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.23 vs. limit=22.5 2023-10-12 14:32:47,470 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.65 vs. 
limit=10.0 2023-10-12 14:32:54,744 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1087081.3333333333, ans=0.0 2023-10-12 14:33:25,516 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1087174.6666666667, ans=0.0 2023-10-12 14:33:29,543 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.453e+02 1.777e+02 1.924e+02 2.195e+02 3.258e+02, threshold=3.848e+02, percent-clipped=0.0 2023-10-12 14:33:34,015 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.48 vs. limit=15.0 2023-10-12 14:33:42,043 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1087221.3333333333, ans=0.09899494936611666 2023-10-12 14:33:45,714 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=9.85 vs. limit=22.5 2023-10-12 14:33:49,666 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.54 vs. limit=15.0 2023-10-12 14:34:10,016 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1087361.3333333333, ans=0.125 2023-10-12 14:34:18,511 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1087408.0, ans=0.125 2023-10-12 14:34:24,525 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.65 vs. limit=15.0 2023-10-12 14:34:26,407 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.62 vs. limit=15.0 2023-10-12 14:34:46,375 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1087501.3333333333, ans=0.125 2023-10-12 14:34:53,865 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=1087548.0, ans=10.0 2023-10-12 14:34:53,933 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1087548.0, ans=0.125 2023-10-12 14:35:08,886 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1087594.6666666667, ans=0.125 2023-10-12 14:35:18,099 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.509e+02 1.711e+02 1.880e+02 2.052e+02 2.877e+02, threshold=3.760e+02, percent-clipped=0.0 2023-10-12 14:35:38,037 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.00 vs. 
limit=15.0 2023-10-12 14:36:05,516 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1087828.0, ans=0.2 2023-10-12 14:36:11,137 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1087874.6666666667, ans=0.0 2023-10-12 14:36:21,883 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1087921.3333333333, ans=0.125 2023-10-12 14:36:29,094 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1087921.3333333333, ans=0.0 2023-10-12 14:36:30,773 INFO [train.py:1031] (3/4) Epoch 18, batch 1000, loss[loss=0.1821, simple_loss=0.2807, pruned_loss=0.04171, over 16618.00 frames. ], tot_loss[loss=0.1936, simple_loss=0.2846, pruned_loss=0.05132, over 12929484.32 frames. ], batch size: 241, lr: 1.96e-03, grad_scale: 16.0 2023-10-12 14:36:52,514 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1088061.3333333333, ans=0.2 2023-10-12 14:37:09,008 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.437e+02 1.670e+02 1.852e+02 2.080e+02 2.940e+02, threshold=3.705e+02, percent-clipped=0.0 2023-10-12 14:37:11,074 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1088108.0, ans=0.0 2023-10-12 14:37:13,697 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 14:37:25,537 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.35 vs. limit=15.0 2023-10-12 14:37:29,695 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1088201.3333333333, ans=0.125 2023-10-12 14:37:42,365 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1088294.6666666667, ans=0.0 2023-10-12 14:37:59,468 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1088341.3333333333, ans=0.125 2023-10-12 14:38:10,275 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1088388.0, ans=0.125 2023-10-12 14:38:24,388 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.25 vs. 
limit=15.0 2023-10-12 14:38:43,579 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1088528.0, ans=0.125 2023-10-12 14:38:59,217 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.379e+02 1.774e+02 1.916e+02 2.114e+02 3.248e+02, threshold=3.832e+02, percent-clipped=0.0 2023-10-12 14:39:11,004 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1088621.3333333333, ans=0.125 2023-10-12 14:39:11,062 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1088621.3333333333, ans=0.0 2023-10-12 14:39:17,967 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1088668.0, ans=0.0 2023-10-12 14:39:29,862 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.10 vs. limit=15.0 2023-10-12 14:39:52,659 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.06 vs. limit=22.5 2023-10-12 14:39:53,872 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1088761.3333333333, ans=0.0 2023-10-12 14:40:00,427 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1088808.0, ans=0.0 2023-10-12 14:40:37,629 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1088948.0, ans=0.125 2023-10-12 14:40:46,438 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1088994.6666666667, ans=0.125 2023-10-12 14:40:46,449 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1088994.6666666667, ans=0.125 2023-10-12 14:40:58,854 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.399e+02 1.687e+02 1.838e+02 2.024e+02 2.738e+02, threshold=3.677e+02, percent-clipped=0.0 2023-10-12 14:41:21,976 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1089134.6666666667, ans=0.0 2023-10-12 14:41:29,042 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.89 vs. limit=15.0 2023-10-12 14:41:30,827 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1089181.3333333333, ans=0.125 2023-10-12 14:41:31,079 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.70 vs. limit=22.5 2023-10-12 14:41:40,381 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1089228.0, ans=0.0 2023-10-12 14:41:48,642 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.15 vs. 
limit=15.0 2023-10-12 14:41:50,897 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1089274.6666666667, ans=0.0 2023-10-12 14:41:54,557 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1089274.6666666667, ans=0.125 2023-10-12 14:41:54,815 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.45 vs. limit=15.0 2023-10-12 14:42:23,307 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.76 vs. limit=22.5 2023-10-12 14:42:28,899 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1089414.6666666667, ans=0.125 2023-10-12 14:42:45,152 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1089508.0, ans=0.125 2023-10-12 14:42:50,654 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.411e+02 1.708e+02 1.868e+02 1.993e+02 2.854e+02, threshold=3.735e+02, percent-clipped=0.0 2023-10-12 14:42:52,158 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=3.86 vs. limit=15.0 2023-10-12 14:42:57,867 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn2.whiten.whitening_limit, batch_count=1089554.6666666667, ans=22.5 2023-10-12 14:42:59,228 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1089554.6666666667, ans=0.125 2023-10-12 14:43:34,182 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1089694.6666666667, ans=0.025 2023-10-12 14:43:43,015 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1089741.3333333333, ans=0.125 2023-10-12 14:43:48,190 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1089788.0, ans=0.125 2023-10-12 14:43:55,491 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1089788.0, ans=0.2 2023-10-12 14:43:58,021 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1089834.6666666667, ans=0.0 2023-10-12 14:44:16,978 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1089881.3333333333, ans=0.5 2023-10-12 14:44:19,161 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.40 vs. 
limit=15.0 2023-10-12 14:44:40,781 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.359e+02 1.697e+02 1.832e+02 2.032e+02 3.557e+02, threshold=3.665e+02, percent-clipped=0.0 2023-10-12 14:44:50,344 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 14:44:58,421 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1090068.0, ans=0.0 2023-10-12 14:44:58,520 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.02 vs. limit=15.0 2023-10-12 14:45:10,290 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1090114.6666666667, ans=0.2 2023-10-12 14:45:40,683 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1090254.6666666667, ans=0.125 2023-10-12 14:45:45,097 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.69 vs. limit=6.0 2023-10-12 14:45:45,287 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.80 vs. limit=15.0 2023-10-12 14:45:52,498 INFO [train.py:1031] (3/4) Epoch 18, batch 1500, loss[loss=0.1731, simple_loss=0.2607, pruned_loss=0.04277, over 16065.00 frames. ], tot_loss[loss=0.1921, simple_loss=0.2829, pruned_loss=0.0506, over 17307054.46 frames. ], batch size: 43, lr: 1.95e-03, grad_scale: 16.0 2023-10-12 14:45:52,745 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1090301.3333333333, ans=0.1 2023-10-12 14:46:10,914 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1090348.0, ans=0.2 2023-10-12 14:46:13,409 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1090394.6666666667, ans=0.125 2023-10-12 14:46:16,382 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1090394.6666666667, ans=0.0 2023-10-12 14:46:24,287 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.88 vs. 
limit=6.0 2023-10-12 14:46:30,829 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1090441.3333333333, ans=0.0 2023-10-12 14:46:33,303 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.486e+02 1.715e+02 1.909e+02 2.124e+02 2.625e+02, threshold=3.819e+02, percent-clipped=0.0 2023-10-12 14:46:46,268 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1090534.6666666667, ans=0.125 2023-10-12 14:46:53,361 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1090534.6666666667, ans=0.125 2023-10-12 14:47:05,561 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1090581.3333333333, ans=0.1 2023-10-12 14:47:27,296 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.44 vs. limit=15.0 2023-10-12 14:47:38,978 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1090721.3333333333, ans=0.0 2023-10-12 14:47:49,378 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.60 vs. limit=15.0 2023-10-12 14:47:54,804 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1090814.6666666667, ans=0.125 2023-10-12 14:48:06,197 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1090814.6666666667, ans=0.2 2023-10-12 14:48:06,344 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1090814.6666666667, ans=0.125 2023-10-12 14:48:10,593 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1090861.3333333333, ans=0.125 2023-10-12 14:48:10,693 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1090861.3333333333, ans=0.0 2023-10-12 14:48:11,403 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1090861.3333333333, ans=0.0 2023-10-12 14:48:12,413 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1090861.3333333333, ans=0.0 2023-10-12 14:48:15,550 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.26 vs. 
limit=22.5 2023-10-12 14:48:19,127 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1090908.0, ans=0.125 2023-10-12 14:48:26,215 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1090908.0, ans=0.125 2023-10-12 14:48:26,954 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.501e+02 1.774e+02 1.886e+02 2.156e+02 3.073e+02, threshold=3.772e+02, percent-clipped=0.0 2023-10-12 14:48:53,545 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1091001.3333333333, ans=0.0 2023-10-12 14:48:57,029 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.15 vs. limit=22.5 2023-10-12 14:48:57,893 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.61 vs. limit=6.0 2023-10-12 14:49:06,702 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1091048.0, ans=0.125 2023-10-12 14:49:08,028 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.46 vs. limit=22.5 2023-10-12 14:49:16,688 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.71 vs. limit=15.0 2023-10-12 14:49:25,729 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.17 vs. limit=15.0 2023-10-12 14:49:39,042 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1091188.0, ans=0.0 2023-10-12 14:49:43,878 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=1091234.6666666667, ans=22.5 2023-10-12 14:49:43,878 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.52 vs. 
limit=22.5 2023-10-12 14:49:56,711 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1091281.3333333333, ans=0.0 2023-10-12 14:50:17,072 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1091374.6666666667, ans=0.1 2023-10-12 14:50:19,502 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.433e+02 1.793e+02 2.008e+02 2.390e+02 3.613e+02, threshold=4.016e+02, percent-clipped=0.0 2023-10-12 14:50:38,958 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1091468.0, ans=0.0 2023-10-12 14:51:03,192 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1091561.3333333333, ans=0.125 2023-10-12 14:51:08,112 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1091608.0, ans=0.125 2023-10-12 14:51:17,387 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1091608.0, ans=0.125 2023-10-12 14:51:20,596 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.53 vs. limit=12.0 2023-10-12 14:51:25,056 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1091654.6666666667, ans=0.0 2023-10-12 14:51:29,515 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1091654.6666666667, ans=0.125 2023-10-12 14:51:38,234 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1091701.3333333333, ans=0.1 2023-10-12 14:51:38,359 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_positive, batch_count=1091701.3333333333, ans=0.05 2023-10-12 14:52:10,349 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1091841.3333333333, ans=0.1 2023-10-12 14:52:10,356 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1091841.3333333333, ans=0.125 2023-10-12 14:52:12,840 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.493e+02 1.722e+02 1.895e+02 2.159e+02 2.986e+02, threshold=3.789e+02, percent-clipped=0.0 2023-10-12 14:52:20,543 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1091888.0, ans=0.1 2023-10-12 14:52:26,763 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1091934.6666666667, ans=0.0 2023-10-12 14:52:51,146 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1092028.0, ans=0.09899494936611666 2023-10-12 14:52:54,898 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1092028.0, ans=0.1 2023-10-12 14:53:07,758 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1092121.3333333333, ans=0.125 2023-10-12 14:53:14,114 INFO 
[scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.32 vs. limit=15.0 2023-10-12 14:53:15,059 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1092121.3333333333, ans=0.0 2023-10-12 14:53:37,199 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.02 vs. limit=15.0 2023-10-12 14:53:59,846 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1092308.0, ans=0.125 2023-10-12 14:54:05,311 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.22 vs. limit=15.0 2023-10-12 14:54:07,151 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.422e+02 1.753e+02 1.939e+02 2.143e+02 3.142e+02, threshold=3.878e+02, percent-clipped=0.0 2023-10-12 14:54:26,145 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1092401.3333333333, ans=0.0 2023-10-12 14:54:34,247 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1092401.3333333333, ans=0.0 2023-10-12 14:54:36,883 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.29 vs. limit=12.0 2023-10-12 14:54:42,435 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.32 vs. limit=6.0 2023-10-12 14:55:04,097 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.93 vs. limit=15.0 2023-10-12 14:55:12,123 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1092541.3333333333, ans=0.2 2023-10-12 14:55:16,554 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1092588.0, ans=0.05 2023-10-12 14:55:28,263 INFO [train.py:1031] (3/4) Epoch 18, batch 2000, loss[loss=0.1729, simple_loss=0.2747, pruned_loss=0.03556, over 16891.00 frames. ], tot_loss[loss=0.1932, simple_loss=0.284, pruned_loss=0.05118, over 20705510.09 frames. 
], batch size: 72, lr: 1.95e-03, grad_scale: 32.0 2023-10-12 14:55:39,500 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1092681.3333333333, ans=0.125 2023-10-12 14:55:48,838 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1092681.3333333333, ans=0.0 2023-10-12 14:56:14,670 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1092774.6666666667, ans=0.1 2023-10-12 14:56:15,811 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1092774.6666666667, ans=0.1 2023-10-12 14:56:19,464 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.464e+02 1.736e+02 1.889e+02 2.096e+02 2.651e+02, threshold=3.779e+02, percent-clipped=0.0 2023-10-12 14:56:46,994 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.38 vs. limit=22.5 2023-10-12 14:57:06,676 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1092961.3333333333, ans=0.1 2023-10-12 14:57:14,181 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.36 vs. limit=15.0 2023-10-12 14:57:24,408 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1093054.6666666667, ans=0.125 2023-10-12 14:57:40,928 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1093101.3333333333, ans=0.0 2023-10-12 14:57:44,988 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1093101.3333333333, ans=0.125 2023-10-12 14:58:02,559 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1093148.0, ans=0.125 2023-10-12 14:58:36,759 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.412e+02 1.716e+02 1.851e+02 2.138e+02 3.189e+02, threshold=3.703e+02, percent-clipped=0.0 2023-10-12 14:58:55,890 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.16 vs. limit=15.0 2023-10-12 14:59:05,482 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.67 vs. 
limit=15.0 2023-10-12 14:59:08,128 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1093381.3333333333, ans=0.1 2023-10-12 14:59:15,171 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1093381.3333333333, ans=0.0 2023-10-12 14:59:25,287 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1093428.0, ans=0.125 2023-10-12 14:59:27,789 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1093474.6666666667, ans=0.1 2023-10-12 14:59:37,343 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 14:59:41,106 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.71 vs. limit=15.0 2023-10-12 14:59:41,688 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1093521.3333333333, ans=0.125 2023-10-12 14:59:53,583 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-12 14:59:57,661 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.08 vs. limit=10.0 2023-10-12 15:00:01,255 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1093614.6666666667, ans=0.1 2023-10-12 15:00:12,094 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1093614.6666666667, ans=0.125 2023-10-12 15:00:13,045 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1093661.3333333333, ans=0.125 2023-10-12 15:00:18,004 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.17 vs. 
limit=15.0 2023-10-12 15:00:30,269 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1093708.0, ans=0.1 2023-10-12 15:00:33,400 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.469e+02 1.770e+02 1.918e+02 2.258e+02 2.909e+02, threshold=3.835e+02, percent-clipped=0.0 2023-10-12 15:00:33,764 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1093708.0, ans=0.125 2023-10-12 15:00:34,556 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1093754.6666666667, ans=0.05 2023-10-12 15:00:38,196 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1093754.6666666667, ans=0.125 2023-10-12 15:00:38,229 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1093754.6666666667, ans=0.0 2023-10-12 15:00:47,455 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1093801.3333333333, ans=0.0 2023-10-12 15:01:07,734 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1093894.6666666667, ans=0.0 2023-10-12 15:01:19,652 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1093941.3333333333, ans=0.125 2023-10-12 15:01:43,885 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.35 vs. limit=15.0 2023-10-12 15:02:09,764 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1094128.0, ans=0.125 2023-10-12 15:02:15,186 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1094174.6666666667, ans=0.1 2023-10-12 15:02:21,123 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.469e+02 1.758e+02 1.953e+02 2.167e+02 4.263e+02, threshold=3.905e+02, percent-clipped=1.0 2023-10-12 15:02:28,509 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1094221.3333333333, ans=0.1 2023-10-12 15:02:50,932 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=9.99 vs. limit=22.5 2023-10-12 15:03:01,759 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.78 vs. 
limit=12.0 2023-10-12 15:03:09,589 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1094408.0, ans=0.1 2023-10-12 15:03:12,644 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1094408.0, ans=0.125 2023-10-12 15:03:39,149 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1094501.3333333333, ans=0.0 2023-10-12 15:03:46,257 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1094548.0, ans=0.1 2023-10-12 15:03:48,156 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1094548.0, ans=0.125 2023-10-12 15:03:58,077 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=5.83 vs. limit=15.0 2023-10-12 15:04:10,957 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.468e+02 1.781e+02 1.923e+02 2.146e+02 3.420e+02, threshold=3.845e+02, percent-clipped=0.0 2023-10-12 15:04:12,113 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1094688.0, ans=0.1 2023-10-12 15:04:14,482 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.69 vs. limit=5.0 2023-10-12 15:04:23,678 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1094734.6666666667, ans=0.125 2023-10-12 15:04:25,439 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.67 vs. limit=15.0 2023-10-12 15:04:26,178 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1094734.6666666667, ans=0.125 2023-10-12 15:04:29,918 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.57 vs. limit=12.0 2023-10-12 15:04:36,583 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1094781.3333333333, ans=0.125 2023-10-12 15:04:37,691 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.87 vs. limit=22.5 2023-10-12 15:04:47,361 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1.whitening_limit, batch_count=1094828.0, ans=10.0 2023-10-12 15:04:48,006 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1094828.0, ans=0.125 2023-10-12 15:04:53,394 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.min_positive, batch_count=1094828.0, ans=0.025 2023-10-12 15:04:56,793 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1094874.6666666667, ans=0.0 2023-10-12 15:05:15,057 INFO [train.py:1031] (3/4) Epoch 18, batch 2500, loss[loss=0.2054, simple_loss=0.2861, pruned_loss=0.06234, over 16048.00 frames. 
], tot_loss[loss=0.1933, simple_loss=0.2842, pruned_loss=0.05123, over 23388842.12 frames. ], batch size: 43, lr: 1.95e-03, grad_scale: 16.0 2023-10-12 15:05:18,964 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.32 vs. limit=22.5 2023-10-12 15:05:24,818 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.11 vs. limit=12.0 2023-10-12 15:05:33,673 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1095014.6666666667, ans=0.125 2023-10-12 15:05:35,786 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.97 vs. limit=15.0 2023-10-12 15:05:35,859 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.59 vs. limit=10.0 2023-10-12 15:05:49,551 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1095108.0, ans=0.09899494936611666 2023-10-12 15:05:50,645 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.98 vs. limit=6.0 2023-10-12 15:05:57,267 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.326e+02 1.823e+02 1.958e+02 2.161e+02 2.761e+02, threshold=3.915e+02, percent-clipped=0.0 2023-10-12 15:06:10,163 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1095201.3333333333, ans=0.125 2023-10-12 15:06:14,793 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1095201.3333333333, ans=0.04949747468305833 2023-10-12 15:06:17,877 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1095248.0, ans=0.0 2023-10-12 15:06:24,734 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1095248.0, ans=0.09899494936611666 2023-10-12 15:06:25,701 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1095248.0, ans=0.2 2023-10-12 15:06:27,145 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1095248.0, ans=0.2 2023-10-12 15:06:38,741 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1095341.3333333333, ans=0.1 2023-10-12 15:06:55,955 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=3.90 vs. 
limit=15.0 2023-10-12 15:07:01,781 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1095434.6666666667, ans=0.0 2023-10-12 15:07:14,895 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1095481.3333333333, ans=0.125 2023-10-12 15:07:23,916 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1095481.3333333333, ans=0.125 2023-10-12 15:07:32,491 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.60 vs. limit=5.0 2023-10-12 15:07:32,946 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1095528.0, ans=0.0 2023-10-12 15:07:47,167 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.484e+02 1.787e+02 1.977e+02 2.203e+02 3.425e+02, threshold=3.954e+02, percent-clipped=0.0 2023-10-12 15:08:21,992 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 15:08:26,775 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 15:09:17,644 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1095994.6666666667, ans=0.125 2023-10-12 15:09:22,004 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1095994.6666666667, ans=0.2 2023-10-12 15:09:39,282 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.450e+02 1.720e+02 1.848e+02 2.077e+02 3.247e+02, threshold=3.696e+02, percent-clipped=0.0 2023-10-12 15:10:02,592 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1096134.6666666667, ans=0.0 2023-10-12 15:10:22,153 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.37 vs. limit=15.0 2023-10-12 15:10:25,942 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.40 vs. limit=6.0 2023-10-12 15:10:32,157 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1096274.6666666667, ans=0.0 2023-10-12 15:10:47,828 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1096321.3333333333, ans=10.0 2023-10-12 15:10:51,324 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.23 vs. 
limit=15.0 2023-10-12 15:11:06,109 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1096414.6666666667, ans=0.125 2023-10-12 15:11:11,675 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1096414.6666666667, ans=0.1 2023-10-12 15:11:40,792 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.395e+02 1.692e+02 1.857e+02 2.098e+02 2.936e+02, threshold=3.715e+02, percent-clipped=0.0 2023-10-12 15:11:51,406 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1096601.3333333333, ans=0.125 2023-10-12 15:11:56,423 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1096601.3333333333, ans=0.0 2023-10-12 15:11:56,574 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.29 vs. limit=12.0 2023-10-12 15:12:30,758 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1096741.3333333333, ans=0.1 2023-10-12 15:12:36,922 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.71 vs. limit=10.0 2023-10-12 15:12:41,063 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1096741.3333333333, ans=0.125 2023-10-12 15:12:45,433 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.40 vs. limit=22.5 2023-10-12 15:12:52,095 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.01 vs. limit=22.5 2023-10-12 15:13:10,800 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1096881.3333333333, ans=0.125 2023-10-12 15:13:25,803 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1096928.0, ans=0.125 2023-10-12 15:13:40,793 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1096974.6666666667, ans=0.125 2023-10-12 15:13:42,282 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.388e+02 1.685e+02 1.929e+02 2.097e+02 2.599e+02, threshold=3.858e+02, percent-clipped=0.0 2023-10-12 15:13:54,331 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.35 vs. limit=22.5 2023-10-12 15:14:17,909 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1097161.3333333333, ans=0.025 2023-10-12 15:14:44,974 INFO [train.py:1031] (3/4) Epoch 18, batch 3000, loss[loss=0.19, simple_loss=0.2795, pruned_loss=0.05024, over 16895.00 frames. ], tot_loss[loss=0.1929, simple_loss=0.2834, pruned_loss=0.05115, over 25451327.36 frames. 
], batch size: 130, lr: 1.95e-03, grad_scale: 16.0 2023-10-12 15:14:46,224 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1097301.3333333333, ans=0.1 2023-10-12 15:14:46,390 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.96 vs. limit=22.5 2023-10-12 15:14:53,367 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.80 vs. limit=6.0 2023-10-12 15:14:59,547 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1097348.0, ans=0.0 2023-10-12 15:14:59,580 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1097348.0, ans=0.1 2023-10-12 15:15:05,095 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1097348.0, ans=0.125 2023-10-12 15:15:27,754 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.456e+02 1.735e+02 1.860e+02 2.086e+02 2.719e+02, threshold=3.719e+02, percent-clipped=0.0 2023-10-12 15:15:32,301 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1097488.0, ans=0.125 2023-10-12 15:15:38,018 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1097534.6666666667, ans=0.125 2023-10-12 15:15:54,744 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1097581.3333333333, ans=0.07 2023-10-12 15:16:03,679 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1097628.0, ans=0.125 2023-10-12 15:16:20,547 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1097674.6666666667, ans=0.0 2023-10-12 15:16:55,613 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1097814.6666666667, ans=0.125 2023-10-12 15:16:59,339 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1097814.6666666667, ans=0.125 2023-10-12 15:17:12,471 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1097861.3333333333, ans=0.0 2023-10-12 15:17:23,967 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 15:17:27,324 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.374e+02 1.718e+02 1.888e+02 2.090e+02 3.235e+02, threshold=3.777e+02, percent-clipped=0.0 2023-10-12 15:17:45,482 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1098001.3333333333, ans=0.0 2023-10-12 15:18:06,340 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1098094.6666666667, ans=0.0 2023-10-12 15:18:15,412 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1098141.3333333333, ans=0.125 2023-10-12 15:18:15,519 INFO [scaling.py:199] (3/4) 
ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1098141.3333333333, ans=0.125 2023-10-12 15:18:34,619 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.90 vs. limit=6.0 2023-10-12 15:18:49,237 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1098281.3333333333, ans=0.125 2023-10-12 15:19:20,448 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1098374.6666666667, ans=0.2 2023-10-12 15:19:23,027 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.459e+02 1.754e+02 1.912e+02 2.118e+02 2.880e+02, threshold=3.825e+02, percent-clipped=0.0 2023-10-12 15:19:28,811 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.33 vs. limit=15.0 2023-10-12 15:19:34,563 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1098468.0, ans=0.125 2023-10-12 15:19:35,013 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.49 vs. limit=12.0 2023-10-12 15:19:36,066 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1098468.0, ans=0.125 2023-10-12 15:19:44,464 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1098468.0, ans=0.0 2023-10-12 15:20:02,084 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1098561.3333333333, ans=0.0 2023-10-12 15:20:09,985 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.52 vs. limit=15.0 2023-10-12 15:20:17,694 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.66 vs. limit=12.0 2023-10-12 15:20:24,834 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1098654.6666666667, ans=0.125 2023-10-12 15:20:38,302 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1098701.3333333333, ans=0.09899494936611666 2023-10-12 15:20:45,246 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.95 vs. 
limit=15.0 2023-10-12 15:20:45,891 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1098748.0, ans=0.0 2023-10-12 15:20:57,551 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 15:21:10,474 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1098841.3333333333, ans=0.07 2023-10-12 15:21:14,750 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1098841.3333333333, ans=0.0 2023-10-12 15:21:17,470 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.426e+02 1.805e+02 2.027e+02 2.270e+02 3.294e+02, threshold=4.055e+02, percent-clipped=0.0 2023-10-12 15:21:18,711 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1098888.0, ans=0.0 2023-10-12 15:21:19,941 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1098888.0, ans=0.2 2023-10-12 15:21:24,820 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1098888.0, ans=0.125 2023-10-12 15:21:43,472 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1098981.3333333333, ans=0.1 2023-10-12 15:21:53,072 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1099028.0, ans=0.0 2023-10-12 15:22:08,858 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.19 vs. limit=15.0 2023-10-12 15:22:35,455 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1099168.0, ans=0.1 2023-10-12 15:22:38,834 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1099214.6666666667, ans=0.125 2023-10-12 15:22:46,823 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1099214.6666666667, ans=0.1 2023-10-12 15:22:53,971 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1099261.3333333333, ans=0.1 2023-10-12 15:23:11,814 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.458e+02 1.736e+02 1.877e+02 2.090e+02 3.044e+02, threshold=3.754e+02, percent-clipped=0.0 2023-10-12 15:23:12,215 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1099354.6666666667, ans=0.2 2023-10-12 15:23:49,455 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1099494.6666666667, ans=0.125 2023-10-12 15:23:50,546 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.42 vs. 
limit=12.0 2023-10-12 15:24:07,228 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1099588.0, ans=0.125 2023-10-12 15:24:16,838 INFO [train.py:1031] (3/4) Epoch 18, batch 3500, loss[loss=0.1991, simple_loss=0.2961, pruned_loss=0.05101, over 16876.00 frames. ], tot_loss[loss=0.193, simple_loss=0.2835, pruned_loss=0.05124, over 27109377.00 frames. ], batch size: 146, lr: 1.94e-03, grad_scale: 16.0 2023-10-12 15:24:21,416 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1099634.6666666667, ans=0.0 2023-10-12 15:24:21,463 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1099634.6666666667, ans=0.125 2023-10-12 15:24:34,857 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1099681.3333333333, ans=0.0 2023-10-12 15:24:47,191 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1099728.0, ans=0.125 2023-10-12 15:24:59,901 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.418e+02 1.741e+02 1.896e+02 2.170e+02 3.188e+02, threshold=3.792e+02, percent-clipped=0.0 2023-10-12 15:25:01,938 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1099821.3333333333, ans=0.125 2023-10-12 15:25:23,424 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1099914.6666666667, ans=0.125 2023-10-12 15:25:25,079 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1099914.6666666667, ans=0.125 2023-10-12 15:25:30,931 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.25 vs. limit=15.0 2023-10-12 15:25:32,518 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.88 vs. limit=12.0 2023-10-12 15:25:34,904 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.24 vs. limit=10.0 2023-10-12 15:25:43,014 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1099961.3333333333, ans=0.0 2023-10-12 15:25:54,803 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1100008.0, ans=0.0 2023-10-12 15:26:15,426 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.84 vs. 
limit=10.0 2023-10-12 15:26:28,998 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1100148.0, ans=0.125 2023-10-12 15:26:30,886 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1100148.0, ans=0.0 2023-10-12 15:26:31,964 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1100148.0, ans=0.125 2023-10-12 15:26:34,833 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1100194.6666666667, ans=0.1 2023-10-12 15:26:38,586 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1100194.6666666667, ans=0.0 2023-10-12 15:26:46,549 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1100241.3333333333, ans=0.1 2023-10-12 15:26:51,263 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1100241.3333333333, ans=0.0 2023-10-12 15:26:59,422 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.419e+02 1.724e+02 1.916e+02 2.174e+02 3.079e+02, threshold=3.832e+02, percent-clipped=0.0 2023-10-12 15:27:42,613 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1100474.6666666667, ans=0.2 2023-10-12 15:28:04,604 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1100568.0, ans=0.2 2023-10-12 15:28:27,178 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1100661.3333333333, ans=0.125 2023-10-12 15:28:56,975 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.444e+02 1.740e+02 1.882e+02 2.049e+02 2.993e+02, threshold=3.765e+02, percent-clipped=0.0 2023-10-12 15:29:19,982 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1100848.0, ans=0.0 2023-10-12 15:29:26,970 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.29 vs. 
limit=15.0 2023-10-12 15:29:48,070 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1100941.3333333333, ans=0.0 2023-10-12 15:29:57,814 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1100988.0, ans=0.1 2023-10-12 15:30:01,840 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1100988.0, ans=0.125 2023-10-12 15:30:10,243 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1101034.6666666667, ans=0.0 2023-10-12 15:30:15,018 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1101081.3333333333, ans=0.0 2023-10-12 15:30:20,294 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.whiten.whitening_limit, batch_count=1101081.3333333333, ans=12.0 2023-10-12 15:30:24,728 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1101081.3333333333, ans=0.125 2023-10-12 15:30:31,863 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1101128.0, ans=0.125 2023-10-12 15:30:41,604 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1101174.6666666667, ans=0.125 2023-10-12 15:30:41,651 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1101174.6666666667, ans=0.125 2023-10-12 15:30:48,553 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.404e+02 1.723e+02 1.873e+02 2.043e+02 2.937e+02, threshold=3.747e+02, percent-clipped=0.0 2023-10-12 15:31:02,468 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1101268.0, ans=0.125 2023-10-12 15:31:21,110 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1101361.3333333333, ans=0.2 2023-10-12 15:31:25,099 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1101361.3333333333, ans=0.125 2023-10-12 15:31:38,446 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1101408.0, ans=0.1 2023-10-12 15:32:04,663 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1101548.0, ans=0.0 2023-10-12 15:32:08,062 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1101548.0, ans=0.125 2023-10-12 15:32:14,276 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.21 vs. limit=22.5 2023-10-12 15:32:22,816 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.73 vs. 
limit=15.0 2023-10-12 15:32:38,339 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.288e+02 1.687e+02 1.851e+02 2.056e+02 3.533e+02, threshold=3.703e+02, percent-clipped=0.0 2023-10-12 15:32:39,612 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1101688.0, ans=0.0 2023-10-12 15:32:59,332 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff3.min_abs, batch_count=1101781.3333333333, ans=0.2 2023-10-12 15:33:05,770 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.79 vs. limit=15.0 2023-10-12 15:33:09,783 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1101828.0, ans=0.125 2023-10-12 15:33:18,739 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.74 vs. limit=15.0 2023-10-12 15:33:44,670 INFO [train.py:1031] (3/4) Epoch 18, batch 4000, loss[loss=0.1918, simple_loss=0.2597, pruned_loss=0.06197, over 12514.00 frames. ], tot_loss[loss=0.1931, simple_loss=0.2833, pruned_loss=0.0515, over 28334030.28 frames. ], batch size: 440, lr: 1.94e-03, grad_scale: 32.0 2023-10-12 15:34:10,235 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.47 vs. limit=6.0 2023-10-12 15:34:23,920 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1102108.0, ans=0.0 2023-10-12 15:34:25,785 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1102108.0, ans=0.0 2023-10-12 15:34:32,952 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.418e+02 1.729e+02 1.841e+02 2.139e+02 3.085e+02, threshold=3.682e+02, percent-clipped=0.0 2023-10-12 15:34:43,680 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1102201.3333333333, ans=0.0 2023-10-12 15:35:07,316 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1102294.6666666667, ans=0.125 2023-10-12 15:35:14,676 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.08 vs. limit=15.0 2023-10-12 15:35:55,528 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 15:36:03,406 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1102528.0, ans=0.0 2023-10-12 15:36:12,857 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.79 vs. 
limit=22.5 2023-10-12 15:36:25,657 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.485e+02 1.820e+02 1.969e+02 2.213e+02 3.303e+02, threshold=3.938e+02, percent-clipped=0.0 2023-10-12 15:36:30,402 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1102621.3333333333, ans=0.125 2023-10-12 15:36:40,861 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.89 vs. limit=6.0 2023-10-12 15:36:56,272 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 15:37:00,757 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1102761.3333333333, ans=0.125 2023-10-12 15:37:16,220 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1102761.3333333333, ans=0.0 2023-10-12 15:37:23,855 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=1102808.0, ans=0.05 2023-10-12 15:37:38,644 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.22 vs. limit=10.0 2023-10-12 15:37:45,770 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1102901.3333333333, ans=0.015 2023-10-12 15:37:58,475 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1102948.0, ans=0.1 2023-10-12 15:38:12,245 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 15:38:30,831 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.362e+02 1.717e+02 1.855e+02 2.118e+02 2.850e+02, threshold=3.710e+02, percent-clipped=0.0 2023-10-12 15:39:04,909 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1103228.0, ans=0.0 2023-10-12 15:39:05,845 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1103228.0, ans=0.0 2023-10-12 15:39:09,497 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1103274.6666666667, ans=0.0 2023-10-12 15:39:21,648 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1103321.3333333333, ans=0.2 2023-10-12 15:39:28,300 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1103321.3333333333, ans=0.125 2023-10-12 15:39:35,179 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1103368.0, ans=0.0 2023-10-12 15:40:11,050 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.39 vs. 
limit=15.0 2023-10-12 15:40:17,525 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.436e+02 1.727e+02 1.887e+02 2.055e+02 2.803e+02, threshold=3.775e+02, percent-clipped=0.0 2023-10-12 15:40:20,600 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1103554.6666666667, ans=0.125 2023-10-12 15:40:35,363 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1103648.0, ans=0.1 2023-10-12 15:41:09,787 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1103741.3333333333, ans=10.0 2023-10-12 15:41:16,537 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1103788.0, ans=0.1 2023-10-12 15:41:21,447 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1103788.0, ans=0.125 2023-10-12 15:41:49,623 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 15:41:49,631 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1103928.0, ans=0.0 2023-10-12 15:41:59,361 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1103974.6666666667, ans=0.1 2023-10-12 15:41:59,665 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.20 vs. limit=15.0 2023-10-12 15:42:17,197 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.503e+02 1.799e+02 1.961e+02 2.144e+02 2.702e+02, threshold=3.923e+02, percent-clipped=0.0 2023-10-12 15:42:32,877 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1104068.0, ans=0.125 2023-10-12 15:42:36,015 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1104068.0, ans=0.125 2023-10-12 15:42:57,206 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.53 vs. limit=15.0 2023-10-12 15:43:11,386 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1104208.0, ans=0.0 2023-10-12 15:43:25,058 INFO [train.py:1031] (3/4) Epoch 18, batch 4500, loss[loss=0.1927, simple_loss=0.2772, pruned_loss=0.05411, over 16855.00 frames. ], tot_loss[loss=0.193, simple_loss=0.2835, pruned_loss=0.05124, over 29331961.28 frames. ], batch size: 110, lr: 1.94e-03, grad_scale: 32.0 2023-10-12 15:43:32,492 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1104301.3333333333, ans=0.1 2023-10-12 15:43:44,200 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1104348.0, ans=0.04949747468305833 2023-10-12 15:44:07,799 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.62 vs. 
limit=10.0 2023-10-12 15:44:11,004 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1104488.0, ans=0.125 2023-10-12 15:44:13,143 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.410e+02 1.752e+02 1.856e+02 2.073e+02 2.905e+02, threshold=3.711e+02, percent-clipped=0.0 2023-10-12 15:44:30,097 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.65 vs. limit=12.0 2023-10-12 15:44:30,919 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1104581.3333333333, ans=0.0 2023-10-12 15:44:38,714 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1104581.3333333333, ans=0.0 2023-10-12 15:44:38,932 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.93 vs. limit=22.5 2023-10-12 15:44:47,223 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1104628.0, ans=0.125 2023-10-12 15:44:57,904 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1104674.6666666667, ans=0.125 2023-10-12 15:45:06,328 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1104721.3333333333, ans=0.1 2023-10-12 15:45:17,426 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1104768.0, ans=0.1 2023-10-12 15:45:26,054 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1104814.6666666667, ans=0.125 2023-10-12 15:45:27,016 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1104814.6666666667, ans=0.0 2023-10-12 15:45:31,886 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.60 vs. limit=15.0 2023-10-12 15:45:39,951 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1104861.3333333333, ans=0.0 2023-10-12 15:45:51,805 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1104908.0, ans=0.125 2023-10-12 15:45:57,088 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.423e+02 1.805e+02 2.037e+02 2.265e+02 3.346e+02, threshold=4.074e+02, percent-clipped=0.0 2023-10-12 15:46:07,280 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1105001.3333333333, ans=0.0 2023-10-12 15:46:33,585 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1105094.6666666667, ans=0.2 2023-10-12 15:46:41,231 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.08 vs. 
limit=15.0 2023-10-12 15:46:56,384 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1105188.0, ans=0.125 2023-10-12 15:47:12,744 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.whiten.whitening_limit, batch_count=1105281.3333333333, ans=12.0 2023-10-12 15:47:20,526 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.53 vs. limit=5.0 2023-10-12 15:47:31,689 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1105374.6666666667, ans=0.125 2023-10-12 15:47:39,022 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.60 vs. limit=15.0 2023-10-12 15:47:39,669 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1105374.6666666667, ans=0.2 2023-10-12 15:47:44,576 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.445e+02 1.736e+02 1.986e+02 2.156e+02 3.327e+02, threshold=3.971e+02, percent-clipped=0.0 2023-10-12 15:47:45,107 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.03 vs. limit=15.0 2023-10-12 15:47:52,008 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten.whitening_limit, batch_count=1105468.0, ans=15.0 2023-10-12 15:48:01,741 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1105514.6666666667, ans=0.125 2023-10-12 15:48:10,175 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1105514.6666666667, ans=0.125 2023-10-12 15:48:32,915 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1105654.6666666667, ans=0.125 2023-10-12 15:48:38,026 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.36 vs. 
limit=6.0 2023-10-12 15:48:48,835 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1105701.3333333333, ans=0.0 2023-10-12 15:48:50,297 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1105701.3333333333, ans=0.05 2023-10-12 15:49:10,387 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1105794.6666666667, ans=0.125 2023-10-12 15:49:30,267 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1105841.3333333333, ans=0.125 2023-10-12 15:49:37,420 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.419e+02 1.739e+02 1.981e+02 2.234e+02 3.210e+02, threshold=3.962e+02, percent-clipped=0.0 2023-10-12 15:49:43,027 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1105934.6666666667, ans=0.0 2023-10-12 15:50:02,604 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.65 vs. limit=15.0 2023-10-12 15:50:05,245 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1106028.0, ans=0.0 2023-10-12 15:50:16,134 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1106074.6666666667, ans=0.0 2023-10-12 15:50:22,973 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1106074.6666666667, ans=0.125 2023-10-12 15:50:28,371 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1106121.3333333333, ans=0.125 2023-10-12 15:50:29,441 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1106121.3333333333, ans=0.125 2023-10-12 15:50:31,857 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1106121.3333333333, ans=0.125 2023-10-12 15:50:35,170 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1106121.3333333333, ans=0.0 2023-10-12 15:50:44,541 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=1106168.0, ans=15.0 2023-10-12 15:50:46,744 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1106168.0, ans=0.0 2023-10-12 15:50:54,384 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1106214.6666666667, ans=0.125 2023-10-12 15:51:01,507 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1106261.3333333333, ans=0.0 2023-10-12 15:51:24,715 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1106308.0, ans=10.0 2023-10-12 15:51:30,029 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1106354.6666666667, ans=0.125 2023-10-12 15:51:33,426 INFO [optim.py:471] (3/4) Clipping_scale=2.0, 
grad-norm quartiles 1.475e+02 1.758e+02 1.897e+02 2.168e+02 3.565e+02, threshold=3.793e+02, percent-clipped=0.0 2023-10-12 15:51:37,370 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1106354.6666666667, ans=0.1 2023-10-12 15:52:12,538 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1106541.3333333333, ans=0.125 2023-10-12 15:52:21,671 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1106588.0, ans=0.125 2023-10-12 15:52:29,537 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1106588.0, ans=0.1 2023-10-12 15:52:33,406 INFO [train.py:1031] (3/4) Epoch 18, batch 5000, loss[loss=0.1974, simple_loss=0.282, pruned_loss=0.05637, over 16652.00 frames. ], tot_loss[loss=0.193, simple_loss=0.2833, pruned_loss=0.05139, over 30103399.24 frames. ], batch size: 56, lr: 1.94e-03, grad_scale: 16.0 2023-10-12 15:52:53,471 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.23 vs. limit=15.0 2023-10-12 15:53:05,732 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1106774.6666666667, ans=0.125 2023-10-12 15:53:22,051 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.402e+02 1.721e+02 1.945e+02 2.208e+02 3.554e+02, threshold=3.890e+02, percent-clipped=0.0 2023-10-12 15:53:45,092 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1106914.6666666667, ans=0.07 2023-10-12 15:53:50,038 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.36 vs. limit=15.0 2023-10-12 15:54:22,325 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.21 vs. limit=15.0 2023-10-12 15:54:25,292 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.09 vs. limit=10.0 2023-10-12 15:54:29,418 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=3.91 vs. 
limit=12.0 2023-10-12 15:54:32,364 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=1107101.3333333333, ans=0.025 2023-10-12 15:54:42,260 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1107148.0, ans=0.0 2023-10-12 15:54:50,509 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1107194.6666666667, ans=0.1 2023-10-12 15:54:55,543 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1107194.6666666667, ans=0.07 2023-10-12 15:55:03,663 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1107241.3333333333, ans=0.125 2023-10-12 15:55:04,594 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1107241.3333333333, ans=0.125 2023-10-12 15:55:11,802 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.504e+02 1.741e+02 1.903e+02 2.102e+02 3.010e+02, threshold=3.805e+02, percent-clipped=0.0 2023-10-12 15:55:13,079 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1107288.0, ans=0.125 2023-10-12 15:55:25,815 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1107334.6666666667, ans=0.2 2023-10-12 15:55:31,658 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1107381.3333333333, ans=0.0 2023-10-12 15:55:50,659 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=1107474.6666666667, ans=0.025 2023-10-12 15:56:02,582 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1107521.3333333333, ans=0.125 2023-10-12 15:56:15,879 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.60 vs. 
limit=15.0 2023-10-12 15:56:42,490 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 15:56:47,246 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1107708.0, ans=0.0 2023-10-12 15:56:52,923 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 15:56:56,723 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.498e+02 1.830e+02 2.006e+02 2.312e+02 2.939e+02, threshold=4.012e+02, percent-clipped=0.0 2023-10-12 15:56:58,931 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1107754.6666666667, ans=0.1 2023-10-12 15:56:59,951 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1107754.6666666667, ans=0.0 2023-10-12 15:57:04,651 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1107801.3333333333, ans=0.2 2023-10-12 15:57:07,578 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1107801.3333333333, ans=0.1 2023-10-12 15:57:14,499 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1107801.3333333333, ans=0.125 2023-10-12 15:57:14,536 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1107801.3333333333, ans=0.09899494936611666 2023-10-12 15:58:12,564 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1108081.3333333333, ans=0.125 2023-10-12 15:58:21,952 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.35 vs. limit=12.0 2023-10-12 15:58:22,792 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1108128.0, ans=0.0 2023-10-12 15:58:24,677 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1108128.0, ans=0.1 2023-10-12 15:58:24,811 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.34 vs. limit=15.0 2023-10-12 15:58:26,868 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1108128.0, ans=0.1 2023-10-12 15:58:26,904 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1108128.0, ans=0.125 2023-10-12 15:58:29,472 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.82 vs. limit=15.0 2023-10-12 15:58:40,530 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.59 vs. 
limit=22.5 2023-10-12 15:58:45,948 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1108174.6666666667, ans=0.05 2023-10-12 15:58:46,739 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1108174.6666666667, ans=0.07 2023-10-12 15:58:53,140 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1108221.3333333333, ans=0.1 2023-10-12 15:58:53,164 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 15:58:55,210 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.376e+02 1.654e+02 1.822e+02 2.013e+02 3.284e+02, threshold=3.644e+02, percent-clipped=0.0 2023-10-12 15:58:55,570 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1108221.3333333333, ans=0.125 2023-10-12 15:58:56,526 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 15:59:12,280 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 15:59:28,817 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1108361.3333333333, ans=0.125 2023-10-12 15:59:36,461 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1108408.0, ans=0.1 2023-10-12 15:59:37,375 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1108408.0, ans=0.125 2023-10-12 15:59:41,998 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1108408.0, ans=0.125 2023-10-12 15:59:45,471 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1108454.6666666667, ans=0.0 2023-10-12 15:59:49,102 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1108454.6666666667, ans=0.0 2023-10-12 16:00:00,243 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.87 vs. 
limit=6.0 2023-10-12 16:00:40,603 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.464e+02 1.663e+02 1.798e+02 1.970e+02 2.793e+02, threshold=3.596e+02, percent-clipped=0.0 2023-10-12 16:00:42,837 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1108688.0, ans=0.05 2023-10-12 16:00:50,887 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1108734.6666666667, ans=0.2 2023-10-12 16:00:52,643 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1108734.6666666667, ans=0.1 2023-10-12 16:01:21,059 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1108874.6666666667, ans=0.125 2023-10-12 16:01:29,895 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.43 vs. limit=12.0 2023-10-12 16:01:30,456 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1108921.3333333333, ans=0.125 2023-10-12 16:01:39,406 INFO [train.py:1031] (3/4) Epoch 18, batch 5500, loss[loss=0.1731, simple_loss=0.2699, pruned_loss=0.03816, over 16932.00 frames. ], tot_loss[loss=0.193, simple_loss=0.2833, pruned_loss=0.05138, over 30702639.62 frames. ], batch size: 165, lr: 1.94e-03, grad_scale: 16.0 2023-10-12 16:02:02,405 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1109061.3333333333, ans=0.0 2023-10-12 16:02:22,061 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1109154.6666666667, ans=0.125 2023-10-12 16:02:25,230 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.484e+02 1.790e+02 1.968e+02 2.182e+02 3.087e+02, threshold=3.936e+02, percent-clipped=0.0 2023-10-12 16:02:25,840 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.90 vs. limit=15.0 2023-10-12 16:02:52,593 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1109294.6666666667, ans=0.125 2023-10-12 16:02:53,648 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1109294.6666666667, ans=0.0 2023-10-12 16:03:03,540 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.22 vs. limit=15.0 2023-10-12 16:03:07,941 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1109341.3333333333, ans=0.125 2023-10-12 16:03:12,435 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.92 vs. 
limit=15.0 2023-10-12 16:03:16,290 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1109388.0, ans=0.125 2023-10-12 16:03:18,870 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1109388.0, ans=0.125 2023-10-12 16:03:18,937 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1109388.0, ans=0.125 2023-10-12 16:03:32,841 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1109434.6666666667, ans=0.0 2023-10-12 16:03:38,328 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1109481.3333333333, ans=0.1 2023-10-12 16:03:45,736 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1109481.3333333333, ans=0.125 2023-10-12 16:03:58,281 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.97 vs. limit=15.0 2023-10-12 16:03:58,877 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1109528.0, ans=0.125 2023-10-12 16:03:58,939 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1109528.0, ans=0.125 2023-10-12 16:04:09,806 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1109621.3333333333, ans=0.125 2023-10-12 16:04:10,971 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1109621.3333333333, ans=0.0 2023-10-12 16:04:16,707 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.509e+02 1.730e+02 1.930e+02 2.156e+02 3.440e+02, threshold=3.860e+02, percent-clipped=0.0 2023-10-12 16:04:19,548 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1109621.3333333333, ans=0.0 2023-10-12 16:04:24,011 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1109668.0, ans=0.0 2023-10-12 16:04:33,774 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1109714.6666666667, ans=0.125 2023-10-12 16:04:50,690 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.63 vs. limit=8.0 2023-10-12 16:04:52,226 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.71 vs. limit=22.5 2023-10-12 16:05:49,979 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.60 vs. 
limit=10.0 2023-10-12 16:06:08,865 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.502e+02 1.782e+02 1.937e+02 2.101e+02 2.805e+02, threshold=3.874e+02, percent-clipped=0.0 2023-10-12 16:06:21,693 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1110134.6666666667, ans=0.0 2023-10-12 16:06:22,917 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1110134.6666666667, ans=0.125 2023-10-12 16:06:44,405 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.17 vs. limit=15.0 2023-10-12 16:06:56,882 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1110274.6666666667, ans=0.0 2023-10-12 16:06:57,853 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1110274.6666666667, ans=0.1 2023-10-12 16:07:03,793 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1110321.3333333333, ans=0.05 2023-10-12 16:07:05,397 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1110321.3333333333, ans=0.0 2023-10-12 16:07:06,210 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1110321.3333333333, ans=0.125 2023-10-12 16:07:07,116 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1110321.3333333333, ans=0.035 2023-10-12 16:07:12,803 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.26 vs. limit=15.0 2023-10-12 16:07:22,064 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1110368.0, ans=0.125 2023-10-12 16:07:24,668 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1110414.6666666667, ans=0.0 2023-10-12 16:07:27,440 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.26 vs. limit=6.0 2023-10-12 16:07:31,811 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.82 vs. limit=15.0 2023-10-12 16:07:35,060 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1110461.3333333333, ans=0.125 2023-10-12 16:07:44,773 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1110461.3333333333, ans=0.125 2023-10-12 16:07:54,790 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1110508.0, ans=0.2 2023-10-12 16:08:01,075 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1110554.6666666667, ans=0.125 2023-10-12 16:08:02,473 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=18.65 vs. 
limit=22.5 2023-10-12 16:08:04,457 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.488e+02 1.755e+02 1.935e+02 2.257e+02 2.969e+02, threshold=3.869e+02, percent-clipped=0.0 2023-10-12 16:08:43,125 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=1110694.6666666667, ans=0.05 2023-10-12 16:08:44,524 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1110694.6666666667, ans=0.1 2023-10-12 16:08:47,367 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1110694.6666666667, ans=0.0 2023-10-12 16:09:02,502 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1110788.0, ans=0.0 2023-10-12 16:09:05,100 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1110788.0, ans=0.125 2023-10-12 16:09:11,945 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1110788.0, ans=10.0 2023-10-12 16:09:30,789 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1110881.3333333333, ans=0.125 2023-10-12 16:09:37,959 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.59 vs. limit=15.0 2023-10-12 16:09:58,985 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1110974.6666666667, ans=0.09899494936611666 2023-10-12 16:10:10,649 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.435e+02 1.682e+02 1.873e+02 2.115e+02 2.878e+02, threshold=3.746e+02, percent-clipped=0.0 2023-10-12 16:10:17,080 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.29 vs. limit=10.0 2023-10-12 16:10:17,701 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1111068.0, ans=0.0 2023-10-12 16:10:18,763 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1111068.0, ans=0.125 2023-10-12 16:10:23,065 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1111068.0, ans=0.0 2023-10-12 16:10:24,936 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1111068.0, ans=0.1 2023-10-12 16:10:48,871 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1111208.0, ans=0.125 2023-10-12 16:11:09,056 INFO [train.py:1031] (3/4) Epoch 18, batch 6000, loss[loss=0.22, simple_loss=0.2744, pruned_loss=0.08279, over 12706.00 frames. ], tot_loss[loss=0.1935, simple_loss=0.2837, pruned_loss=0.05164, over 31171206.05 frames. 
], batch size: 440, lr: 1.93e-03, grad_scale: 32.0 2023-10-12 16:11:27,933 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1111348.0, ans=0.0 2023-10-12 16:11:40,018 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1111394.6666666667, ans=0.2 2023-10-12 16:11:49,031 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1111441.3333333333, ans=0.125 2023-10-12 16:11:50,771 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1111441.3333333333, ans=0.2 2023-10-12 16:11:54,223 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1111441.3333333333, ans=0.0 2023-10-12 16:12:06,856 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.502e+02 1.733e+02 1.952e+02 2.091e+02 4.505e+02, threshold=3.904e+02, percent-clipped=1.0 2023-10-12 16:12:07,957 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=10.09 vs. limit=22.5 2023-10-12 16:12:32,563 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.05 vs. limit=22.5 2023-10-12 16:13:02,197 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.08 vs. limit=15.0 2023-10-12 16:13:09,522 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.83 vs. limit=15.0 2023-10-12 16:13:50,224 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1111954.6666666667, ans=0.2 2023-10-12 16:13:55,181 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.432e+02 1.749e+02 1.858e+02 2.080e+02 3.062e+02, threshold=3.715e+02, percent-clipped=0.0 2023-10-12 16:14:18,151 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=1112048.0, ans=10.0 2023-10-12 16:14:23,156 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1112094.6666666667, ans=0.125 2023-10-12 16:14:35,807 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1112094.6666666667, ans=0.1 2023-10-12 16:14:41,801 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.78 vs. limit=15.0 2023-10-12 16:14:47,509 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1112188.0, ans=0.125 2023-10-12 16:14:59,283 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.29 vs. 
limit=10.0 2023-10-12 16:15:24,945 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1112328.0, ans=0.0 2023-10-12 16:15:40,930 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1112374.6666666667, ans=0.125 2023-10-12 16:15:41,118 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.83 vs. limit=15.0 2023-10-12 16:15:52,465 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.508e+02 1.761e+02 1.901e+02 2.096e+02 3.156e+02, threshold=3.802e+02, percent-clipped=0.0 2023-10-12 16:16:03,702 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1112468.0, ans=0.125 2023-10-12 16:16:04,805 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1112468.0, ans=0.0 2023-10-12 16:16:14,109 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.14 vs. limit=10.0 2023-10-12 16:16:17,277 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1112514.6666666667, ans=0.125 2023-10-12 16:16:22,835 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1112561.3333333333, ans=0.125 2023-10-12 16:16:25,907 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1112561.3333333333, ans=0.2 2023-10-12 16:16:26,954 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1112561.3333333333, ans=0.2 2023-10-12 16:16:37,402 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1112608.0, ans=0.0 2023-10-12 16:16:40,636 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1112608.0, ans=0.125 2023-10-12 16:16:43,614 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1112654.6666666667, ans=0.125 2023-10-12 16:16:43,727 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1112654.6666666667, ans=0.2 2023-10-12 16:16:51,035 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1112654.6666666667, ans=0.125 2023-10-12 16:17:02,181 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1112701.3333333333, ans=0.125 2023-10-12 16:17:03,272 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.00 vs. limit=22.5 2023-10-12 16:17:22,318 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1112794.6666666667, ans=0.2 2023-10-12 16:17:35,138 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.74 vs. 
limit=15.0 2023-10-12 16:17:46,831 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.540e+02 1.818e+02 2.008e+02 2.184e+02 2.882e+02, threshold=4.015e+02, percent-clipped=0.0 2023-10-12 16:17:49,494 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1112888.0, ans=0.1 2023-10-12 16:18:01,818 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1112934.6666666667, ans=0.125 2023-10-12 16:18:34,205 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1113074.6666666667, ans=0.125 2023-10-12 16:18:41,878 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1113121.3333333333, ans=0.125 2023-10-12 16:18:49,671 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1113121.3333333333, ans=0.125 2023-10-12 16:18:50,513 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1113121.3333333333, ans=0.125 2023-10-12 16:19:16,209 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1113261.3333333333, ans=0.2 2023-10-12 16:19:19,536 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.60 vs. limit=15.0 2023-10-12 16:19:29,760 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1113308.0, ans=0.0 2023-10-12 16:19:34,079 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1113308.0, ans=0.0 2023-10-12 16:19:41,982 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.09 vs. limit=15.0 2023-10-12 16:19:44,817 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.482e+02 1.749e+02 1.947e+02 2.162e+02 3.496e+02, threshold=3.894e+02, percent-clipped=0.0 2023-10-12 16:19:53,354 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.22 vs. limit=12.0 2023-10-12 16:19:58,658 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1113448.0, ans=0.125 2023-10-12 16:20:01,663 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.33 vs. limit=15.0 2023-10-12 16:20:09,325 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1113494.6666666667, ans=0.125 2023-10-12 16:20:24,259 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1113541.3333333333, ans=0.125 2023-10-12 16:20:40,453 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.29 vs. limit=15.0 2023-10-12 16:20:46,816 INFO [train.py:1031] (3/4) Epoch 18, batch 6500, loss[loss=0.1844, simple_loss=0.2794, pruned_loss=0.04465, over 16908.00 frames. 
], tot_loss[loss=0.1936, simple_loss=0.284, pruned_loss=0.0516, over 31527942.91 frames. ], batch size: 98, lr: 1.93e-03, grad_scale: 32.0 2023-10-12 16:20:54,200 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1113634.6666666667, ans=0.2 2023-10-12 16:21:22,130 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1113728.0, ans=0.125 2023-10-12 16:21:29,613 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.66 vs. limit=10.0 2023-10-12 16:21:49,661 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.511e+02 1.778e+02 1.940e+02 2.143e+02 3.146e+02, threshold=3.881e+02, percent-clipped=0.0 2023-10-12 16:21:50,865 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=1113821.3333333333, ans=0.5 2023-10-12 16:21:58,260 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1113868.0, ans=0.1 2023-10-12 16:22:20,303 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1113961.3333333333, ans=0.1 2023-10-12 16:22:36,548 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1114008.0, ans=0.125 2023-10-12 16:23:33,815 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1114288.0, ans=0.0 2023-10-12 16:23:38,171 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1114288.0, ans=0.125 2023-10-12 16:23:38,944 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.454e+02 1.793e+02 1.979e+02 2.241e+02 3.054e+02, threshold=3.958e+02, percent-clipped=0.0 2023-10-12 16:23:43,682 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1114334.6666666667, ans=0.125 2023-10-12 16:23:47,302 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.91 vs. limit=12.0 2023-10-12 16:24:04,458 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 16:24:11,609 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.17 vs. limit=15.0 2023-10-12 16:24:17,643 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.64 vs. limit=15.0 2023-10-12 16:24:24,259 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1114474.6666666667, ans=0.125 2023-10-12 16:24:29,723 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1114521.3333333333, ans=0.0 2023-10-12 16:24:40,254 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.25 vs. 
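
Across the train.py:1031 summaries in this stretch, the headline loss is consistent with the pruned-transducer combination loss = 0.5 * simple_loss + pruned_loss, e.g. 0.5 * 0.284 + 0.0516 = 0.1936 for the batch-6500 totals just above. A quick check of that relation against the logged values:

# (loss, simple_loss, pruned_loss) triples copied from train.py:1031 lines
records = [
    (0.1936, 0.2840, 0.05160),  # epoch 18, batch 6500 (tot_loss)
    (0.1935, 0.2844, 0.05133),  # epoch 18, batch 7000 (tot_loss)
    (0.1924, 0.2835, 0.05069),  # epoch 18, batch 8000 (tot_loss)
]
for loss, simple, pruned in records:
    assert abs(loss - (0.5 * simple + pruned)) < 1e-3
print("loss == 0.5 * simple_loss + pruned_loss (within printed rounding)")
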
limit=15.0 2023-10-12 16:25:13,362 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1114708.0, ans=0.1 2023-10-12 16:25:20,261 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1114754.6666666667, ans=0.1 2023-10-12 16:25:24,514 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.75 vs. limit=10.0 2023-10-12 16:25:29,180 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.396e+02 1.676e+02 1.862e+02 2.166e+02 3.279e+02, threshold=3.723e+02, percent-clipped=0.0 2023-10-12 16:26:00,376 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.51 vs. limit=15.0 2023-10-12 16:26:43,370 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1115034.6666666667, ans=0.125 2023-10-12 16:26:53,149 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1115081.3333333333, ans=0.125 2023-10-12 16:27:04,871 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1115081.3333333333, ans=0.125 2023-10-12 16:27:10,874 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.22 vs. limit=15.0 2023-10-12 16:27:39,118 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.400e+02 1.692e+02 1.885e+02 2.165e+02 2.971e+02, threshold=3.770e+02, percent-clipped=0.0 2023-10-12 16:28:05,116 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1115361.3333333333, ans=0.2 2023-10-12 16:28:06,022 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 16:28:55,339 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.65 vs. limit=15.0 2023-10-12 16:28:59,569 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1115594.6666666667, ans=0.0 2023-10-12 16:29:00,293 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1115594.6666666667, ans=0.125 2023-10-12 16:29:00,417 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1115594.6666666667, ans=0.125 2023-10-12 16:29:02,732 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1115594.6666666667, ans=0.0 2023-10-12 16:29:02,827 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.59 vs. 
limit=6.0 2023-10-12 16:29:09,538 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1115641.3333333333, ans=0.125 2023-10-12 16:29:14,233 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1115641.3333333333, ans=0.125 2023-10-12 16:29:17,099 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1115641.3333333333, ans=0.1 2023-10-12 16:29:29,015 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.456e+02 1.746e+02 1.923e+02 2.163e+02 2.844e+02, threshold=3.846e+02, percent-clipped=0.0 2023-10-12 16:29:29,346 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 16:29:33,088 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1115734.6666666667, ans=0.125 2023-10-12 16:29:34,246 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1115734.6666666667, ans=0.125 2023-10-12 16:29:50,255 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1115781.3333333333, ans=0.1 2023-10-12 16:29:51,208 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1115828.0, ans=0.1 2023-10-12 16:29:52,396 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.48 vs. limit=15.0 2023-10-12 16:30:00,194 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1115828.0, ans=0.125 2023-10-12 16:30:00,797 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1115828.0, ans=0.1 2023-10-12 16:30:08,171 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1115874.6666666667, ans=0.0 2023-10-12 16:30:16,145 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1115921.3333333333, ans=0.125 2023-10-12 16:30:20,953 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1115921.3333333333, ans=0.125 2023-10-12 16:30:23,573 INFO [train.py:1031] (3/4) Epoch 18, batch 7000, loss[loss=0.1729, simple_loss=0.2745, pruned_loss=0.0357, over 16924.00 frames. ], tot_loss[loss=0.1935, simple_loss=0.2844, pruned_loss=0.05133, over 31827352.69 frames. ], batch size: 87, lr: 1.93e-03, grad_scale: 16.0 2023-10-12 16:30:25,014 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1115968.0, ans=0.125 2023-10-12 16:30:55,084 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.28 vs. 
limit=5.0 2023-10-12 16:31:06,550 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1116014.6666666667, ans=0.125 2023-10-12 16:31:12,188 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1116061.3333333333, ans=0.0 2023-10-12 16:31:15,281 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.63 vs. limit=15.0 2023-10-12 16:31:26,005 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1116108.0, ans=0.125 2023-10-12 16:31:26,212 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.86 vs. limit=15.0 2023-10-12 16:31:41,043 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1116154.6666666667, ans=0.125 2023-10-12 16:31:44,489 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.501e+02 1.759e+02 1.878e+02 2.079e+02 2.701e+02, threshold=3.756e+02, percent-clipped=0.0 2023-10-12 16:31:45,490 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1116154.6666666667, ans=0.0 2023-10-12 16:31:55,581 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=10.16 vs. limit=22.5 2023-10-12 16:32:26,038 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1116294.6666666667, ans=0.125 2023-10-12 16:32:54,565 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1116434.6666666667, ans=0.125 2023-10-12 16:32:58,460 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1116434.6666666667, ans=0.0 2023-10-12 16:33:05,001 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1116481.3333333333, ans=0.0 2023-10-12 16:33:05,896 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1116481.3333333333, ans=0.125 2023-10-12 16:33:06,801 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1116481.3333333333, ans=0.0 2023-10-12 16:33:07,778 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1116481.3333333333, ans=0.0 2023-10-12 16:33:23,125 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1116528.0, ans=0.0 2023-10-12 16:33:23,167 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 16:33:40,213 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1116574.6666666667, ans=0.0 2023-10-12 16:33:46,181 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1116621.3333333333, ans=0.125 2023-10-12 16:33:46,295 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1116621.3333333333, ans=0.125 2023-10-12 16:33:51,990 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.512e+02 1.828e+02 1.944e+02 2.126e+02 2.908e+02, threshold=3.889e+02, percent-clipped=0.0 2023-10-12 16:33:57,906 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1116668.0, ans=0.125 2023-10-12 16:34:05,436 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1116668.0, ans=0.025 2023-10-12 16:34:08,963 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1116714.6666666667, ans=0.0 2023-10-12 16:34:17,339 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1116714.6666666667, ans=0.2 2023-10-12 16:34:21,418 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.58 vs. limit=15.0 2023-10-12 16:34:34,298 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1116808.0, ans=0.125 2023-10-12 16:34:48,000 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1116854.6666666667, ans=0.125 2023-10-12 16:34:54,731 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1116901.3333333333, ans=0.2 2023-10-12 16:35:18,030 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1116948.0, ans=0.1 2023-10-12 16:35:18,382 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.73 vs. limit=15.0 2023-10-12 16:35:23,127 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.28 vs. limit=15.0 2023-10-12 16:35:54,387 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.99 vs. limit=15.0 2023-10-12 16:35:59,829 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.491e+02 1.714e+02 1.917e+02 2.065e+02 2.614e+02, threshold=3.834e+02, percent-clipped=0.0 2023-10-12 16:36:08,819 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.50 vs. limit=15.0 2023-10-12 16:37:11,449 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1117368.0, ans=0.125 2023-10-12 16:37:17,547 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1117368.0, ans=0.125 2023-10-12 16:37:35,853 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1117461.3333333333, ans=0.125 2023-10-12 16:37:37,250 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=14.26 vs. 
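
On the scaling.py:979 Whitening lines: each reports a per-module statistic against a scheduled limit (metric=X vs. limit=Y). The metric evidently measures how far the channel covariance of a group of activations is from white (isotropic), with 1.0 meaning perfectly white and larger values meaning more concentrated variance; the module intervenes when the limit is exceeded. A sketch of one such measure under that reading — an illustration of the idea, not icefall's exact formula:

import torch

def whitening_metric(x, num_groups=1):
    """Illustrative anisotropy measure: mean squared covariance
    eigenvalue over squared mean eigenvalue; 1.0 for perfectly
    white features, larger when a few directions dominate."""
    n, c = x.shape
    metrics = []
    for g in x.chunk(num_groups, dim=1):
        cov = (g.T @ g) / n                # per-group channel covariance
        eigs = torch.linalg.eigvalsh(cov)  # real eigenvalues, ascending
        metrics.append((eigs ** 2).mean() / eigs.mean() ** 2)
    return torch.stack(metrics).mean()

x = torch.randn(1000, 384)                 # ~white activations
print(float(whitening_metric(x)))          # close to 1.0, well under limit=15.0
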
limit=22.5 2023-10-12 16:37:54,427 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1117508.0, ans=0.125 2023-10-12 16:38:06,224 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1117554.6666666667, ans=0.0 2023-10-12 16:38:11,504 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.406e+02 1.732e+02 1.887e+02 2.175e+02 5.635e+02, threshold=3.774e+02, percent-clipped=1.0 2023-10-12 16:38:15,337 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1117601.3333333333, ans=0.2 2023-10-12 16:38:25,792 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.22 vs. limit=22.5 2023-10-12 16:38:48,402 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=1117741.3333333333, ans=22.5 2023-10-12 16:39:03,305 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1117788.0, ans=0.0 2023-10-12 16:39:05,033 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1117788.0, ans=0.125 2023-10-12 16:39:08,726 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1117834.6666666667, ans=0.125 2023-10-12 16:39:15,860 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1117881.3333333333, ans=0.125 2023-10-12 16:39:40,215 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1117974.6666666667, ans=0.0 2023-10-12 16:39:56,249 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.14 vs. limit=10.0 2023-10-12 16:39:58,547 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.518e+02 1.794e+02 1.973e+02 2.304e+02 3.131e+02, threshold=3.945e+02, percent-clipped=0.0 2023-10-12 16:40:09,787 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.18 vs. limit=15.0 2023-10-12 16:40:21,253 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1118161.3333333333, ans=0.125 2023-10-12 16:40:38,834 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1118208.0, ans=0.125 2023-10-12 16:40:52,998 INFO [train.py:1031] (3/4) Epoch 18, batch 7500, loss[loss=0.1966, simple_loss=0.2871, pruned_loss=0.05307, over 16947.00 frames. ], tot_loss[loss=0.1932, simple_loss=0.284, pruned_loss=0.0512, over 32036245.21 frames. ], batch size: 77, lr: 1.93e-03, grad_scale: 8.0 2023-10-12 16:40:55,126 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.40 vs. limit=15.0 2023-10-12 16:41:37,770 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.33 vs. 
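
The grad_scale in the batch summaries is the dynamic loss scale of fp16 training: it halves from 32.0 (batch 6500) to 16.0 (batch 7000) to 8.0 (batch 7500 just above), and recovers to 32.0 later in the log, which is the usual GradScaler pattern of halving on overflowing steps and growing back after a stable run. A minimal sketch of that loop (model, optimizer and batch are placeholders):

import torch

scaler = torch.cuda.amp.GradScaler(init_scale=32.0)  # the grad_scale in the log

def training_step(model, optimizer, batch):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = model(batch)
    scaler.scale(loss).backward()  # backprop on the scaled loss
    scaler.step(optimizer)         # skips the update if grads hit inf/nan
    scaler.update()                # halves the scale on overflow, grows it later
    return loss.detach()
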
limit=6.0 2023-10-12 16:41:48,213 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.476e+02 1.851e+02 2.037e+02 2.314e+02 2.738e+02, threshold=4.075e+02, percent-clipped=0.0 2023-10-12 16:42:00,904 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1118581.3333333333, ans=0.125 2023-10-12 16:42:06,214 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1118581.3333333333, ans=0.0 2023-10-12 16:42:11,921 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1118628.0, ans=0.1 2023-10-12 16:42:27,105 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=1118674.6666666667, ans=0.05 2023-10-12 16:42:34,794 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1118721.3333333333, ans=0.125 2023-10-12 16:43:00,445 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1118814.6666666667, ans=0.125 2023-10-12 16:43:08,566 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.22 vs. limit=12.0 2023-10-12 16:43:28,798 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.43 vs. limit=10.0 2023-10-12 16:43:48,142 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1118954.6666666667, ans=0.125 2023-10-12 16:43:50,849 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1118954.6666666667, ans=0.0 2023-10-12 16:43:51,848 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1118954.6666666667, ans=0.0 2023-10-12 16:43:52,208 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.33 vs. limit=15.0 2023-10-12 16:43:53,648 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1118954.6666666667, ans=0.1 2023-10-12 16:43:55,184 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.472e+02 1.691e+02 1.888e+02 2.199e+02 3.249e+02, threshold=3.777e+02, percent-clipped=0.0 2023-10-12 16:43:55,586 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1119001.3333333333, ans=0.125 2023-10-12 16:43:58,297 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.27 vs. 
limit=6.0 2023-10-12 16:44:06,143 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1119048.0, ans=0.0 2023-10-12 16:44:14,271 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1119048.0, ans=0.07 2023-10-12 16:44:25,069 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1119094.6666666667, ans=0.125 2023-10-12 16:44:38,124 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.18 vs. limit=15.0 2023-10-12 16:44:38,851 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1119188.0, ans=0.125 2023-10-12 16:44:43,937 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1119188.0, ans=0.125 2023-10-12 16:44:44,801 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1119188.0, ans=0.125 2023-10-12 16:44:51,796 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.03 vs. limit=22.5 2023-10-12 16:45:49,037 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.24 vs. limit=10.0 2023-10-12 16:45:49,494 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1119468.0, ans=0.125 2023-10-12 16:45:49,664 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1119468.0, ans=0.125 2023-10-12 16:45:50,812 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.469e+02 1.704e+02 1.865e+02 2.081e+02 2.789e+02, threshold=3.729e+02, percent-clipped=0.0 2023-10-12 16:45:54,759 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1119468.0, ans=0.07 2023-10-12 16:46:12,397 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1119514.6666666667, ans=0.125 2023-10-12 16:46:29,369 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1119561.3333333333, ans=0.0 2023-10-12 16:46:35,923 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1119608.0, ans=0.0 2023-10-12 16:46:57,989 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1119701.3333333333, ans=0.0 2023-10-12 16:46:58,011 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1119701.3333333333, ans=0.1 2023-10-12 16:48:00,019 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1119794.6666666667, ans=0.0 2023-10-12 16:48:27,912 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.512e+02 1.774e+02 1.967e+02 2.159e+02 2.955e+02, threshold=3.933e+02, percent-clipped=0.0 2023-10-12 16:49:07,367 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1120074.6666666667, ans=0.0 2023-10-12 16:49:12,650 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.60 vs. limit=22.5 2023-10-12 16:49:47,502 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1120214.6666666667, ans=0.035 2023-10-12 16:49:47,567 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1120214.6666666667, ans=0.125 2023-10-12 16:49:54,875 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1120261.3333333333, ans=0.125 2023-10-12 16:50:03,196 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1120261.3333333333, ans=0.2 2023-10-12 16:50:05,918 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1120308.0, ans=0.0 2023-10-12 16:50:12,216 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1120308.0, ans=0.125 2023-10-12 16:50:25,530 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1120354.6666666667, ans=0.0 2023-10-12 16:50:28,903 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1120401.3333333333, ans=0.1 2023-10-12 16:50:29,551 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.398e+02 1.702e+02 1.864e+02 2.163e+02 3.610e+02, threshold=3.728e+02, percent-clipped=0.0 2023-10-12 16:51:10,522 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.65 vs. limit=15.0 2023-10-12 16:51:12,075 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1120541.3333333333, ans=0.125 2023-10-12 16:51:15,374 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=10.38 vs. limit=22.5 2023-10-12 16:51:31,058 INFO [train.py:1031] (3/4) Epoch 18, batch 8000, loss[loss=0.1829, simple_loss=0.2737, pruned_loss=0.04603, over 16572.00 frames. ], tot_loss[loss=0.1924, simple_loss=0.2835, pruned_loss=0.05069, over 32216160.26 frames. ], batch size: 219, lr: 1.93e-03, grad_scale: 32.0 2023-10-12 16:51:33,312 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1120634.6666666667, ans=0.0 2023-10-12 16:51:41,633 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.97 vs. limit=15.0 2023-10-12 16:51:44,995 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1120681.3333333333, ans=0.0 2023-10-12 16:51:52,299 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.38 vs. 
limit=15.0 2023-10-12 16:51:56,166 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1120728.0, ans=0.2 2023-10-12 16:52:02,623 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1120774.6666666667, ans=0.125 2023-10-12 16:52:10,443 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1120774.6666666667, ans=0.125 2023-10-12 16:52:27,345 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.401e+02 1.683e+02 1.859e+02 2.109e+02 3.287e+02, threshold=3.719e+02, percent-clipped=0.0 2023-10-12 16:52:35,934 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.75 vs. limit=12.0 2023-10-12 16:52:38,322 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1120914.6666666667, ans=10.0 2023-10-12 16:52:54,617 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1120961.3333333333, ans=0.0 2023-10-12 16:52:56,109 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1120961.3333333333, ans=0.125 2023-10-12 16:52:59,249 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1120961.3333333333, ans=0.2 2023-10-12 16:53:07,923 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1121008.0, ans=0.125 2023-10-12 16:53:09,183 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.76 vs. limit=15.0 2023-10-12 16:53:14,705 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1121054.6666666667, ans=0.1 2023-10-12 16:53:47,612 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.71 vs. 
limit=6.0 2023-10-12 16:54:03,919 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1121241.3333333333, ans=0.125 2023-10-12 16:54:26,334 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.458e+02 1.729e+02 1.866e+02 2.032e+02 2.964e+02, threshold=3.732e+02, percent-clipped=0.0 2023-10-12 16:54:27,509 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1121334.6666666667, ans=0.125 2023-10-12 16:54:47,518 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1121381.3333333333, ans=0.09899494936611666 2023-10-12 16:54:51,502 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1121381.3333333333, ans=0.2 2023-10-12 16:55:07,139 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1121428.0, ans=0.07 2023-10-12 16:55:09,095 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1121428.0, ans=0.0 2023-10-12 16:55:16,947 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1121474.6666666667, ans=0.125 2023-10-12 16:55:19,228 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=13.22 vs. limit=22.5 2023-10-12 16:55:36,276 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1121568.0, ans=0.125 2023-10-12 16:55:59,239 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=1121614.6666666667, ans=0.02 2023-10-12 16:55:59,281 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1121614.6666666667, ans=0.125 2023-10-12 16:56:27,460 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1121754.6666666667, ans=0.125 2023-10-12 16:56:43,493 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.478e+02 1.676e+02 1.814e+02 2.076e+02 2.729e+02, threshold=3.627e+02, percent-clipped=0.0 2023-10-12 16:57:39,509 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1121988.0, ans=0.125 2023-10-12 16:57:56,260 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1122034.6666666667, ans=0.125 2023-10-12 16:58:17,344 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1122128.0, ans=0.125 2023-10-12 16:58:17,380 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-12 16:58:24,673 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1122174.6666666667, ans=0.2 2023-10-12 16:58:52,819 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.437e+02 1.736e+02 1.909e+02 2.151e+02 3.635e+02, threshold=3.819e+02, percent-clipped=1.0 2023-10-12 16:59:03,675 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1122314.6666666667, ans=0.0 2023-10-12 16:59:26,170 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1122408.0, ans=0.0 2023-10-12 16:59:35,493 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1122408.0, ans=0.125 2023-10-12 16:59:41,410 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.35 vs. limit=12.0 2023-10-12 16:59:43,684 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=1122454.6666666667, ans=0.95 2023-10-12 17:00:13,413 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1122548.0, ans=0.125 2023-10-12 17:00:39,015 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1122641.3333333333, ans=0.1 2023-10-12 17:00:49,205 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1122688.0, ans=0.125 2023-10-12 17:00:52,073 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1122688.0, ans=0.0 2023-10-12 17:00:54,675 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1122734.6666666667, ans=0.95 2023-10-12 17:00:58,870 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.438e+02 1.693e+02 1.898e+02 2.104e+02 2.605e+02, threshold=3.795e+02, percent-clipped=0.0 2023-10-12 17:00:59,464 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1122734.6666666667, ans=0.125 2023-10-12 17:01:20,391 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1122781.3333333333, ans=0.0 2023-10-12 17:01:35,634 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1122828.0, ans=0.1 2023-10-12 17:01:46,498 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1122874.6666666667, ans=0.2 2023-10-12 17:01:48,062 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1122874.6666666667, ans=0.125 2023-10-12 17:01:54,480 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1122921.3333333333, ans=0.125 2023-10-12 17:01:59,561 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1122921.3333333333, ans=0.125 2023-10-12 17:02:00,837 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.01 vs. limit=22.5 2023-10-12 17:02:01,421 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 17:02:05,066 INFO [train.py:1031] (3/4) Epoch 18, batch 8500, loss[loss=0.281, simple_loss=0.3394, pruned_loss=0.1113, over 15595.00 frames. 
], tot_loss[loss=0.1927, simple_loss=0.2838, pruned_loss=0.05075, over 32351836.89 frames. ], batch size: 350, lr: 1.92e-03, grad_scale: 32.0 2023-10-12 17:02:05,382 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1122968.0, ans=0.0 2023-10-12 17:02:18,695 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1123014.6666666667, ans=0.1 2023-10-12 17:02:19,788 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1123014.6666666667, ans=0.0 2023-10-12 17:02:37,566 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1123061.3333333333, ans=10.0 2023-10-12 17:02:37,907 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.67 vs. limit=15.0 2023-10-12 17:02:42,663 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.22 vs. limit=15.0 2023-10-12 17:02:45,505 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.41 vs. limit=22.5 2023-10-12 17:02:55,065 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.97 vs. limit=12.0 2023-10-12 17:03:02,976 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1123201.3333333333, ans=0.0 2023-10-12 17:03:03,486 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.506e+02 1.722e+02 1.927e+02 2.226e+02 2.929e+02, threshold=3.854e+02, percent-clipped=0.0 2023-10-12 17:03:21,646 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=9.87 vs. 
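
The tot_loss[... over N frames] figures are not plain epoch means: the frame counts are fractional and keep climbing (31,527,942.91 frames at batch 6500 up to 32,351,836.89 just above), which points to a decayed, frame-weighted running sum rather than a simple average. A sketch of such an accumulator; the decay constant here is an arbitrary illustrative choice:

class RunningLoss:
    """Illustrative decayed, frame-weighted running average."""
    def __init__(self, decay=0.999):
        self.decay = decay
        self.loss_sum = 0.0
        self.frames = 0.0   # becomes fractional once decay is applied

    def update(self, batch_loss, batch_frames):
        self.loss_sum = self.decay * self.loss_sum + batch_loss * batch_frames
        self.frames = self.decay * self.frames + batch_frames

    @property
    def tot_loss(self):
        return self.loss_sum / max(self.frames, 1.0)
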
limit=15.0 2023-10-12 17:03:41,761 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1123341.3333333333, ans=0.2 2023-10-12 17:03:48,192 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1123341.3333333333, ans=0.125 2023-10-12 17:03:57,118 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1123388.0, ans=0.125 2023-10-12 17:04:25,742 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1123481.3333333333, ans=0.125 2023-10-12 17:04:29,267 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1123481.3333333333, ans=0.125 2023-10-12 17:04:29,275 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1123481.3333333333, ans=0.0 2023-10-12 17:04:32,271 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1123528.0, ans=0.0 2023-10-12 17:04:56,065 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1123621.3333333333, ans=0.125 2023-10-12 17:05:03,660 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1123621.3333333333, ans=0.0 2023-10-12 17:05:08,117 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.371e+02 1.740e+02 1.907e+02 2.187e+02 3.066e+02, threshold=3.814e+02, percent-clipped=0.0 2023-10-12 17:05:37,660 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1123761.3333333333, ans=0.125 2023-10-12 17:05:43,952 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1123761.3333333333, ans=0.0 2023-10-12 17:05:50,458 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1123808.0, ans=0.125 2023-10-12 17:05:59,477 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 17:06:03,974 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.39 vs. 
limit=22.5 2023-10-12 17:06:08,778 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1123854.6666666667, ans=0.0 2023-10-12 17:06:10,649 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1123854.6666666667, ans=0.1 2023-10-12 17:06:41,078 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1123994.6666666667, ans=0.125 2023-10-12 17:06:45,661 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1123994.6666666667, ans=0.1 2023-10-12 17:07:07,069 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1124088.0, ans=0.015 2023-10-12 17:07:15,612 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.372e+02 1.712e+02 1.942e+02 2.231e+02 3.364e+02, threshold=3.885e+02, percent-clipped=0.0 2023-10-12 17:07:15,920 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1124134.6666666667, ans=0.0 2023-10-12 17:07:37,594 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1124181.3333333333, ans=0.1 2023-10-12 17:07:45,868 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1124228.0, ans=0.05 2023-10-12 17:07:46,996 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.89 vs. limit=15.0 2023-10-12 17:08:23,021 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.35 vs. limit=6.0 2023-10-12 17:08:33,148 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1124414.6666666667, ans=0.125 2023-10-12 17:08:44,636 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1124461.3333333333, ans=0.125 2023-10-12 17:08:58,300 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1124508.0, ans=0.2 2023-10-12 17:09:03,638 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1124554.6666666667, ans=0.125 2023-10-12 17:09:13,146 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.424e+02 1.690e+02 1.881e+02 2.115e+02 2.777e+02, threshold=3.762e+02, percent-clipped=0.0 2023-10-12 17:09:13,642 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.04 vs. limit=12.0 2023-10-12 17:09:25,442 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.71 vs. 
limit=22.5 2023-10-12 17:09:36,258 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1124694.6666666667, ans=0.1 2023-10-12 17:10:05,106 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1124788.0, ans=0.125 2023-10-12 17:10:45,746 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.00 vs. limit=15.0 2023-10-12 17:10:48,348 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 17:11:03,134 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.454e+02 1.792e+02 1.962e+02 2.297e+02 3.365e+02, threshold=3.923e+02, percent-clipped=0.0 2023-10-12 17:11:07,493 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1125068.0, ans=0.125 2023-10-12 17:11:46,362 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1125254.6666666667, ans=0.0 2023-10-12 17:11:49,623 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1125254.6666666667, ans=0.2 2023-10-12 17:11:54,256 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1125254.6666666667, ans=0.125 2023-10-12 17:11:56,932 INFO [train.py:1031] (3/4) Epoch 18, batch 9000, loss[loss=0.1951, simple_loss=0.2906, pruned_loss=0.04975, over 16856.00 frames. ], tot_loss[loss=0.1923, simple_loss=0.2833, pruned_loss=0.05063, over 32432445.72 frames. ], batch size: 72, lr: 1.92e-03, grad_scale: 32.0 2023-10-12 17:11:59,695 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1125301.3333333333, ans=0.1 2023-10-12 17:12:07,129 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.02 vs. 
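
When skimming a log like this, the per-module ScheduledFloat/Whitening diagnostics can be filtered out and only the train.py:1031 summaries kept. A small parser written against the summary format shown above (the filename is a placeholder, and note that in this wrapped copy some summaries span two physical lines, so run it on the raw single-line log):

import re

SUMMARY = re.compile(
    r"\[train\.py:\d+\].*Epoch (\d+), batch (\d+).*"
    r"tot_loss\[loss=([\d.]+), simple_loss=([\d.]+), pruned_loss=([\d.]+)"
)

def parse_summaries(path):
    """Yield (epoch, batch, loss, simple_loss, pruned_loss) per summary line."""
    with open(path) as f:
        for line in f:
            m = SUMMARY.search(line)
            if m:
                epoch, batch = int(m.group(1)), int(m.group(2))
                yield (epoch, batch, *(float(g) for g in m.groups()[2:]))

# e.g.: for rec in parse_summaries("log-train.txt"): print(rec)
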
limit=15.0 2023-10-12 17:12:41,904 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1125441.3333333333, ans=0.2 2023-10-12 17:12:55,706 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.395e+02 1.843e+02 1.996e+02 2.183e+02 3.098e+02, threshold=3.993e+02, percent-clipped=0.0 2023-10-12 17:12:56,903 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.max_positive, batch_count=1125534.6666666667, ans=0.95 2023-10-12 17:13:17,414 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1125628.0, ans=0.125 2023-10-12 17:13:24,938 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1125674.6666666667, ans=0.125 2023-10-12 17:13:30,087 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1125674.6666666667, ans=0.04949747468305833 2023-10-12 17:13:32,426 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1125674.6666666667, ans=0.0 2023-10-12 17:13:34,116 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1125674.6666666667, ans=0.125 2023-10-12 17:13:47,021 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1125768.0, ans=0.1 2023-10-12 17:14:13,191 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1125861.3333333333, ans=0.125 2023-10-12 17:14:43,909 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1126001.3333333333, ans=0.125 2023-10-12 17:14:44,589 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.489e+02 1.749e+02 1.890e+02 2.092e+02 2.837e+02, threshold=3.781e+02, percent-clipped=0.0 2023-10-12 17:14:47,671 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 17:15:04,658 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.22 vs. limit=15.0 2023-10-12 17:15:23,369 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1126141.3333333333, ans=0.125 2023-10-12 17:15:34,179 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1126188.0, ans=0.2 2023-10-12 17:15:58,727 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.38 vs. limit=6.0 2023-10-12 17:16:13,404 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.66 vs. limit=15.0 2023-10-12 17:16:15,468 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1126374.6666666667, ans=0.1 2023-10-12 17:16:15,687 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.43 vs. 
limit=15.0 2023-10-12 17:16:31,427 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.534e+02 1.814e+02 2.013e+02 2.219e+02 2.979e+02, threshold=4.027e+02, percent-clipped=0.0 2023-10-12 17:16:37,780 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.02 vs. limit=15.0 2023-10-12 17:16:39,353 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1126514.6666666667, ans=0.0 2023-10-12 17:16:55,249 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1126561.3333333333, ans=0.125 2023-10-12 17:17:08,169 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1126608.0, ans=0.125 2023-10-12 17:17:16,225 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1126654.6666666667, ans=0.125 2023-10-12 17:17:19,597 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.23 vs. limit=15.0 2023-10-12 17:17:24,116 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1126701.3333333333, ans=0.05 2023-10-12 17:17:26,790 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1126701.3333333333, ans=0.125 2023-10-12 17:17:29,584 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1126701.3333333333, ans=0.125 2023-10-12 17:17:35,044 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1126748.0, ans=0.1 2023-10-12 17:17:55,492 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=5.78 vs. limit=15.0 2023-10-12 17:18:03,886 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1126841.3333333333, ans=0.2 2023-10-12 17:18:24,915 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.542e+02 1.755e+02 1.927e+02 2.145e+02 3.447e+02, threshold=3.854e+02, percent-clipped=0.0 2023-10-12 17:18:35,861 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.25 vs. 
limit=10.0 2023-10-12 17:18:47,449 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1126981.3333333333, ans=0.2 2023-10-12 17:18:52,105 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=1127028.0, ans=10.0 2023-10-12 17:19:03,145 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1127074.6666666667, ans=0.04949747468305833 2023-10-12 17:19:11,662 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1127074.6666666667, ans=0.1 2023-10-12 17:19:40,149 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.37 vs. limit=15.0 2023-10-12 17:19:48,893 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1127261.3333333333, ans=0.0 2023-10-12 17:19:53,078 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.31 vs. limit=15.0 2023-10-12 17:20:03,227 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1127308.0, ans=0.125 2023-10-12 17:20:10,497 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.66 vs. limit=6.0 2023-10-12 17:20:18,265 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.29 vs. limit=15.0 2023-10-12 17:20:23,470 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1127401.3333333333, ans=0.1 2023-10-12 17:20:28,961 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.367e+02 1.814e+02 2.044e+02 2.273e+02 3.060e+02, threshold=4.088e+02, percent-clipped=0.0 2023-10-12 17:20:31,490 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.91 vs. limit=15.0 2023-10-12 17:21:12,755 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1127541.3333333333, ans=0.2 2023-10-12 17:21:25,815 INFO [train.py:1031] (3/4) Epoch 18, batch 9500, loss[loss=0.2077, simple_loss=0.2874, pruned_loss=0.06398, over 15268.00 frames. ], tot_loss[loss=0.1928, simple_loss=0.284, pruned_loss=0.0508, over 32523992.64 frames. ], batch size: 35, lr: 1.92e-03, grad_scale: 16.0 2023-10-12 17:21:29,178 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1127634.6666666667, ans=0.0 2023-10-12 17:21:40,473 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1127681.3333333333, ans=0.125 2023-10-12 17:21:43,374 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1127681.3333333333, ans=0.2 2023-10-12 17:21:53,323 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.63 vs. 
limit=15.0 2023-10-12 17:21:54,556 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.86 vs. limit=15.0 2023-10-12 17:22:14,687 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1127821.3333333333, ans=0.0 2023-10-12 17:22:28,638 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.570e+02 1.784e+02 1.917e+02 2.234e+02 3.245e+02, threshold=3.833e+02, percent-clipped=0.0 2023-10-12 17:22:29,047 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1127868.0, ans=0.125 2023-10-12 17:22:57,629 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.12 vs. limit=15.0 2023-10-12 17:22:59,975 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1128008.0, ans=0.0 2023-10-12 17:23:04,363 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1128008.0, ans=0.2 2023-10-12 17:23:05,246 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1128008.0, ans=0.125 2023-10-12 17:23:05,254 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1128008.0, ans=0.1 2023-10-12 17:23:08,173 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.16 vs. limit=15.0 2023-10-12 17:23:20,888 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 17:23:47,761 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=3.95 vs. limit=12.0 2023-10-12 17:23:53,825 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=3.85 vs. 
limit=12.0 2023-10-12 17:23:55,428 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1128194.6666666667, ans=0.125 2023-10-12 17:23:58,446 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1128241.3333333333, ans=0.0 2023-10-12 17:24:07,678 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten.whitening_limit, batch_count=1128241.3333333333, ans=15.0 2023-10-12 17:24:14,425 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1128288.0, ans=0.1 2023-10-12 17:24:24,896 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1128334.6666666667, ans=0.125 2023-10-12 17:24:26,909 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1128334.6666666667, ans=0.125 2023-10-12 17:24:26,994 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1128334.6666666667, ans=0.125 2023-10-12 17:24:27,595 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.449e+02 1.708e+02 1.880e+02 2.139e+02 2.628e+02, threshold=3.760e+02, percent-clipped=0.0 2023-10-12 17:24:29,768 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1128334.6666666667, ans=10.0 2023-10-12 17:24:33,153 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1128334.6666666667, ans=0.125 2023-10-12 17:24:33,299 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1128334.6666666667, ans=0.125 2023-10-12 17:24:44,286 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1128381.3333333333, ans=0.1 2023-10-12 17:24:48,277 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.74 vs. limit=15.0 2023-10-12 17:25:18,374 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1128521.3333333333, ans=0.2 2023-10-12 17:25:30,951 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.92 vs. 
limit=15.0 2023-10-12 17:25:34,691 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1128614.6666666667, ans=0.1 2023-10-12 17:25:34,700 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1128614.6666666667, ans=0.2 2023-10-12 17:25:42,588 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1128661.3333333333, ans=0.2 2023-10-12 17:25:52,816 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1128661.3333333333, ans=0.125 2023-10-12 17:25:55,344 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1128708.0, ans=0.0 2023-10-12 17:25:57,468 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1128708.0, ans=0.0 2023-10-12 17:26:20,701 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.297e+02 1.749e+02 1.924e+02 2.107e+02 2.814e+02, threshold=3.848e+02, percent-clipped=0.0 2023-10-12 17:26:34,987 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1128848.0, ans=0.09899494936611666 2023-10-12 17:26:36,395 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1128848.0, ans=0.125 2023-10-12 17:27:17,159 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1129034.6666666667, ans=0.125 2023-10-12 17:27:32,543 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.24 vs. limit=15.0 2023-10-12 17:27:49,946 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1129174.6666666667, ans=0.125 2023-10-12 17:28:04,656 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.35 vs. limit=12.0 2023-10-12 17:28:07,497 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1129221.3333333333, ans=0.04949747468305833 2023-10-12 17:28:10,492 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.79 vs. limit=15.0 2023-10-12 17:28:12,093 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.69 vs. limit=15.0 2023-10-12 17:28:14,289 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.504e+02 1.736e+02 1.885e+02 2.105e+02 2.832e+02, threshold=3.770e+02, percent-clipped=0.0 2023-10-12 17:28:16,994 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1129268.0, ans=0.1 2023-10-12 17:28:28,184 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=6.85 vs. 
limit=12.0 2023-10-12 17:28:39,484 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1129361.3333333333, ans=0.125 2023-10-12 17:28:40,767 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.23 vs. limit=15.0 2023-10-12 17:28:51,904 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1129408.0, ans=0.1 2023-10-12 17:28:56,460 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1129454.6666666667, ans=0.1 2023-10-12 17:28:58,384 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1129454.6666666667, ans=0.0 2023-10-12 17:29:11,591 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1129501.3333333333, ans=0.0 2023-10-12 17:29:24,611 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1129548.0, ans=0.125 2023-10-12 17:29:33,264 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1129594.6666666667, ans=0.125 2023-10-12 17:30:03,496 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.426e+02 1.724e+02 1.936e+02 2.207e+02 2.932e+02, threshold=3.872e+02, percent-clipped=0.0 2023-10-12 17:30:15,313 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1129781.3333333333, ans=0.125 2023-10-12 17:30:15,618 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=3.65 vs. limit=15.0 2023-10-12 17:30:28,123 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1129828.0, ans=0.1 2023-10-12 17:30:34,413 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1129874.6666666667, ans=0.125 2023-10-12 17:30:38,027 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten.whitening_limit, batch_count=1129874.6666666667, ans=15.0 2023-10-12 17:30:53,575 INFO [train.py:1031] (3/4) Epoch 18, batch 10000, loss[loss=0.2337, simple_loss=0.312, pruned_loss=0.0777, over 16521.00 frames. ], tot_loss[loss=0.1922, simple_loss=0.2832, pruned_loss=0.05059, over 32572256.98 frames. 
], batch size: 266, lr: 1.92e-03, grad_scale: 32.0 2023-10-12 17:30:57,678 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1129968.0, ans=0.95 2023-10-12 17:31:03,895 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1130014.6666666667, ans=0.125 2023-10-12 17:31:18,616 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1130061.3333333333, ans=0.0 2023-10-12 17:31:29,158 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1130108.0, ans=0.125 2023-10-12 17:31:32,636 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1130108.0, ans=0.125 2023-10-12 17:31:37,692 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=1130154.6666666667, ans=0.025 2023-10-12 17:31:51,788 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1130201.3333333333, ans=0.5 2023-10-12 17:31:55,071 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.346e+02 1.762e+02 1.909e+02 2.073e+02 3.129e+02, threshold=3.818e+02, percent-clipped=0.0 2023-10-12 17:31:56,330 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1130201.3333333333, ans=0.125 2023-10-12 17:32:14,547 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1130294.6666666667, ans=0.125 2023-10-12 17:33:21,754 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1130481.3333333333, ans=0.1 2023-10-12 17:33:37,825 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.13 vs. limit=15.0 2023-10-12 17:33:47,166 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1130574.6666666667, ans=0.125 2023-10-12 17:33:49,345 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 17:34:07,922 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.437e+02 1.836e+02 2.011e+02 2.219e+02 2.829e+02, threshold=4.023e+02, percent-clipped=0.0 2023-10-12 17:34:10,542 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1130668.0, ans=0.1 2023-10-12 17:34:27,009 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1130761.3333333333, ans=0.125 2023-10-12 17:34:57,379 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.60 vs. limit=15.0 2023-10-12 17:35:08,550 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.26 vs. 
limit=6.0 2023-10-12 17:35:18,173 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1130948.0, ans=0.0 2023-10-12 17:35:18,268 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1130948.0, ans=0.125 2023-10-12 17:35:31,001 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1130994.6666666667, ans=0.0 2023-10-12 17:35:47,297 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1131041.3333333333, ans=0.125 2023-10-12 17:36:32,989 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.447e+02 1.725e+02 1.871e+02 2.081e+02 3.009e+02, threshold=3.741e+02, percent-clipped=0.0 2023-10-12 17:36:35,841 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=7.96 vs. limit=15.0 2023-10-12 17:36:54,558 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1131228.0, ans=0.125 2023-10-12 17:36:59,782 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1131228.0, ans=0.2 2023-10-12 17:37:03,637 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1131274.6666666667, ans=0.125 2023-10-12 17:37:19,994 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.56 vs. limit=10.0 2023-10-12 17:37:37,425 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1131368.0, ans=0.125 2023-10-12 17:37:37,487 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1131368.0, ans=0.2 2023-10-12 17:37:42,075 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.58 vs. limit=15.0 2023-10-12 17:38:05,050 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1131508.0, ans=0.125 2023-10-12 17:38:13,879 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.94 vs. limit=15.0 2023-10-12 17:38:26,637 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1131554.6666666667, ans=0.5 2023-10-12 17:38:34,799 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.330e+02 1.695e+02 1.908e+02 2.104e+02 3.368e+02, threshold=3.817e+02, percent-clipped=0.0 2023-10-12 17:38:45,351 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1131648.0, ans=0.125 2023-10-12 17:38:47,567 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1131648.0, ans=0.2 2023-10-12 17:38:47,828 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.09 vs. 
limit=15.0 2023-10-12 17:39:02,722 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1131741.3333333333, ans=0.125 2023-10-12 17:39:18,570 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1131788.0, ans=0.0 2023-10-12 17:39:42,239 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.43 vs. limit=6.0 2023-10-12 17:39:54,828 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1131928.0, ans=0.2 2023-10-12 17:39:56,980 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1131928.0, ans=0.0 2023-10-12 17:40:36,504 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=1132068.0, ans=10.0 2023-10-12 17:40:37,073 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.335e+02 1.727e+02 1.836e+02 2.090e+02 2.772e+02, threshold=3.671e+02, percent-clipped=0.0 2023-10-12 17:40:37,305 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1132068.0, ans=0.1 2023-10-12 17:40:53,331 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1132161.3333333333, ans=0.0 2023-10-12 17:40:56,044 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=11.90 vs. limit=22.5 2023-10-12 17:40:57,325 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1132161.3333333333, ans=0.125 2023-10-12 17:41:12,466 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.39 vs. limit=6.0 2023-10-12 17:41:26,776 INFO [train.py:1031] (3/4) Epoch 18, batch 10500, loss[loss=0.2326, simple_loss=0.3024, pruned_loss=0.08134, over 15689.00 frames. ], tot_loss[loss=0.1926, simple_loss=0.2837, pruned_loss=0.05076, over 32638965.91 frames. ], batch size: 350, lr: 1.92e-03, grad_scale: 32.0 2023-10-12 17:41:32,283 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1132301.3333333333, ans=0.07 2023-10-12 17:42:39,908 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1132488.0, ans=0.1 2023-10-12 17:42:46,095 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1132534.6666666667, ans=0.125 2023-10-12 17:42:48,408 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.733e+02 1.895e+02 2.115e+02 3.612e+02, threshold=3.790e+02, percent-clipped=0.0 2023-10-12 17:42:55,624 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.76 vs. 
limit=15.0 2023-10-12 17:43:10,807 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1132628.0, ans=0.125 2023-10-12 17:43:19,180 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1132628.0, ans=0.1 2023-10-12 17:43:24,783 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1132674.6666666667, ans=0.0 2023-10-12 17:43:32,862 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1132674.6666666667, ans=0.0 2023-10-12 17:43:42,923 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1132721.3333333333, ans=0.125 2023-10-12 17:43:48,840 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.53 vs. limit=15.0 2023-10-12 17:44:05,951 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1132814.6666666667, ans=10.0 2023-10-12 17:44:24,842 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1132908.0, ans=0.0 2023-10-12 17:44:29,094 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1132908.0, ans=0.125 2023-10-12 17:44:47,222 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1133001.3333333333, ans=0.2 2023-10-12 17:44:51,122 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.437e+02 1.688e+02 1.798e+02 1.960e+02 2.590e+02, threshold=3.597e+02, percent-clipped=0.0 2023-10-12 17:44:51,465 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1133001.3333333333, ans=0.0 2023-10-12 17:44:54,055 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1133048.0, ans=0.125 2023-10-12 17:45:04,010 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1133048.0, ans=0.125 2023-10-12 17:45:20,920 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_ff3.min_abs, batch_count=1133141.3333333333, ans=0.2 2023-10-12 17:45:39,373 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 17:45:43,982 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1133188.0, ans=0.125 2023-10-12 17:45:56,786 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1133234.6666666667, ans=0.125 2023-10-12 17:45:57,808 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1133234.6666666667, ans=0.1 2023-10-12 17:46:03,666 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1133281.3333333333, ans=0.05 2023-10-12 17:46:13,596 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, 
batch_count=1133328.0, ans=0.2 2023-10-12 17:46:15,277 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1133328.0, ans=0.125 2023-10-12 17:46:18,247 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.38 vs. limit=22.5 2023-10-12 17:46:26,161 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1133374.6666666667, ans=0.0 2023-10-12 17:46:34,127 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1133374.6666666667, ans=0.2 2023-10-12 17:46:49,584 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1133421.3333333333, ans=0.0 2023-10-12 17:46:49,624 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1133421.3333333333, ans=0.125 2023-10-12 17:46:57,933 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.502e+02 1.811e+02 2.005e+02 2.308e+02 3.295e+02, threshold=4.010e+02, percent-clipped=0.0 2023-10-12 17:47:25,079 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.51 vs. limit=15.0 2023-10-12 17:47:33,719 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1133608.0, ans=0.125 2023-10-12 17:47:46,286 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1133654.6666666667, ans=0.125 2023-10-12 17:47:49,126 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 17:48:15,153 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1133794.6666666667, ans=0.0 2023-10-12 17:48:20,374 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1133841.3333333333, ans=0.125 2023-10-12 17:48:47,192 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.541e+02 1.770e+02 2.007e+02 2.260e+02 3.002e+02, threshold=4.014e+02, percent-clipped=0.0 2023-10-12 17:48:59,786 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1133981.3333333333, ans=0.125 2023-10-12 17:49:11,049 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1134028.0, ans=0.125 2023-10-12 17:49:27,255 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1134074.6666666667, ans=0.1 2023-10-12 17:49:59,133 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1134214.6666666667, ans=10.0 2023-10-12 17:50:01,840 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1134214.6666666667, ans=0.0 2023-10-12 17:50:05,081 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.81 vs. 
limit=15.0 2023-10-12 17:50:05,454 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1134214.6666666667, ans=0.125 2023-10-12 17:50:06,755 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.63 vs. limit=22.5 2023-10-12 17:50:26,217 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.56 vs. limit=15.0 2023-10-12 17:50:28,428 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1134308.0, ans=0.0 2023-10-12 17:50:30,235 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1134308.0, ans=0.0 2023-10-12 17:50:34,243 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.73 vs. limit=15.0 2023-10-12 17:50:49,440 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.326e+02 1.657e+02 1.876e+02 2.024e+02 3.344e+02, threshold=3.753e+02, percent-clipped=0.0 2023-10-12 17:50:59,464 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1134448.0, ans=0.0 2023-10-12 17:50:59,679 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.92 vs. limit=15.0 2023-10-12 17:51:01,484 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.65 vs. limit=6.0 2023-10-12 17:51:09,505 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1134494.6666666667, ans=0.125 2023-10-12 17:51:15,462 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 17:51:16,696 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1134541.3333333333, ans=0.2 2023-10-12 17:51:21,894 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1134541.3333333333, ans=0.125 2023-10-12 17:51:29,613 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=3.76 vs. limit=12.0 2023-10-12 17:51:39,254 INFO [train.py:1031] (3/4) Epoch 18, batch 11000, loss[loss=0.1802, simple_loss=0.283, pruned_loss=0.0387, over 16915.00 frames. ], tot_loss[loss=0.1927, simple_loss=0.2837, pruned_loss=0.05082, over 32651088.69 frames. 
], batch size: 104, lr: 1.91e-03, grad_scale: 32.0 2023-10-12 17:51:50,797 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1134681.3333333333, ans=0.0 2023-10-12 17:51:52,711 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1134681.3333333333, ans=0.125 2023-10-12 17:52:05,882 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1134728.0, ans=0.09899494936611666 2023-10-12 17:52:16,190 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1134774.6666666667, ans=0.1 2023-10-12 17:52:17,402 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1134774.6666666667, ans=0.125 2023-10-12 17:52:20,120 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1134774.6666666667, ans=0.2 2023-10-12 17:52:20,492 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.36 vs. limit=15.0 2023-10-12 17:52:35,578 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1134868.0, ans=0.0 2023-10-12 17:52:42,310 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.527e+02 1.782e+02 1.975e+02 2.259e+02 3.388e+02, threshold=3.951e+02, percent-clipped=0.0 2023-10-12 17:52:49,750 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1134914.6666666667, ans=0.125 2023-10-12 17:53:00,929 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.41 vs. limit=10.0 2023-10-12 17:53:47,310 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1135101.3333333333, ans=0.125 2023-10-12 17:54:14,911 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1135241.3333333333, ans=0.1 2023-10-12 17:54:21,100 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1135241.3333333333, ans=0.125 2023-10-12 17:54:30,304 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.94 vs. 
limit=22.5 2023-10-12 17:54:33,282 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1135288.0, ans=0.125 2023-10-12 17:54:47,543 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1135334.6666666667, ans=0.2 2023-10-12 17:54:50,665 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.408e+02 1.639e+02 1.818e+02 1.970e+02 3.123e+02, threshold=3.636e+02, percent-clipped=0.0 2023-10-12 17:55:01,277 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1135381.3333333333, ans=0.125 2023-10-12 17:55:02,266 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1135381.3333333333, ans=0.125 2023-10-12 17:55:13,477 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1135428.0, ans=0.125 2023-10-12 17:55:13,588 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1135428.0, ans=0.07 2023-10-12 17:55:18,066 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1135474.6666666667, ans=0.125 2023-10-12 17:55:18,143 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1135474.6666666667, ans=0.2 2023-10-12 17:55:30,228 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1135521.3333333333, ans=0.125 2023-10-12 17:55:30,346 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1135521.3333333333, ans=0.0 2023-10-12 17:55:40,028 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1135568.0, ans=0.125 2023-10-12 17:55:46,438 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1135568.0, ans=0.125 2023-10-12 17:55:52,329 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1135614.6666666667, ans=0.125 2023-10-12 17:55:55,079 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.37 vs. 
limit=10.0 2023-10-12 17:55:55,758 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1135614.6666666667, ans=0.0 2023-10-12 17:56:08,152 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1135661.3333333333, ans=0.2 2023-10-12 17:56:10,035 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1135661.3333333333, ans=0.0 2023-10-12 17:56:40,996 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1135801.3333333333, ans=0.125 2023-10-12 17:56:41,688 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.449e+02 1.690e+02 1.832e+02 2.010e+02 2.819e+02, threshold=3.664e+02, percent-clipped=0.0 2023-10-12 17:56:55,455 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1135848.0, ans=0.0 2023-10-12 17:56:58,749 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1135894.6666666667, ans=0.125 2023-10-12 17:57:13,775 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1135941.3333333333, ans=0.2 2023-10-12 17:57:23,001 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.49 vs. limit=10.0 2023-10-12 17:57:44,435 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1136034.6666666667, ans=0.125 2023-10-12 17:57:48,779 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1136081.3333333333, ans=0.125 2023-10-12 17:57:58,976 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1136081.3333333333, ans=0.125 2023-10-12 17:58:19,195 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 17:58:23,020 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1136221.3333333333, ans=0.125 2023-10-12 17:58:31,358 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 17:58:34,956 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1136221.3333333333, ans=0.0 2023-10-12 17:58:37,972 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1136268.0, ans=0.0 2023-10-12 17:58:42,195 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.11 vs. 
limit=22.5 2023-10-12 17:58:42,529 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.446e+02 1.738e+02 1.857e+02 2.075e+02 3.104e+02, threshold=3.714e+02, percent-clipped=0.0 2023-10-12 17:58:45,480 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1136268.0, ans=0.1 2023-10-12 17:58:45,745 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.39 vs. limit=22.5 2023-10-12 17:58:57,275 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1136314.6666666667, ans=0.04949747468305833 2023-10-12 17:59:04,448 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1136361.3333333333, ans=0.125 2023-10-12 17:59:17,724 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1136454.6666666667, ans=0.125 2023-10-12 17:59:28,637 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.17 vs. limit=15.0 2023-10-12 17:59:57,841 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1136594.6666666667, ans=0.0 2023-10-12 18:00:12,357 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1136641.3333333333, ans=0.1 2023-10-12 18:00:26,402 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1136688.0, ans=0.125 2023-10-12 18:00:35,828 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.438e+02 1.809e+02 2.024e+02 2.403e+02 3.192e+02, threshold=4.048e+02, percent-clipped=0.0 2023-10-12 18:00:46,464 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1136781.3333333333, ans=0.125 2023-10-12 18:00:46,489 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1136781.3333333333, ans=0.1 2023-10-12 18:00:53,621 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.43 vs. 
limit=15.0 2023-10-12 18:00:56,996 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1136828.0, ans=0.1 2023-10-12 18:00:58,910 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1136828.0, ans=0.0 2023-10-12 18:00:58,979 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1136828.0, ans=0.125 2023-10-12 18:00:59,010 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1136828.0, ans=0.2 2023-10-12 18:01:08,379 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1136874.6666666667, ans=0.125 2023-10-12 18:01:10,584 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1136874.6666666667, ans=0.125 2023-10-12 18:01:27,056 INFO [train.py:1031] (3/4) Epoch 18, batch 11500, loss[loss=0.1679, simple_loss=0.245, pruned_loss=0.04544, over 12734.00 frames. ], tot_loss[loss=0.1923, simple_loss=0.2833, pruned_loss=0.05061, over 32669327.16 frames. ], batch size: 440, lr: 1.91e-03, grad_scale: 32.0 2023-10-12 18:01:31,864 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.80 vs. limit=22.5 2023-10-12 18:01:32,474 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1136968.0, ans=0.2 2023-10-12 18:01:33,491 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1136968.0, ans=0.0 2023-10-12 18:01:36,027 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1136968.0, ans=0.0 2023-10-12 18:02:00,125 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1137108.0, ans=0.0 2023-10-12 18:02:24,947 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.57 vs. 
limit=15.0 2023-10-12 18:02:33,201 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.495e+02 1.735e+02 1.907e+02 2.151e+02 2.779e+02, threshold=3.814e+02, percent-clipped=0.0 2023-10-12 18:02:47,064 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 18:02:55,559 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1137294.6666666667, ans=0.125 2023-10-12 18:03:17,834 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1137388.0, ans=0.125 2023-10-12 18:03:42,806 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1137481.3333333333, ans=0.2 2023-10-12 18:03:52,114 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1137528.0, ans=0.125 2023-10-12 18:04:03,333 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1137574.6666666667, ans=0.09899494936611666 2023-10-12 18:04:11,290 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1137574.6666666667, ans=0.1 2023-10-12 18:04:19,708 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.56 vs. limit=15.0 2023-10-12 18:04:29,491 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.394e+02 1.675e+02 1.787e+02 1.966e+02 2.902e+02, threshold=3.574e+02, percent-clipped=0.0 2023-10-12 18:04:30,216 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.16 vs. limit=12.0 2023-10-12 18:04:30,727 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1137668.0, ans=0.125 2023-10-12 18:04:48,128 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1137761.3333333333, ans=0.1 2023-10-12 18:04:56,030 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1137808.0, ans=0.125 2023-10-12 18:05:31,598 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1137948.0, ans=0.07 2023-10-12 18:05:41,992 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.16 vs. limit=22.5 2023-10-12 18:05:51,931 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.29 vs. 
limit=15.0 2023-10-12 18:05:52,652 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1138041.3333333333, ans=0.125 2023-10-12 18:06:00,791 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1138088.0, ans=0.125 2023-10-12 18:06:13,415 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1138134.6666666667, ans=0.125 2023-10-12 18:06:22,445 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.476e+02 1.714e+02 1.887e+02 2.320e+02 2.923e+02, threshold=3.773e+02, percent-clipped=0.0 2023-10-12 18:06:28,142 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1138181.3333333333, ans=0.1 2023-10-12 18:06:28,510 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.27 vs. limit=15.0 2023-10-12 18:06:32,499 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1138181.3333333333, ans=0.125 2023-10-12 18:06:39,313 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1138181.3333333333, ans=0.2 2023-10-12 18:06:56,387 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1138228.0, ans=0.05 2023-10-12 18:06:57,635 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.95 vs. limit=15.0 2023-10-12 18:06:59,913 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1138274.6666666667, ans=0.125 2023-10-12 18:07:07,205 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1138274.6666666667, ans=0.125 2023-10-12 18:07:13,366 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1138321.3333333333, ans=0.125 2023-10-12 18:07:18,204 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.59 vs. limit=6.0 2023-10-12 18:07:30,246 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.22 vs. 
limit=15.0 2023-10-12 18:07:37,368 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1138414.6666666667, ans=0.1 2023-10-12 18:07:47,374 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1138461.3333333333, ans=0.125 2023-10-12 18:07:59,617 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1138508.0, ans=0.125 2023-10-12 18:08:02,143 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1138508.0, ans=0.0 2023-10-12 18:08:04,791 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1138508.0, ans=0.125 2023-10-12 18:08:06,998 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=10.24 vs. limit=22.5 2023-10-12 18:08:30,312 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.333e+02 1.715e+02 1.855e+02 2.111e+02 2.794e+02, threshold=3.710e+02, percent-clipped=0.0 2023-10-12 18:08:33,116 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=13.55 vs. limit=22.5 2023-10-12 18:08:48,295 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1138694.6666666667, ans=0.1 2023-10-12 18:08:52,802 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1138694.6666666667, ans=0.0 2023-10-12 18:08:59,659 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.58 vs. limit=22.5 2023-10-12 18:09:03,720 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=3.98 vs. limit=10.0 2023-10-12 18:09:29,660 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1138834.6666666667, ans=0.125 2023-10-12 18:09:32,940 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1138834.6666666667, ans=0.125 2023-10-12 18:09:32,950 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1138834.6666666667, ans=0.0 2023-10-12 18:09:39,468 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1138881.3333333333, ans=0.2 2023-10-12 18:10:04,173 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1138974.6666666667, ans=0.125 2023-10-12 18:10:06,980 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1139021.3333333333, ans=0.07 2023-10-12 18:10:18,818 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-12 18:10:26,103 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.15 vs. 
limit=10.0 2023-10-12 18:10:28,443 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.405e+02 1.760e+02 1.935e+02 2.106e+02 3.226e+02, threshold=3.870e+02, percent-clipped=0.0 2023-10-12 18:10:31,414 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1139068.0, ans=0.2 2023-10-12 18:10:32,303 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1139114.6666666667, ans=0.125 2023-10-12 18:10:40,972 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1139114.6666666667, ans=0.1 2023-10-12 18:10:46,112 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1139161.3333333333, ans=0.1 2023-10-12 18:10:55,709 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.42 vs. limit=22.5 2023-10-12 18:10:59,205 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1139208.0, ans=0.0 2023-10-12 18:11:07,734 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.84 vs. limit=22.5 2023-10-12 18:11:19,484 INFO [train.py:1031] (3/4) Epoch 18, batch 12000, loss[loss=0.1939, simple_loss=0.2864, pruned_loss=0.05073, over 17039.00 frames. ], tot_loss[loss=0.192, simple_loss=0.2834, pruned_loss=0.05028, over 32723852.04 frames. ], batch size: 117, lr: 1.91e-03, grad_scale: 32.0 2023-10-12 18:11:48,905 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1139394.6666666667, ans=0.07 2023-10-12 18:11:49,873 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1139394.6666666667, ans=0.0 2023-10-12 18:12:07,793 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1139488.0, ans=0.125 2023-10-12 18:12:13,829 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.01 vs. limit=15.0 2023-10-12 18:12:14,779 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1139488.0, ans=0.0 2023-10-12 18:12:17,987 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.97 vs. 
limit=15.0 2023-10-12 18:12:22,435 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1139534.6666666667, ans=0.125 2023-10-12 18:12:24,063 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.414e+02 1.684e+02 1.843e+02 2.039e+02 3.682e+02, threshold=3.687e+02, percent-clipped=0.0 2023-10-12 18:12:31,011 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1139581.3333333333, ans=0.0 2023-10-12 18:12:46,392 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1139628.0, ans=0.1 2023-10-12 18:12:48,044 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.31 vs. limit=15.0 2023-10-12 18:13:17,057 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1139768.0, ans=0.125 2023-10-12 18:13:50,000 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1139908.0, ans=0.125 2023-10-12 18:13:56,667 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1139908.0, ans=10.0 2023-10-12 18:14:05,057 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1139954.6666666667, ans=0.0 2023-10-12 18:14:17,252 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1140001.3333333333, ans=0.125 2023-10-12 18:14:25,302 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 18:14:29,449 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.301e+02 1.712e+02 1.882e+02 2.059e+02 3.013e+02, threshold=3.765e+02, percent-clipped=0.0 2023-10-12 18:15:03,660 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=5.19 vs. limit=15.0 2023-10-12 18:15:18,351 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1140141.3333333333, ans=0.0 2023-10-12 18:16:11,943 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=1140374.6666666667, ans=0.025 2023-10-12 18:16:13,805 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1140374.6666666667, ans=0.125 2023-10-12 18:16:13,909 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.50 vs. 
limit=22.5 2023-10-12 18:16:31,779 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1140468.0, ans=0.0 2023-10-12 18:16:38,388 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.471e+02 1.739e+02 1.922e+02 2.081e+02 2.744e+02, threshold=3.844e+02, percent-clipped=0.0 2023-10-12 18:17:02,855 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1140561.3333333333, ans=0.125 2023-10-12 18:17:04,007 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1140608.0, ans=0.025 2023-10-12 18:17:26,784 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1140701.3333333333, ans=0.125 2023-10-12 18:17:30,200 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1140701.3333333333, ans=0.2 2023-10-12 18:17:39,284 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1140748.0, ans=0.0 2023-10-12 18:17:40,308 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1140748.0, ans=0.125 2023-10-12 18:17:54,862 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1140794.6666666667, ans=0.125 2023-10-12 18:18:13,139 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1140888.0, ans=0.0 2023-10-12 18:18:15,730 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.12 vs. limit=15.0 2023-10-12 18:18:20,473 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1140888.0, ans=0.125 2023-10-12 18:18:33,044 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.489e+02 1.787e+02 1.953e+02 2.241e+02 3.560e+02, threshold=3.905e+02, percent-clipped=0.0 2023-10-12 18:18:39,527 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1140981.3333333333, ans=0.125 2023-10-12 18:18:50,656 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.60 vs. limit=15.0 2023-10-12 18:18:58,291 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.83 vs. limit=15.0 2023-10-12 18:19:08,942 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.06 vs. limit=15.0 2023-10-12 18:19:17,128 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1141121.3333333333, ans=0.05 2023-10-12 18:19:26,497 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1141168.0, ans=0.125 2023-10-12 18:19:34,338 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.18 vs. 
limit=15.0 2023-10-12 18:19:58,003 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1141308.0, ans=0.0 2023-10-12 18:20:08,820 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1141354.6666666667, ans=0.125 2023-10-12 18:20:08,890 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1141354.6666666667, ans=0.0 2023-10-12 18:20:22,083 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1141401.3333333333, ans=0.0 2023-10-12 18:20:27,249 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.550e+02 1.714e+02 1.904e+02 2.053e+02 3.023e+02, threshold=3.809e+02, percent-clipped=0.0 2023-10-12 18:20:36,172 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1141448.0, ans=0.0 2023-10-12 18:20:43,719 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1141494.6666666667, ans=0.2 2023-10-12 18:20:54,348 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1141541.3333333333, ans=0.04949747468305833 2023-10-12 18:20:56,517 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=25.12 vs. limit=22.5 2023-10-12 18:20:58,565 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.27 vs. limit=15.0 2023-10-12 18:21:05,079 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.75 vs. limit=15.0 2023-10-12 18:21:09,037 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1141588.0, ans=0.0 2023-10-12 18:21:13,944 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1141588.0, ans=0.0 2023-10-12 18:21:17,778 INFO [train.py:1031] (3/4) Epoch 18, batch 12500, loss[loss=0.179, simple_loss=0.2787, pruned_loss=0.0396, over 16327.00 frames. ], tot_loss[loss=0.192, simple_loss=0.2831, pruned_loss=0.05039, over 32742301.99 frames. ], batch size: 50, lr: 1.91e-03, grad_scale: 32.0 2023-10-12 18:21:31,636 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1141681.3333333333, ans=0.125 2023-10-12 18:21:47,363 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1141728.0, ans=0.2 2023-10-12 18:22:19,690 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.409e+02 1.838e+02 2.086e+02 2.392e+02 3.402e+02, threshold=4.172e+02, percent-clipped=0.0 2023-10-12 18:22:33,172 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.41 vs. 
limit=6.0 2023-10-12 18:22:38,334 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1141961.3333333333, ans=0.125 2023-10-12 18:22:46,565 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1142008.0, ans=0.2 2023-10-12 18:23:02,676 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=1142054.6666666667, ans=0.05 2023-10-12 18:23:02,720 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1142054.6666666667, ans=0.1 2023-10-12 18:23:22,478 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1142148.0, ans=0.0 2023-10-12 18:23:32,990 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1142194.6666666667, ans=0.125 2023-10-12 18:23:50,151 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1142241.3333333333, ans=0.0 2023-10-12 18:24:10,651 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1142334.6666666667, ans=0.2 2023-10-12 18:24:12,504 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.382e+02 1.712e+02 1.913e+02 2.162e+02 2.701e+02, threshold=3.825e+02, percent-clipped=0.0 2023-10-12 18:24:29,180 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1142428.0, ans=0.07 2023-10-12 18:24:50,532 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.79 vs. limit=15.0 2023-10-12 18:24:51,431 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.00 vs. limit=22.5 2023-10-12 18:24:57,100 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1142521.3333333333, ans=0.125 2023-10-12 18:25:01,986 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 18:25:38,015 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1142708.0, ans=0.0 2023-10-12 18:25:47,244 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.38 vs. limit=15.0 2023-10-12 18:25:57,002 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.65 vs. limit=15.0 2023-10-12 18:26:09,594 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.26 vs. 
limit=12.0 2023-10-12 18:26:09,880 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.348e+02 1.728e+02 1.889e+02 2.100e+02 2.775e+02, threshold=3.778e+02, percent-clipped=0.0 2023-10-12 18:26:45,180 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1142941.3333333333, ans=0.125 2023-10-12 18:27:45,877 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 18:28:05,385 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.496e+02 1.725e+02 1.888e+02 2.146e+02 3.261e+02, threshold=3.777e+02, percent-clipped=0.0 2023-10-12 18:28:39,901 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1143408.0, ans=0.1 2023-10-12 18:28:52,783 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1143454.6666666667, ans=0.125 2023-10-12 18:29:07,458 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1143501.3333333333, ans=0.125 2023-10-12 18:29:15,443 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.33 vs. limit=12.0 2023-10-12 18:29:28,364 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.57 vs. limit=6.0 2023-10-12 18:29:37,128 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-12 18:30:03,745 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.429e+02 1.739e+02 1.923e+02 2.098e+02 3.034e+02, threshold=3.847e+02, percent-clipped=0.0 2023-10-12 18:30:05,686 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=1143781.3333333333, ans=0.05 2023-10-12 18:30:12,388 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1143781.3333333333, ans=0.1 2023-10-12 18:30:21,841 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1143828.0, ans=0.1 2023-10-12 18:30:23,807 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1143828.0, ans=0.0 2023-10-12 18:30:35,870 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1143874.6666666667, ans=0.125 2023-10-12 18:30:37,085 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.39 vs. limit=22.5 2023-10-12 18:30:42,904 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1143921.3333333333, ans=0.1 2023-10-12 18:30:48,184 INFO [train.py:1031] (3/4) Epoch 18, batch 13000, loss[loss=0.1864, simple_loss=0.2819, pruned_loss=0.04543, over 16655.00 frames. ], tot_loss[loss=0.1925, simple_loss=0.2837, pruned_loss=0.0506, over 32711025.22 frames. 
], batch size: 61, lr: 1.91e-03, grad_scale: 32.0 2023-10-12 18:30:53,791 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1143968.0, ans=0.1 2023-10-12 18:31:07,861 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1144014.6666666667, ans=0.125 2023-10-12 18:31:31,026 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.28 vs. limit=6.0 2023-10-12 18:32:00,605 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.50 vs. limit=22.5 2023-10-12 18:32:05,877 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.467e+02 1.742e+02 1.891e+02 2.120e+02 3.097e+02, threshold=3.781e+02, percent-clipped=0.0 2023-10-12 18:32:21,159 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1144294.6666666667, ans=0.1 2023-10-12 18:32:21,166 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1144294.6666666667, ans=0.125 2023-10-12 18:32:31,804 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=5.96 vs. limit=15.0 2023-10-12 18:32:33,541 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1144341.3333333333, ans=0.0 2023-10-12 18:32:34,506 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1144341.3333333333, ans=0.0 2023-10-12 18:32:36,492 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.37 vs. limit=15.0 2023-10-12 18:32:51,231 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1144388.0, ans=0.07 2023-10-12 18:32:52,116 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1144388.0, ans=0.125 2023-10-12 18:33:10,817 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1144481.3333333333, ans=0.2 2023-10-12 18:33:47,839 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1144621.3333333333, ans=0.04949747468305833 2023-10-12 18:33:59,926 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1144668.0, ans=0.1 2023-10-12 18:34:00,547 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.402e+02 1.658e+02 1.854e+02 2.045e+02 3.100e+02, threshold=3.708e+02, percent-clipped=0.0 2023-10-12 18:34:03,554 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=7.71 vs. 
limit=12.0 2023-10-12 18:34:04,588 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=1144714.6666666667, ans=15.0 2023-10-12 18:34:18,294 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1144761.3333333333, ans=0.125 2023-10-12 18:34:32,130 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1144808.0, ans=0.0 2023-10-12 18:34:42,079 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.49 vs. limit=12.0 2023-10-12 18:34:59,327 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1144901.3333333333, ans=0.1 2023-10-12 18:35:15,458 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.82 vs. limit=15.0 2023-10-12 18:35:17,379 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1144948.0, ans=0.125 2023-10-12 18:35:27,991 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1144994.6666666667, ans=0.0 2023-10-12 18:35:33,257 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1145041.3333333333, ans=0.1 2023-10-12 18:35:49,595 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1145088.0, ans=0.125 2023-10-12 18:36:04,256 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.388e+02 1.699e+02 1.892e+02 2.071e+02 2.784e+02, threshold=3.785e+02, percent-clipped=0.0 2023-10-12 18:36:35,912 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1145274.6666666667, ans=0.125 2023-10-12 18:36:46,501 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1145321.3333333333, ans=0.125 2023-10-12 18:36:53,598 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.68 vs. 
limit=22.5 2023-10-12 18:37:01,804 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1145368.0, ans=0.0 2023-10-12 18:37:09,465 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1145414.6666666667, ans=0.1 2023-10-12 18:37:26,977 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1145508.0, ans=0.125 2023-10-12 18:37:30,623 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1145508.0, ans=0.125 2023-10-12 18:37:33,853 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1145508.0, ans=0.125 2023-10-12 18:37:56,851 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1145601.3333333333, ans=0.125 2023-10-12 18:37:58,237 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.439e+02 1.739e+02 1.888e+02 2.031e+02 3.036e+02, threshold=3.777e+02, percent-clipped=0.0 2023-10-12 18:38:12,237 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.67 vs. limit=15.0 2023-10-12 18:38:27,278 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1145741.3333333333, ans=0.1 2023-10-12 18:38:32,696 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1145788.0, ans=0.0 2023-10-12 18:38:52,804 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1145834.6666666667, ans=0.0 2023-10-12 18:38:53,008 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.35 vs. limit=22.5 2023-10-12 18:39:13,731 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1145928.0, ans=0.125 2023-10-12 18:39:15,781 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.02 vs. 
limit=12.0 2023-10-12 18:39:19,804 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1145974.6666666667, ans=0.125 2023-10-12 18:39:35,430 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1146021.3333333333, ans=0.2 2023-10-12 18:39:37,456 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1146021.3333333333, ans=0.07 2023-10-12 18:39:50,317 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.469e+02 1.770e+02 1.936e+02 2.172e+02 3.199e+02, threshold=3.872e+02, percent-clipped=0.0 2023-10-12 18:39:56,222 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 18:39:57,834 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1146114.6666666667, ans=0.0 2023-10-12 18:39:58,850 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1146114.6666666667, ans=0.125 2023-10-12 18:40:17,009 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1146208.0, ans=0.0 2023-10-12 18:40:35,971 INFO [train.py:1031] (3/4) Epoch 18, batch 13500, loss[loss=0.2432, simple_loss=0.314, pruned_loss=0.08621, over 15697.00 frames. ], tot_loss[loss=0.192, simple_loss=0.2832, pruned_loss=0.05035, over 32750862.51 frames. ], batch size: 350, lr: 1.91e-03, grad_scale: 32.0 2023-10-12 18:40:49,066 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1146348.0, ans=0.125 2023-10-12 18:41:07,617 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1146394.6666666667, ans=0.0 2023-10-12 18:41:41,608 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.398e+02 1.736e+02 1.972e+02 2.196e+02 3.311e+02, threshold=3.944e+02, percent-clipped=0.0 2023-10-12 18:41:51,547 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.73 vs. limit=15.0 2023-10-12 18:42:09,667 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1146674.6666666667, ans=0.125 2023-10-12 18:42:18,626 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1146721.3333333333, ans=0.0 2023-10-12 18:42:22,514 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1146721.3333333333, ans=0.0 2023-10-12 18:42:27,875 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.22 vs. limit=15.0 2023-10-12 18:42:55,575 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1146861.3333333333, ans=0.0 2023-10-12 18:42:56,569 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_na.min_abs, batch_count=1146908.0, ans=0.02 2023-10-12 18:43:05,146 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.95 vs. 
limit=22.5 2023-10-12 18:43:09,409 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1146954.6666666667, ans=0.2 2023-10-12 18:43:16,432 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.55 vs. limit=15.0 2023-10-12 18:43:55,192 INFO [train.py:1031] (3/4) Epoch 19, batch 0, loss[loss=0.165, simple_loss=0.2557, pruned_loss=0.03712, over 16964.00 frames. ], tot_loss[loss=0.165, simple_loss=0.2557, pruned_loss=0.03712, over 16964.00 frames. ], batch size: 117, lr: 1.85e-03, grad_scale: 32.0 2023-10-12 18:43:55,192 INFO [train.py:1054] (3/4) Computing validation loss 2023-10-12 18:44:03,016 INFO [train.py:1063] (3/4) Epoch 19, validation: loss=0.2139, simple_loss=0.301, pruned_loss=0.06343, over 1020973.00 frames. 2023-10-12 18:44:03,016 INFO [train.py:1064] (3/4) Maximum memory allocated so far is 16891MB 2023-10-12 18:44:07,078 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.462e+02 1.748e+02 1.918e+02 2.200e+02 3.068e+02, threshold=3.835e+02, percent-clipped=0.0 2023-10-12 18:44:42,879 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1147164.6666666667, ans=0.125 2023-10-12 18:44:46,920 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1147164.6666666667, ans=0.05 2023-10-12 18:44:53,475 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.07 vs. limit=15.0 2023-10-12 18:44:59,786 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1147258.0, ans=0.125 2023-10-12 18:45:07,065 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1147258.0, ans=0.125 2023-10-12 18:45:09,695 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1147258.0, ans=0.07 2023-10-12 18:45:09,702 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1147258.0, ans=0.0 2023-10-12 18:45:21,808 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1147304.6666666667, ans=0.0 2023-10-12 18:45:36,527 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1147398.0, ans=0.1 2023-10-12 18:45:39,360 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=1147398.0, ans=0.05 2023-10-12 18:46:03,173 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.426e+02 1.654e+02 1.823e+02 1.995e+02 2.699e+02, threshold=3.645e+02, percent-clipped=0.0 2023-10-12 18:46:08,052 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1147538.0, ans=0.0 2023-10-12 18:46:09,894 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1147538.0, ans=0.125 2023-10-12 18:46:20,896 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1147584.6666666667, ans=0.2 2023-10-12 18:46:36,014 INFO 
[scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1147631.3333333333, ans=0.125 2023-10-12 18:46:46,405 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.88 vs. limit=15.0 2023-10-12 18:46:46,464 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.62 vs. limit=15.0 2023-10-12 18:47:04,409 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1147771.3333333333, ans=0.0 2023-10-12 18:47:05,413 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1147771.3333333333, ans=0.125 2023-10-12 18:47:25,014 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1147864.6666666667, ans=0.1 2023-10-12 18:47:31,541 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1147864.6666666667, ans=0.0 2023-10-12 18:47:33,424 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1147864.6666666667, ans=0.125 2023-10-12 18:47:41,598 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1147911.3333333333, ans=0.0 2023-10-12 18:47:41,706 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1147911.3333333333, ans=0.0 2023-10-12 18:47:45,046 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1147911.3333333333, ans=0.0 2023-10-12 18:47:48,762 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1147958.0, ans=0.0 2023-10-12 18:47:50,507 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1147958.0, ans=0.125 2023-10-12 18:47:53,350 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1147958.0, ans=0.1 2023-10-12 18:47:53,690 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.04 vs. 
limit=15.0 2023-10-12 18:47:53,997 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.388e+02 1.775e+02 1.952e+02 2.207e+02 2.966e+02, threshold=3.904e+02, percent-clipped=0.0 2023-10-12 18:47:58,269 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1147958.0, ans=0.125 2023-10-12 18:48:16,969 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1148051.3333333333, ans=0.125 2023-10-12 18:48:26,785 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=1148098.0, ans=0.5 2023-10-12 18:48:47,789 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1148144.6666666667, ans=0.125 2023-10-12 18:48:53,726 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1148191.3333333333, ans=0.125 2023-10-12 18:48:57,463 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1148191.3333333333, ans=0.0 2023-10-12 18:49:02,294 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1148238.0, ans=0.0 2023-10-12 18:49:07,571 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.96 vs. limit=15.0 2023-10-12 18:49:08,675 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.28 vs. limit=12.0 2023-10-12 18:49:27,349 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1148331.3333333333, ans=0.07 2023-10-12 18:49:33,521 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1148378.0, ans=0.0 2023-10-12 18:49:34,693 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1148378.0, ans=0.125 2023-10-12 18:49:48,361 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.400e+02 1.669e+02 1.883e+02 2.102e+02 2.824e+02, threshold=3.767e+02, percent-clipped=0.0 2023-10-12 18:49:48,727 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1148424.6666666667, ans=10.0 2023-10-12 18:50:09,966 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1148518.0, ans=0.1 2023-10-12 18:50:22,465 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.65 vs. 
limit=10.0 2023-10-12 18:50:30,020 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1148611.3333333333, ans=0.125 2023-10-12 18:50:30,965 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1148611.3333333333, ans=0.125 2023-10-12 18:50:44,284 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1148658.0, ans=0.2 2023-10-12 18:51:01,838 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.29 vs. limit=12.0 2023-10-12 18:51:06,411 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 18:51:08,178 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1148798.0, ans=0.125 2023-10-12 18:51:14,189 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1148798.0, ans=0.2 2023-10-12 18:51:24,687 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1148844.6666666667, ans=0.125 2023-10-12 18:51:30,456 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1148891.3333333333, ans=0.125 2023-10-12 18:51:35,481 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.473e+02 1.774e+02 1.930e+02 2.181e+02 3.236e+02, threshold=3.860e+02, percent-clipped=0.0 2023-10-12 18:51:37,065 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1148891.3333333333, ans=0.125 2023-10-12 18:51:43,904 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1148938.0, ans=0.0 2023-10-12 18:52:31,946 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.97 vs. 
limit=15.0 2023-10-12 18:52:50,529 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1149218.0, ans=0.1 2023-10-12 18:52:50,561 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1149218.0, ans=0.2 2023-10-12 18:52:51,461 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1149218.0, ans=0.0 2023-10-12 18:53:01,774 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1149264.6666666667, ans=0.125 2023-10-12 18:53:02,910 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1149264.6666666667, ans=0.1 2023-10-12 18:53:08,279 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1149264.6666666667, ans=0.125 2023-10-12 18:53:13,274 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1149264.6666666667, ans=0.1 2023-10-12 18:53:25,908 INFO [train.py:1031] (3/4) Epoch 19, batch 500, loss[loss=0.1736, simple_loss=0.2643, pruned_loss=0.04139, over 16920.00 frames. ], tot_loss[loss=0.192, simple_loss=0.2833, pruned_loss=0.05037, over 7300078.02 frames. ], batch size: 116, lr: 1.85e-03, grad_scale: 16.0 2023-10-12 18:53:32,019 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.489e+02 1.904e+02 2.168e+02 2.566e+02 3.752e+02, threshold=4.337e+02, percent-clipped=0.0 2023-10-12 18:53:53,072 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1149451.3333333333, ans=0.1 2023-10-12 18:54:03,356 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1149498.0, ans=0.04949747468305833 2023-10-12 18:54:18,085 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1149544.6666666667, ans=0.125 2023-10-12 18:54:20,838 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1149591.3333333333, ans=0.125 2023-10-12 18:54:27,700 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1149591.3333333333, ans=0.125 2023-10-12 18:54:37,756 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1149638.0, ans=0.2 2023-10-12 18:55:05,677 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1149731.3333333333, ans=0.125 2023-10-12 18:55:12,862 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1149778.0, ans=0.125 2023-10-12 18:55:17,000 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1149778.0, ans=0.125 2023-10-12 18:55:22,392 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.23 vs. 
limit=15.0 2023-10-12 18:55:25,526 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.51 vs. limit=15.0 2023-10-12 18:55:27,495 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1149824.6666666667, ans=0.125 2023-10-12 18:55:28,238 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.484e+02 1.762e+02 1.890e+02 2.119e+02 2.858e+02, threshold=3.781e+02, percent-clipped=0.0 2023-10-12 18:55:31,160 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1149824.6666666667, ans=0.125 2023-10-12 18:55:33,491 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=10.40 vs. limit=22.5 2023-10-12 18:55:34,136 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1149871.3333333333, ans=0.1 2023-10-12 18:55:37,879 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff2.min_abs, batch_count=1149871.3333333333, ans=0.1 2023-10-12 18:55:40,063 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.76 vs. limit=15.0 2023-10-12 18:56:16,158 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1150058.0, ans=0.0 2023-10-12 18:56:52,442 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1150198.0, ans=0.125 2023-10-12 18:57:05,888 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1150244.6666666667, ans=0.125 2023-10-12 18:57:21,354 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.409e+02 1.829e+02 2.035e+02 2.382e+02 3.400e+02, threshold=4.070e+02, percent-clipped=0.0 2023-10-12 18:57:24,218 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1150291.3333333333, ans=0.2 2023-10-12 18:57:24,239 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1150291.3333333333, ans=0.125 2023-10-12 18:57:30,564 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1150338.0, ans=0.0 2023-10-12 18:57:32,471 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1150338.0, ans=0.125 2023-10-12 18:57:38,711 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff2.min_abs, batch_count=1150384.6666666667, ans=0.1 2023-10-12 18:57:57,946 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1150431.3333333333, ans=0.125 2023-10-12 18:58:10,956 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1150478.0, ans=0.125 2023-10-12 18:58:21,333 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1150524.6666666667, ans=0.2 2023-10-12 18:58:24,275 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1150571.3333333333, ans=0.0 2023-10-12 18:58:34,209 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1150618.0, ans=0.125 2023-10-12 18:58:38,743 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1150618.0, ans=0.125 2023-10-12 18:58:45,044 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1150664.6666666667, ans=0.04949747468305833 2023-10-12 18:58:52,177 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1150664.6666666667, ans=0.1 2023-10-12 18:58:54,927 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.00 vs. limit=12.0 2023-10-12 18:59:13,626 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.29 vs. limit=22.5 2023-10-12 18:59:15,643 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.496e+02 1.768e+02 1.950e+02 2.164e+02 2.971e+02, threshold=3.901e+02, percent-clipped=0.0 2023-10-12 18:59:56,597 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1150898.0, ans=0.2 2023-10-12 19:00:41,841 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.46 vs. limit=22.5 2023-10-12 19:00:45,107 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1151084.6666666667, ans=0.025 2023-10-12 19:01:04,031 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.29 vs. limit=15.0 2023-10-12 19:01:13,679 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.28 vs. limit=22.5 2023-10-12 19:01:18,202 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.470e+02 1.744e+02 1.908e+02 2.110e+02 3.156e+02, threshold=3.815e+02, percent-clipped=0.0 2023-10-12 19:01:18,435 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1151224.6666666667, ans=0.0 2023-10-12 19:01:20,408 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1151271.3333333333, ans=0.125 2023-10-12 19:01:31,592 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=5.54 vs. 
limit=15.0 2023-10-12 19:02:07,104 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1151458.0, ans=0.125 2023-10-12 19:02:11,970 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1151458.0, ans=0.125 2023-10-12 19:02:24,653 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1151504.6666666667, ans=0.0 2023-10-12 19:02:38,879 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1151598.0, ans=0.125 2023-10-12 19:02:58,481 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1151644.6666666667, ans=0.125 2023-10-12 19:03:00,177 INFO [train.py:1031] (3/4) Epoch 19, batch 1000, loss[loss=0.1968, simple_loss=0.2898, pruned_loss=0.05187, over 16363.00 frames. ], tot_loss[loss=0.1927, simple_loss=0.2839, pruned_loss=0.05073, over 12950346.00 frames. ], batch size: 50, lr: 1.85e-03, grad_scale: 8.0 2023-10-12 19:03:05,893 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1151691.3333333333, ans=0.125 2023-10-12 19:03:07,493 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.417e+02 1.693e+02 1.886e+02 2.082e+02 2.751e+02, threshold=3.772e+02, percent-clipped=0.0 2023-10-12 19:03:10,469 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1151738.0, ans=0.1 2023-10-12 19:03:21,506 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1151784.6666666667, ans=0.125 2023-10-12 19:03:43,233 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1151878.0, ans=0.125 2023-10-12 19:04:14,479 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1152018.0, ans=0.125 2023-10-12 19:04:20,124 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1152018.0, ans=0.125 2023-10-12 19:04:35,765 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.35 vs. limit=22.5 2023-10-12 19:04:38,170 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1152111.3333333333, ans=0.125 2023-10-12 19:04:38,244 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1152111.3333333333, ans=0.2 2023-10-12 19:04:50,298 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1152158.0, ans=0.125 2023-10-12 19:04:55,149 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.03 vs. 
limit=15.0 2023-10-12 19:04:55,917 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.845e+02 2.010e+02 2.263e+02 2.871e+02, threshold=4.019e+02, percent-clipped=0.0 2023-10-12 19:05:13,539 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.86 vs. limit=12.0 2023-10-12 19:05:17,852 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1152251.3333333333, ans=0.0 2023-10-12 19:06:23,956 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1152484.6666666667, ans=0.0 2023-10-12 19:06:34,120 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1152531.3333333333, ans=0.1 2023-10-12 19:06:34,364 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=5.98 vs. limit=15.0 2023-10-12 19:06:42,034 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1152578.0, ans=0.125 2023-10-12 19:06:45,386 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1152578.0, ans=0.125 2023-10-12 19:06:49,953 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1152578.0, ans=0.0 2023-10-12 19:07:02,422 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.491e+02 1.710e+02 1.940e+02 2.240e+02 3.269e+02, threshold=3.881e+02, percent-clipped=0.0 2023-10-12 19:07:13,405 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1152671.3333333333, ans=0.125 2023-10-12 19:07:27,289 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1152764.6666666667, ans=0.035 2023-10-12 19:08:06,763 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1152904.6666666667, ans=0.0 2023-10-12 19:08:11,032 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1152951.3333333333, ans=0.125 2023-10-12 19:08:14,931 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1152951.3333333333, ans=0.1 2023-10-12 19:08:16,778 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1152951.3333333333, ans=0.04949747468305833 2023-10-12 19:08:23,307 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1152998.0, ans=0.0 2023-10-12 19:08:40,025 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.75 vs. limit=22.5 2023-10-12 19:08:51,117 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.70 vs. 
limit=6.0 2023-10-12 19:08:52,350 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.381e+02 1.766e+02 1.857e+02 2.030e+02 2.896e+02, threshold=3.714e+02, percent-clipped=0.0 2023-10-12 19:08:53,972 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.82 vs. limit=15.0 2023-10-12 19:08:59,245 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1153138.0, ans=0.125 2023-10-12 19:09:01,082 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1153138.0, ans=0.0 2023-10-12 19:09:02,171 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1153138.0, ans=0.0 2023-10-12 19:09:20,645 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.45 vs. limit=15.0 2023-10-12 19:09:30,181 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1153278.0, ans=0.2 2023-10-12 19:09:39,405 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1153324.6666666667, ans=0.0 2023-10-12 19:09:47,947 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1153324.6666666667, ans=0.125 2023-10-12 19:09:47,952 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1153324.6666666667, ans=0.125 2023-10-12 19:10:28,996 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1153511.3333333333, ans=0.2 2023-10-12 19:10:35,516 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.09 vs. limit=15.0 2023-10-12 19:10:46,716 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1153558.0, ans=0.125 2023-10-12 19:10:48,402 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.388e+02 1.712e+02 1.902e+02 2.147e+02 3.147e+02, threshold=3.804e+02, percent-clipped=0.0 2023-10-12 19:10:56,499 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.12 vs. limit=6.0 2023-10-12 19:11:06,250 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.07 vs. limit=15.0 2023-10-12 19:11:24,188 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1153744.6666666667, ans=0.2 2023-10-12 19:11:29,125 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.11 vs. 
limit=22.5 2023-10-12 19:11:31,808 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1153744.6666666667, ans=0.0 2023-10-12 19:11:38,802 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1153791.3333333333, ans=0.0 2023-10-12 19:11:51,280 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1153838.0, ans=0.0 2023-10-12 19:11:57,260 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1153838.0, ans=0.0 2023-10-12 19:12:00,116 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 19:12:19,453 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1153931.3333333333, ans=0.125 2023-10-12 19:12:23,356 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1153931.3333333333, ans=0.1 2023-10-12 19:12:26,550 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1153978.0, ans=0.125 2023-10-12 19:12:43,467 INFO [train.py:1031] (3/4) Epoch 19, batch 1500, loss[loss=0.1833, simple_loss=0.2737, pruned_loss=0.04649, over 16819.00 frames. ], tot_loss[loss=0.1916, simple_loss=0.2826, pruned_loss=0.05029, over 17355248.33 frames. ], batch size: 67, lr: 1.85e-03, grad_scale: 8.0 2023-10-12 19:12:47,249 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.98 vs. limit=15.0 2023-10-12 19:12:48,124 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1154024.6666666667, ans=0.0 2023-10-12 19:12:50,898 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1154024.6666666667, ans=0.1 2023-10-12 19:12:52,688 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.384e+02 1.756e+02 1.923e+02 2.156e+02 3.062e+02, threshold=3.846e+02, percent-clipped=0.0 2023-10-12 19:13:07,953 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 19:13:25,288 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1154164.6666666667, ans=0.0 2023-10-12 19:13:26,388 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1154164.6666666667, ans=0.1 2023-10-12 19:13:30,379 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.73 vs. limit=15.0 2023-10-12 19:13:38,433 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1154211.3333333333, ans=0.1 2023-10-12 19:14:02,190 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1154304.6666666667, ans=0.0 2023-10-12 19:14:30,412 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.04 vs. 
limit=15.0 2023-10-12 19:14:37,160 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1154444.6666666667, ans=0.125 2023-10-12 19:14:41,065 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1154491.3333333333, ans=0.125 2023-10-12 19:14:48,685 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1154491.3333333333, ans=0.0 2023-10-12 19:14:49,462 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.414e+02 1.702e+02 1.870e+02 2.140e+02 3.340e+02, threshold=3.739e+02, percent-clipped=0.0 2023-10-12 19:14:57,427 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=6.15 vs. limit=15.0 2023-10-12 19:15:02,393 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1154584.6666666667, ans=0.1 2023-10-12 19:15:17,980 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1154631.3333333333, ans=0.1 2023-10-12 19:15:41,675 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1154724.6666666667, ans=0.0 2023-10-12 19:15:50,211 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.73 vs. limit=22.5 2023-10-12 19:15:51,407 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.86 vs. limit=22.5 2023-10-12 19:15:58,965 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1154771.3333333333, ans=0.125 2023-10-12 19:15:59,148 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1154771.3333333333, ans=0.0 2023-10-12 19:16:14,736 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1154864.6666666667, ans=0.2 2023-10-12 19:16:22,877 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1154864.6666666667, ans=0.125 2023-10-12 19:16:23,038 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1154864.6666666667, ans=0.0 2023-10-12 19:16:45,725 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1154958.0, ans=0.125 2023-10-12 19:16:47,370 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.446e+02 1.767e+02 1.957e+02 2.260e+02 3.832e+02, threshold=3.914e+02, percent-clipped=1.0 2023-10-12 19:17:11,999 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1155098.0, ans=0.0 2023-10-12 19:17:21,716 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1155144.6666666667, ans=0.0 2023-10-12 19:17:23,039 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.34 vs. 
limit=15.0 2023-10-12 19:17:26,693 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1155144.6666666667, ans=0.125 2023-10-12 19:17:54,968 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1155284.6666666667, ans=0.125 2023-10-12 19:18:20,689 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1155378.0, ans=0.0 2023-10-12 19:18:21,557 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1155378.0, ans=0.0 2023-10-12 19:18:22,617 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1155378.0, ans=0.0 2023-10-12 19:18:24,677 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1155378.0, ans=0.125 2023-10-12 19:18:43,903 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1155424.6666666667, ans=0.0 2023-10-12 19:18:45,875 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.486e+02 1.728e+02 1.856e+02 2.093e+02 2.791e+02, threshold=3.711e+02, percent-clipped=0.0 2023-10-12 19:18:57,800 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1155518.0, ans=0.0 2023-10-12 19:19:18,115 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1155564.6666666667, ans=0.125 2023-10-12 19:19:43,688 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 19:19:58,175 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=14.83 vs. limit=22.5 2023-10-12 19:20:22,017 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1155844.6666666667, ans=0.125 2023-10-12 19:20:24,733 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1155844.6666666667, ans=0.125 2023-10-12 19:20:30,723 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.71 vs. limit=15.0 2023-10-12 19:20:35,630 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.96 vs. limit=15.0 2023-10-12 19:20:38,875 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.72 vs. limit=12.0 2023-10-12 19:20:39,364 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.492e+02 1.733e+02 1.939e+02 2.115e+02 2.572e+02, threshold=3.879e+02, percent-clipped=0.0 2023-10-12 19:20:51,493 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1155984.6666666667, ans=0.125 2023-10-12 19:21:00,196 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.32 vs. 
limit=22.5 2023-10-12 19:21:17,157 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1156078.0, ans=0.125 2023-10-12 19:21:59,935 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1156218.0, ans=0.2 2023-10-12 19:22:14,093 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1156264.6666666667, ans=0.0 2023-10-12 19:22:14,527 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.51 vs. limit=12.0 2023-10-12 19:22:18,776 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1156264.6666666667, ans=0.0 2023-10-12 19:22:26,948 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1156311.3333333333, ans=0.125 2023-10-12 19:22:33,735 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=5.15 vs. limit=15.0 2023-10-12 19:22:39,434 INFO [train.py:1031] (3/4) Epoch 19, batch 2000, loss[loss=0.1877, simple_loss=0.2862, pruned_loss=0.04464, over 16503.00 frames. ], tot_loss[loss=0.1919, simple_loss=0.2831, pruned_loss=0.05033, over 20770443.13 frames. ], batch size: 266, lr: 1.84e-03, grad_scale: 32.0 2023-10-12 19:22:50,562 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.438e+02 1.774e+02 1.935e+02 2.138e+02 2.712e+02, threshold=3.869e+02, percent-clipped=0.0 2023-10-12 19:23:06,554 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1156451.3333333333, ans=0.125 2023-10-12 19:23:21,997 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1156498.0, ans=0.0 2023-10-12 19:23:28,747 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1156498.0, ans=0.125 2023-10-12 19:23:55,270 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 19:24:00,015 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1156591.3333333333, ans=0.0 2023-10-12 19:24:11,813 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.49 vs. 
limit=6.0 2023-10-12 19:24:16,271 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1156684.6666666667, ans=0.1 2023-10-12 19:24:17,250 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1156684.6666666667, ans=0.125 2023-10-12 19:24:30,523 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1156731.3333333333, ans=0.0 2023-10-12 19:24:30,633 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1156731.3333333333, ans=0.125 2023-10-12 19:24:43,497 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1156731.3333333333, ans=0.125 2023-10-12 19:24:49,389 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1156778.0, ans=0.125 2023-10-12 19:24:59,250 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1156824.6666666667, ans=0.125 2023-10-12 19:25:19,084 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.450e+02 1.723e+02 1.888e+02 2.197e+02 3.185e+02, threshold=3.777e+02, percent-clipped=0.0 2023-10-12 19:25:41,889 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=6.34 vs. limit=15.0 2023-10-12 19:25:48,713 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1156918.0, ans=0.125 2023-10-12 19:26:02,515 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1156964.6666666667, ans=0.1 2023-10-12 19:26:22,899 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.19 vs. limit=22.5 2023-10-12 19:26:27,545 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.63 vs. 
limit=15.0 2023-10-12 19:26:27,974 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1157058.0, ans=0.0 2023-10-12 19:26:31,583 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1157058.0, ans=0.125 2023-10-12 19:26:44,776 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1157104.6666666667, ans=0.07 2023-10-12 19:26:59,903 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1157198.0, ans=0.0 2023-10-12 19:27:02,726 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1157198.0, ans=0.0 2023-10-12 19:27:05,923 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1157198.0, ans=0.125 2023-10-12 19:27:08,083 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1157198.0, ans=0.2 2023-10-12 19:27:15,224 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1157244.6666666667, ans=0.125 2023-10-12 19:27:30,279 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1157291.3333333333, ans=0.125 2023-10-12 19:27:32,909 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1157291.3333333333, ans=0.2 2023-10-12 19:27:39,128 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.482e+02 1.810e+02 2.043e+02 2.281e+02 3.025e+02, threshold=4.086e+02, percent-clipped=0.0 2023-10-12 19:27:47,529 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=1157338.0, ans=0.07 2023-10-12 19:27:51,986 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1157384.6666666667, ans=0.1 2023-10-12 19:28:20,967 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.max_positive, batch_count=1157478.0, ans=0.95 2023-10-12 19:28:31,547 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.84 vs. 
limit=12.0 2023-10-12 19:28:37,867 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1157571.3333333333, ans=0.1 2023-10-12 19:28:57,053 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1157618.0, ans=0.1 2023-10-12 19:29:18,501 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1157711.3333333333, ans=0.125 2023-10-12 19:29:33,356 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.433e+02 1.861e+02 2.037e+02 2.495e+02 4.157e+02, threshold=4.073e+02, percent-clipped=1.0 2023-10-12 19:29:53,117 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1157898.0, ans=0.2 2023-10-12 19:30:17,583 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1157991.3333333333, ans=0.2 2023-10-12 19:30:18,666 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.66 vs. limit=15.0 2023-10-12 19:30:31,832 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1158038.0, ans=0.125 2023-10-12 19:30:34,662 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1158038.0, ans=0.125 2023-10-12 19:30:50,698 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.40 vs. limit=15.0 2023-10-12 19:31:07,830 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1158178.0, ans=0.125 2023-10-12 19:31:13,695 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.79 vs. limit=10.0 2023-10-12 19:31:24,097 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1158224.6666666667, ans=0.1 2023-10-12 19:31:29,151 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.462e+02 1.810e+02 1.946e+02 2.126e+02 2.830e+02, threshold=3.891e+02, percent-clipped=0.0 2023-10-12 19:31:53,852 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1158364.6666666667, ans=0.125 2023-10-12 19:31:59,756 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1158364.6666666667, ans=0.2 2023-10-12 19:32:44,493 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1158551.3333333333, ans=0.125 2023-10-12 19:32:45,551 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1158598.0, ans=0.125 2023-10-12 19:32:51,529 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1158598.0, ans=0.125 2023-10-12 19:33:09,006 INFO [train.py:1031] (3/4) Epoch 19, batch 2500, loss[loss=0.2414, simple_loss=0.3061, pruned_loss=0.08834, over 15647.00 frames. 
], tot_loss[loss=0.1926, simple_loss=0.2837, pruned_loss=0.05078, over 23439938.46 frames. ], batch size: 350, lr: 1.84e-03, grad_scale: 16.0 2023-10-12 19:33:09,592 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.89 vs. limit=15.0 2023-10-12 19:33:22,190 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.561e+02 1.734e+02 1.883e+02 2.082e+02 3.087e+02, threshold=3.765e+02, percent-clipped=0.0 2023-10-12 19:33:29,118 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1158738.0, ans=0.1 2023-10-12 19:33:53,137 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1158878.0, ans=0.125 2023-10-12 19:34:09,621 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.71 vs. limit=22.5 2023-10-12 19:34:10,333 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 19:34:20,385 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1158971.3333333333, ans=0.1 2023-10-12 19:34:35,854 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1159064.6666666667, ans=0.125 2023-10-12 19:34:50,505 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.35 vs. limit=15.0 2023-10-12 19:34:52,942 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.min_positive, batch_count=1159111.3333333333, ans=0.05 2023-10-12 19:35:08,899 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.415e+02 1.816e+02 1.986e+02 2.145e+02 3.249e+02, threshold=3.972e+02, percent-clipped=0.0 2023-10-12 19:35:23,509 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1159251.3333333333, ans=0.2 2023-10-12 19:36:01,533 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.13 vs. limit=22.5 2023-10-12 19:36:04,972 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1159438.0, ans=0.0 2023-10-12 19:36:06,208 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.16 vs. 
limit=15.0 2023-10-12 19:36:07,707 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1159438.0, ans=0.0 2023-10-12 19:36:11,414 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1159438.0, ans=0.125 2023-10-12 19:36:37,383 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1159531.3333333333, ans=0.0 2023-10-12 19:37:09,808 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.394e+02 1.741e+02 1.953e+02 2.159e+02 3.231e+02, threshold=3.907e+02, percent-clipped=0.0 2023-10-12 19:37:11,932 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=2.527e-03 2023-10-12 19:37:24,359 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.45 vs. limit=12.0 2023-10-12 19:37:26,118 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1159718.0, ans=0.125 2023-10-12 19:37:29,523 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1159764.6666666667, ans=0.0 2023-10-12 19:37:30,527 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1159764.6666666667, ans=0.0 2023-10-12 19:37:35,588 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1159764.6666666667, ans=0.125 2023-10-12 19:37:38,498 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1159764.6666666667, ans=0.125 2023-10-12 19:38:05,640 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1159858.0, ans=0.0 2023-10-12 19:38:33,034 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.16 vs. limit=12.0 2023-10-12 19:39:04,066 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.49 vs. limit=15.0 2023-10-12 19:39:12,513 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.325e+02 1.712e+02 1.959e+02 2.214e+02 3.268e+02, threshold=3.919e+02, percent-clipped=0.0 2023-10-12 19:39:25,167 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1160138.0, ans=0.1 2023-10-12 19:39:45,983 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.67 vs. 
limit=15.0 2023-10-12 19:40:04,352 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1160278.0, ans=0.0 2023-10-12 19:40:19,984 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1160324.6666666667, ans=0.125 2023-10-12 19:40:24,977 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1160371.3333333333, ans=0.125 2023-10-12 19:40:34,705 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.41 vs. limit=12.0 2023-10-12 19:40:49,058 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1160418.0, ans=0.125 2023-10-12 19:40:54,628 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1160464.6666666667, ans=0.0 2023-10-12 19:41:06,031 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1160511.3333333333, ans=0.035 2023-10-12 19:41:11,234 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1160511.3333333333, ans=0.1 2023-10-12 19:41:21,500 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.58 vs. limit=15.0 2023-10-12 19:41:28,431 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.44 vs. limit=15.0 2023-10-12 19:41:29,871 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.422e+02 1.745e+02 1.895e+02 2.031e+02 2.974e+02, threshold=3.791e+02, percent-clipped=0.0 2023-10-12 19:41:31,800 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1160604.6666666667, ans=0.0 2023-10-12 19:41:47,201 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1160651.3333333333, ans=0.125 2023-10-12 19:42:04,145 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=9.93 vs. limit=15.0 2023-10-12 19:42:09,305 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1160744.6666666667, ans=0.125 2023-10-12 19:42:16,377 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=1160791.3333333333, ans=0.95 2023-10-12 19:42:26,142 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1160838.0, ans=0.0 2023-10-12 19:42:42,747 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 19:42:43,660 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1160931.3333333333, ans=0.125 2023-10-12 19:43:05,382 INFO [train.py:1031] (3/4) Epoch 19, batch 3000, loss[loss=0.1893, simple_loss=0.277, pruned_loss=0.05086, over 16619.00 frames. ], tot_loss[loss=0.192, simple_loss=0.2827, pruned_loss=0.05067, over 25492598.02 frames. 
], batch size: 66, lr: 1.84e-03, grad_scale: 16.0 2023-10-12 19:43:09,739 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1161024.6666666667, ans=0.1 2023-10-12 19:43:18,982 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.393e+02 1.729e+02 1.936e+02 2.188e+02 3.279e+02, threshold=3.873e+02, percent-clipped=0.0 2023-10-12 19:43:58,965 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1161211.3333333333, ans=0.2 2023-10-12 19:44:13,375 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.19 vs. limit=15.0 2023-10-12 19:44:15,513 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1161304.6666666667, ans=0.0 2023-10-12 19:44:27,939 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1161351.3333333333, ans=0.1 2023-10-12 19:44:34,001 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.31 vs. limit=15.0 2023-10-12 19:44:39,290 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1161398.0, ans=0.2 2023-10-12 19:45:00,256 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=12.77 vs. limit=22.5 2023-10-12 19:45:13,798 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 19:45:18,389 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.581e+02 1.825e+02 2.031e+02 2.305e+02 3.403e+02, threshold=4.062e+02, percent-clipped=0.0 2023-10-12 19:45:20,592 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1161538.0, ans=0.1 2023-10-12 19:45:29,827 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1161584.6666666667, ans=0.125 2023-10-12 19:45:30,907 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1161584.6666666667, ans=0.0 2023-10-12 19:45:32,752 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1161584.6666666667, ans=0.125 2023-10-12 19:45:57,512 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.82 vs. 
limit=12.0 2023-10-12 19:46:08,098 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1161724.6666666667, ans=0.2 2023-10-12 19:46:17,660 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1161771.3333333333, ans=0.125 2023-10-12 19:46:45,911 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1161911.3333333333, ans=0.1 2023-10-12 19:46:47,770 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1161911.3333333333, ans=0.2 2023-10-12 19:46:57,289 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1161958.0, ans=0.0 2023-10-12 19:46:59,629 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.09 vs. limit=10.0 2023-10-12 19:47:02,171 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1161958.0, ans=0.125 2023-10-12 19:47:06,062 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.79 vs. limit=15.0 2023-10-12 19:47:09,695 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.421e+02 1.696e+02 1.877e+02 2.136e+02 3.018e+02, threshold=3.753e+02, percent-clipped=0.0 2023-10-12 19:47:12,408 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.03 vs. limit=22.5 2023-10-12 19:47:14,620 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1162004.6666666667, ans=0.1 2023-10-12 19:47:38,269 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1162098.0, ans=0.0 2023-10-12 19:47:54,471 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.01 vs. limit=15.0 2023-10-12 19:48:27,235 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1162284.6666666667, ans=0.125 2023-10-12 19:48:36,601 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1162284.6666666667, ans=0.125 2023-10-12 19:48:41,775 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1162331.3333333333, ans=0.125 2023-10-12 19:48:46,795 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1162331.3333333333, ans=0.1 2023-10-12 19:48:48,907 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.77 vs. 
limit=12.0 2023-10-12 19:48:52,041 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 19:49:15,769 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.389e+02 1.713e+02 1.887e+02 2.016e+02 3.327e+02, threshold=3.773e+02, percent-clipped=0.0 2023-10-12 19:49:16,108 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1162471.3333333333, ans=0.125 2023-10-12 19:49:17,129 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1162471.3333333333, ans=0.125 2023-10-12 19:49:19,394 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.62 vs. limit=22.5 2023-10-12 19:49:31,419 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1162518.0, ans=0.0 2023-10-12 19:49:48,440 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1162611.3333333333, ans=0.0 2023-10-12 19:50:13,636 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1162704.6666666667, ans=0.5 2023-10-12 19:50:17,132 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.79 vs. limit=15.0 2023-10-12 19:50:22,616 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1162704.6666666667, ans=0.125 2023-10-12 19:50:22,993 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=1162704.6666666667, ans=6.0 2023-10-12 19:50:26,109 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.09 vs. limit=22.5 2023-10-12 19:51:06,975 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1162891.3333333333, ans=0.0 2023-10-12 19:51:18,430 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.516e+02 1.748e+02 1.905e+02 2.115e+02 2.972e+02, threshold=3.811e+02, percent-clipped=0.0 2023-10-12 19:51:21,353 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1162938.0, ans=0.125 2023-10-12 19:51:30,747 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1162984.6666666667, ans=0.2 2023-10-12 19:51:53,886 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.29 vs. limit=15.0 2023-10-12 19:51:58,263 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1163078.0, ans=0.1 2023-10-12 19:52:02,814 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1163124.6666666667, ans=0.125 2023-10-12 19:52:08,263 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.29 vs. 
limit=15.0 2023-10-12 19:52:20,857 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.80 vs. limit=15.0 2023-10-12 19:52:24,108 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1163218.0, ans=0.125 2023-10-12 19:52:25,041 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1163218.0, ans=0.5 2023-10-12 19:52:59,194 INFO [train.py:1031] (3/4) Epoch 19, batch 3500, loss[loss=0.1892, simple_loss=0.2745, pruned_loss=0.05195, over 16932.00 frames. ], tot_loss[loss=0.192, simple_loss=0.2825, pruned_loss=0.05076, over 27062638.82 frames. ], batch size: 77, lr: 1.84e-03, grad_scale: 16.0 2023-10-12 19:53:01,869 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1163358.0, ans=0.2 2023-10-12 19:53:01,873 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1163358.0, ans=0.125 2023-10-12 19:53:14,127 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.467e+02 1.696e+02 1.857e+02 2.050e+02 4.486e+02, threshold=3.715e+02, percent-clipped=1.0 2023-10-12 19:53:17,542 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1163404.6666666667, ans=0.1 2023-10-12 19:53:30,577 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.90 vs. limit=22.5 2023-10-12 19:53:42,186 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1163498.0, ans=0.125 2023-10-12 19:53:50,323 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 19:54:08,400 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1163591.3333333333, ans=0.0 2023-10-12 19:54:22,899 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.79 vs. limit=15.0 2023-10-12 19:54:38,070 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1163684.6666666667, ans=0.125 2023-10-12 19:54:52,142 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.01 vs. limit=15.0 2023-10-12 19:55:15,946 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1163824.6666666667, ans=0.1 2023-10-12 19:55:22,882 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.499e+02 1.780e+02 1.941e+02 2.125e+02 2.998e+02, threshold=3.882e+02, percent-clipped=0.0 2023-10-12 19:55:31,235 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1163918.0, ans=0.125 2023-10-12 19:55:57,881 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.76 vs. 
limit=15.0 2023-10-12 19:56:06,965 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.19 vs. limit=12.0 2023-10-12 19:56:29,759 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1164151.3333333333, ans=0.2 2023-10-12 19:56:34,938 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.41 vs. limit=15.0 2023-10-12 19:56:38,432 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1164198.0, ans=0.0 2023-10-12 19:56:45,844 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1164198.0, ans=0.1 2023-10-12 19:57:17,556 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=1164338.0, ans=0.5 2023-10-12 19:57:19,589 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.459e+02 1.735e+02 1.911e+02 2.167e+02 2.897e+02, threshold=3.823e+02, percent-clipped=0.0 2023-10-12 19:57:48,379 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1164384.6666666667, ans=0.0 2023-10-12 19:57:55,598 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1164431.3333333333, ans=0.07 2023-10-12 19:58:23,727 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1164571.3333333333, ans=0.0 2023-10-12 19:58:29,776 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1164571.3333333333, ans=0.125 2023-10-12 19:58:31,522 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 19:58:31,570 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1164571.3333333333, ans=0.2 2023-10-12 19:58:38,093 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1164618.0, ans=0.0 2023-10-12 19:58:50,154 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=3.78 vs. 
limit=15.0 2023-10-12 19:59:01,113 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1164664.6666666667, ans=0.125 2023-10-12 19:59:13,915 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1164758.0, ans=0.0 2023-10-12 19:59:23,453 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1164758.0, ans=0.125 2023-10-12 19:59:28,486 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.432e+02 1.769e+02 1.898e+02 2.114e+02 2.786e+02, threshold=3.796e+02, percent-clipped=0.0 2023-10-12 19:59:45,662 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff3.min_abs, batch_count=1164851.3333333333, ans=0.2 2023-10-12 20:00:00,250 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1164944.6666666667, ans=0.125 2023-10-12 20:00:00,362 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1164944.6666666667, ans=0.07 2023-10-12 20:00:07,760 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 20:00:11,641 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=1164991.3333333333, ans=22.5 2023-10-12 20:00:38,046 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1165084.6666666667, ans=0.0 2023-10-12 20:01:18,430 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.40 vs. limit=15.0 2023-10-12 20:01:26,335 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.310e+02 1.729e+02 1.960e+02 2.149e+02 3.461e+02, threshold=3.920e+02, percent-clipped=0.0 2023-10-12 20:01:45,309 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1165364.6666666667, ans=0.125 2023-10-12 20:01:51,972 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1165364.6666666667, ans=0.0 2023-10-12 20:01:58,203 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1165411.3333333333, ans=0.1 2023-10-12 20:02:05,602 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1165458.0, ans=0.125 2023-10-12 20:02:12,847 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.63 vs. limit=22.5 2023-10-12 20:02:24,721 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1165504.6666666667, ans=0.125 2023-10-12 20:02:45,882 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.70 vs. 
limit=15.0 2023-10-12 20:02:58,745 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=1165644.6666666667, ans=0.5 2023-10-12 20:03:01,860 INFO [train.py:1031] (3/4) Epoch 19, batch 4000, loss[loss=0.1837, simple_loss=0.2764, pruned_loss=0.04552, over 16127.00 frames. ], tot_loss[loss=0.1918, simple_loss=0.2821, pruned_loss=0.05073, over 28320188.83 frames. ], batch size: 43, lr: 1.84e-03, grad_scale: 32.0 2023-10-12 20:03:19,922 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1165738.0, ans=0.1 2023-10-12 20:03:21,150 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1165738.0, ans=0.1 2023-10-12 20:03:21,704 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.376e+02 1.753e+02 1.923e+02 2.126e+02 2.945e+02, threshold=3.846e+02, percent-clipped=0.0 2023-10-12 20:03:28,287 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.28 vs. limit=15.0 2023-10-12 20:03:36,271 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.34 vs. limit=22.5 2023-10-12 20:03:42,254 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1165831.3333333333, ans=0.125 2023-10-12 20:03:55,999 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1165878.0, ans=0.0 2023-10-12 20:04:09,407 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1165924.6666666667, ans=0.125 2023-10-12 20:04:15,211 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1165971.3333333333, ans=0.0 2023-10-12 20:04:55,286 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1166111.3333333333, ans=0.2 2023-10-12 20:04:59,659 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1166158.0, ans=0.125 2023-10-12 20:05:00,859 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.25 vs. 
limit=15.0 2023-10-12 20:05:14,605 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.523e+02 1.859e+02 1.970e+02 2.288e+02 3.382e+02, threshold=3.940e+02, percent-clipped=0.0 2023-10-12 20:05:15,090 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1166204.6666666667, ans=0.125 2023-10-12 20:05:25,377 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1166251.3333333333, ans=0.1 2023-10-12 20:05:35,743 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1166298.0, ans=0.125 2023-10-12 20:06:13,568 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1166438.0, ans=0.125 2023-10-12 20:06:30,551 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1166484.6666666667, ans=0.1 2023-10-12 20:06:45,064 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1166531.3333333333, ans=0.125 2023-10-12 20:06:52,291 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.19 vs. limit=22.5 2023-10-12 20:06:59,255 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_na.min_abs, batch_count=1166578.0, ans=0.02 2023-10-12 20:07:25,026 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.418e+02 1.789e+02 1.937e+02 2.231e+02 2.968e+02, threshold=3.875e+02, percent-clipped=0.0 2023-10-12 20:07:31,387 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1166718.0, ans=0.125 2023-10-12 20:07:58,202 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1166764.6666666667, ans=0.125 2023-10-12 20:08:01,429 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1166764.6666666667, ans=0.5 2023-10-12 20:08:04,320 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1166811.3333333333, ans=0.07 2023-10-12 20:08:08,452 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1166811.3333333333, ans=0.0 2023-10-12 20:08:11,866 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1166811.3333333333, ans=0.0 2023-10-12 20:08:23,011 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff2.min_abs, batch_count=1166858.0, ans=0.1 2023-10-12 20:08:35,093 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1166904.6666666667, ans=0.125 2023-10-12 20:08:54,968 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1166998.0, ans=0.0 2023-10-12 20:09:10,149 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1167044.6666666667, ans=0.07 2023-10-12 20:09:23,538 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1167091.3333333333, ans=0.2 2023-10-12 20:09:27,184 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1167138.0, ans=0.125 2023-10-12 20:09:31,041 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1167138.0, ans=0.0 2023-10-12 20:09:32,016 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.466e+02 1.783e+02 1.908e+02 2.066e+02 2.889e+02, threshold=3.816e+02, percent-clipped=0.0 2023-10-12 20:09:47,775 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten.whitening_limit, batch_count=1167231.3333333333, ans=22.5 2023-10-12 20:09:57,697 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1167278.0, ans=0.125 2023-10-12 20:10:19,108 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_ff3.min_abs, batch_count=1167371.3333333333, ans=0.2 2023-10-12 20:10:47,141 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1167464.6666666667, ans=0.125 2023-10-12 20:10:54,310 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1167464.6666666667, ans=0.5 2023-10-12 20:11:19,070 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1167604.6666666667, ans=0.125 2023-10-12 20:11:27,402 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.482e+02 1.820e+02 1.974e+02 2.184e+02 2.854e+02, threshold=3.947e+02, percent-clipped=0.0 2023-10-12 20:11:40,078 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1167651.3333333333, ans=0.125 2023-10-12 20:11:45,508 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1167698.0, ans=0.0 2023-10-12 20:12:03,005 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1167744.6666666667, ans=0.0 2023-10-12 20:12:03,257 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.30 vs. limit=15.0 2023-10-12 20:12:35,084 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1167838.0, ans=0.125 2023-10-12 20:12:45,017 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1167884.6666666667, ans=0.0 2023-10-12 20:12:59,051 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.25 vs. limit=15.0 2023-10-12 20:13:02,995 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1167978.0, ans=0.125 2023-10-12 20:13:09,068 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1167978.0, ans=0.125 2023-10-12 20:13:12,702 INFO [train.py:1031] (3/4) Epoch 19, batch 4500, loss[loss=0.1872, simple_loss=0.2762, pruned_loss=0.04907, over 16057.00 frames. 
2023-10-12 20:13:12,702 INFO [train.py:1031] (3/4) Epoch 19, batch 4500, loss[loss=0.1872, simple_loss=0.2762, pruned_loss=0.04907, over 16057.00 frames. ], tot_loss[loss=0.1918, simple_loss=0.2826, pruned_loss=0.05051, over 29335691.76 frames. ], batch size: 296, lr: 1.83e-03, grad_scale: 16.0 2023-10-12 20:13:13,003 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1168024.6666666667, ans=0.0 2023-10-12 20:13:30,240 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.459e+02 1.720e+02 1.897e+02 2.084e+02 2.769e+02, threshold=3.793e+02, percent-clipped=0.0 2023-10-12 20:13:36,394 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.07 vs. limit=15.0 2023-10-12 20:13:44,700 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1168164.6666666667, ans=0.125 2023-10-12 20:13:46,026 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.80 vs. limit=15.0 2023-10-12 20:14:06,396 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1168211.3333333333, ans=0.1 2023-10-12 20:14:28,349 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1168304.6666666667, ans=0.0 2023-10-12 20:15:06,218 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=1168491.3333333333, ans=0.05 2023-10-12 20:15:10,200 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1168491.3333333333, ans=0.125 2023-10-12 20:15:13,366 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1168538.0, ans=0.125 2023-10-12 20:15:19,338 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1168538.0, ans=0.07 2023-10-12 20:15:20,073 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.509e+02 1.764e+02 1.964e+02 2.300e+02 3.011e+02, threshold=3.929e+02, percent-clipped=0.0 2023-10-12 20:15:20,474 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1168538.0, ans=0.125 2023-10-12 20:15:36,540 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1168631.3333333333, ans=0.07 2023-10-12 20:15:43,155 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1168631.3333333333, ans=0.125 2023-10-12 20:15:43,977 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1168631.3333333333, ans=0.0 2023-10-12 20:15:54,060 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1168678.0, ans=0.0 2023-10-12 20:15:54,143 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1168678.0, ans=0.125 2023-10-12 20:15:56,990 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1168724.6666666667, ans=0.125
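The [train.py:1031] entry opening this stretch reports the pruned-transducer training objective at Epoch 19, batch 4500: loss[...] is the current batch (here over 16057 frames) and tot_loss[...] a running average over a much larger, slowly growing window of recent frames. The printed components are consistent with loss = 0.5 * simple_loss + pruned_loss (0.5 * 0.2826 + 0.05051 = 0.1918), and grad_scale is the mixed-precision loss scale in effect. A sketch of a decayed running tracker that would produce numbers of this shape (the decay constant is an assumption; note how a decayed frame count yields fractional totals like "over 29335691.76 frames"):

class RunningLoss:
    def __init__(self, decay=0.999):
        self.decay = decay    # forgetting factor applied once per batch
        self.loss_sum = 0.0   # decayed sum of per-frame losses
        self.frames = 0.0     # decayed frame count (hence fractional totals)

    def update(self, batch_loss, batch_frames):
        self.loss_sum = self.loss_sum * self.decay + batch_loss * batch_frames
        self.frames = self.frames * self.decay + batch_frames

    @property
    def loss(self):
        return self.loss_sum / max(self.frames, 1.0)

tracker = RunningLoss()
tracker.update(batch_loss=0.1872, batch_frames=16057.0)
print(f"tot_loss[loss={tracker.loss:.4g}, over {tracker.frames:.2f} frames]")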
2023-10-12 20:15:59,917 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1168724.6666666667, ans=0.04949747468305833 2023-10-12 20:16:00,844 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1168724.6666666667, ans=0.125 2023-10-12 20:16:10,670 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1168771.3333333333, ans=0.125 2023-10-12 20:16:48,215 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1168911.3333333333, ans=0.0 2023-10-12 20:16:50,662 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 20:17:20,858 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.393e+02 1.738e+02 1.884e+02 2.098e+02 2.757e+02, threshold=3.768e+02, percent-clipped=0.0 2023-10-12 20:17:28,454 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1169051.3333333333, ans=0.125 2023-10-12 20:17:42,882 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1169098.0, ans=0.2 2023-10-12 20:17:52,104 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.48 vs. limit=15.0 2023-10-12 20:17:53,016 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1169144.6666666667, ans=0.0 2023-10-12 20:17:59,901 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.02 vs. limit=15.0 2023-10-12 20:18:03,202 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1169191.3333333333, ans=0.0 2023-10-12 20:18:03,210 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-12 20:18:18,219 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1169238.0, ans=0.1 2023-10-12 20:18:30,048 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1169331.3333333333, ans=0.0 2023-10-12 20:19:10,076 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.49 vs. limit=12.0 2023-10-12 20:19:12,294 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.407e+02 1.663e+02 1.834e+02 2.005e+02 2.813e+02, threshold=3.668e+02, percent-clipped=0.0 2023-10-12 20:19:14,918 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.04 vs. limit=15.0
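The [scaling.py:1069] WithLoss lines track small auxiliary penalties attached to the named self-attention-weight tensors; loss-sum=0.000e+00 in every such entry here, i.e. the attention weights sit inside whatever region the penalty allows, so the regularizer contributes nothing to the gradient at these batches. The attach-and-log pattern, as a sketch (the penalty function below is a placeholder, not what scaling.py computes):

import torch

def attach_aux_loss(x, name, penalty_fn, aux_losses, scale=1.0):
    # x passes through unchanged; the penalty is recorded so the trainer
    # can add it to the objective, and its detached sum is logged in the
    # same format as the WithLoss lines above.
    aux = scale * penalty_fn(x)
    aux_losses[name] = aux
    print(f"WithLoss: name={name}, loss-sum={aux.detach().sum().item():.3e}")
    return x

aux_losses = {}
attn = torch.softmax(torch.randn(4, 16, 16), dim=-1)
attn = attach_aux_loss(attn, "self_attn_weights",
                       lambda w: torch.relu(w - 1.0).sum(),  # zero while w <= 1
                       aux_losses)
# total_loss = main_loss + sum(aux_losses.values())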
2023-10-12 20:19:15,799 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.24 vs. limit=15.0 2023-10-12 20:20:01,092 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1169658.0, ans=0.125 2023-10-12 20:20:23,677 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1169751.3333333333, ans=0.125 2023-10-12 20:20:25,211 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1169751.3333333333, ans=0.125 2023-10-12 20:20:28,086 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff3.min_abs, batch_count=1169798.0, ans=0.2 2023-10-12 20:20:35,848 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.47 vs. limit=15.0 2023-10-12 20:20:37,596 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 20:20:49,413 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1169891.3333333333, ans=0.0 2023-10-12 20:20:57,751 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.74 vs. limit=22.5 2023-10-12 20:21:09,929 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.421e+02 1.788e+02 1.990e+02 2.211e+02 3.307e+02, threshold=3.980e+02, percent-clipped=0.0 2023-10-12 20:21:10,635 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1169938.0, ans=6.0 2023-10-12 20:21:13,760 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.86 vs. limit=15.0 2023-10-12 20:21:18,259 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1169984.6666666667, ans=0.0 2023-10-12 20:21:23,891 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=11.94 vs. limit=22.5 2023-10-12 20:21:27,819 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1170031.3333333333, ans=0.0 2023-10-12 20:21:54,565 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1170124.6666666667, ans=0.125 2023-10-12 20:22:08,301 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.52 vs. limit=22.5 2023-10-12 20:22:21,204 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1170218.0, ans=0.1 2023-10-12 20:22:38,309 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1170311.3333333333, ans=0.0
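The [scaling.py:979] Whitening lines compare a statistic of each module's activation covariance (metric=...) against a scheduled limit. The metric is near 1 when the channels of a group are close to white (decorrelated, with equal variance) and grows as a few directions dominate; a corrective penalty only kicks in once the metric exceeds the limit, which is why entries such as metric=13.52 vs. limit=22.5 are logged without consequence. One plausible whiteness measure, as a sketch (the exact statistic in scaling.py may differ):

import torch

def whitening_metric(x, num_groups=1):
    # x: (frames, channels). Returns >= 1.0; equals 1.0 iff the covariance
    # eigenvalues within each group are all equal, i.e. the group is white.
    channels = x.shape[1]
    group = channels // num_groups
    metrics = []
    for g in range(num_groups):
        xg = x[:, g * group:(g + 1) * group]
        xg = xg - xg.mean(dim=0)
        cov = xg.T @ xg / xg.shape[0]
        eigs = torch.linalg.eigvalsh(cov)  # real, since cov is symmetric
        metrics.append((eigs ** 2).mean() / eigs.mean() ** 2)
    return torch.stack(metrics).mean()

x = torch.randn(2000, 384)  # white input
print(f"metric={whitening_metric(x).item():.2f} vs. limit=15.0")  # close to 1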
2023-10-12 20:22:45,289 INFO [train.py:1031] (3/4) Epoch 19, batch 5000, loss[loss=0.2198, simple_loss=0.3021, pruned_loss=0.06876, over 16408.00 frames. ], tot_loss[loss=0.1919, simple_loss=0.2825, pruned_loss=0.05063, over 30096185.80 frames. ], batch size: 50, lr: 1.83e-03, grad_scale: 32.0 2023-10-12 20:22:50,021 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1170358.0, ans=0.0 2023-10-12 20:23:04,671 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.433e+02 1.782e+02 1.990e+02 2.188e+02 3.191e+02, threshold=3.981e+02, percent-clipped=0.0 2023-10-12 20:23:12,223 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1170451.3333333333, ans=0.125 2023-10-12 20:23:13,228 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1170451.3333333333, ans=0.125 2023-10-12 20:23:47,676 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1170591.3333333333, ans=0.0 2023-10-12 20:23:50,586 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1170591.3333333333, ans=0.125 2023-10-12 20:23:51,566 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1170638.0, ans=0.125 2023-10-12 20:24:03,295 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=1170684.6666666667, ans=0.025 2023-10-12 20:24:23,424 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1170731.3333333333, ans=0.04949747468305833 2023-10-12 20:24:30,271 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1170778.0, ans=0.125 2023-10-12 20:24:32,431 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.48 vs. limit=22.5 2023-10-12 20:24:45,419 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1170824.6666666667, ans=0.125 2023-10-12 20:24:54,973 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1170871.3333333333, ans=0.125 2023-10-12 20:25:01,060 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.560e+02 1.757e+02 1.915e+02 2.080e+02 3.362e+02, threshold=3.830e+02, percent-clipped=0.0 2023-10-12 20:25:07,986 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1170918.0, ans=0.125 2023-10-12 20:25:14,961 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 20:25:37,162 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1171011.3333333333, ans=0.0 2023-10-12 20:25:46,510 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.84 vs. limit=22.5 2023-10-12 20:25:51,741 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1171104.6666666667, ans=0.1 2023-10-12 20:26:00,549 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.32 vs.
limit=15.0 2023-10-12 20:26:35,353 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1171291.3333333333, ans=0.2 2023-10-12 20:26:38,011 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1171291.3333333333, ans=0.125 2023-10-12 20:26:42,646 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1171291.3333333333, ans=0.2 2023-10-12 20:26:47,501 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 20:26:52,134 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.38 vs. limit=15.0 2023-10-12 20:26:53,354 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.387e+02 1.794e+02 1.947e+02 2.243e+02 2.906e+02, threshold=3.894e+02, percent-clipped=0.0 2023-10-12 20:26:59,388 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1171384.6666666667, ans=0.125 2023-10-12 20:26:59,424 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1171384.6666666667, ans=0.0 2023-10-12 20:27:00,360 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1171384.6666666667, ans=0.0 2023-10-12 20:27:25,084 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1171478.0, ans=0.125 2023-10-12 20:27:30,155 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.66 vs. limit=15.0 2023-10-12 20:27:31,484 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.83 vs. limit=6.0 2023-10-12 20:27:51,442 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1171571.3333333333, ans=0.1 2023-10-12 20:28:08,813 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1171664.6666666667, ans=0.0 2023-10-12 20:28:15,426 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1171664.6666666667, ans=0.0 2023-10-12 20:28:16,254 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1171664.6666666667, ans=0.125 2023-10-12 20:28:22,142 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1171711.3333333333, ans=0.2 2023-10-12 20:28:34,783 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.25 vs. limit=15.0 2023-10-12 20:28:43,863 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=1171804.6666666667, ans=10.0 2023-10-12 20:28:46,318 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.92 vs. 
limit=6.0 2023-10-12 20:28:50,697 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.459e+02 1.677e+02 1.877e+02 2.102e+02 2.700e+02, threshold=3.755e+02, percent-clipped=0.0 2023-10-12 20:29:15,334 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1171898.0, ans=0.125 2023-10-12 20:29:29,081 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1171944.6666666667, ans=0.125 2023-10-12 20:29:35,819 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1171991.3333333333, ans=0.0 2023-10-12 20:29:57,022 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1172038.0, ans=0.0 2023-10-12 20:30:15,872 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1172084.6666666667, ans=0.0 2023-10-12 20:30:30,554 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.whiten.whitening_limit, batch_count=1172131.3333333333, ans=12.0 2023-10-12 20:30:48,887 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1172224.6666666667, ans=0.2 2023-10-12 20:31:05,548 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.393e+02 1.718e+02 1.906e+02 2.162e+02 3.312e+02, threshold=3.811e+02, percent-clipped=0.0 2023-10-12 20:31:16,133 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1172318.0, ans=0.0 2023-10-12 20:31:34,783 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1172411.3333333333, ans=0.0 2023-10-12 20:32:00,315 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.37 vs. limit=15.0 2023-10-12 20:32:20,405 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1172504.6666666667, ans=0.1 2023-10-12 20:32:21,274 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1172504.6666666667, ans=0.125 2023-10-12 20:32:24,404 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.69 vs. limit=10.0 2023-10-12 20:32:38,759 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.78 vs. limit=22.5 2023-10-12 20:32:51,742 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.88 vs. limit=15.0 2023-10-12 20:32:56,223 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1172644.6666666667, ans=0.1 2023-10-12 20:32:59,646 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 20:33:01,866 INFO [train.py:1031] (3/4) Epoch 19, batch 5500, loss[loss=0.1795, simple_loss=0.2678, pruned_loss=0.04567, over 16092.00 frames. 
], tot_loss[loss=0.1915, simple_loss=0.2823, pruned_loss=0.05033, over 30742143.89 frames. ], batch size: 43, lr: 1.83e-03, grad_scale: 16.0 2023-10-12 20:33:02,514 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.11 vs. limit=15.0 2023-10-12 20:33:08,349 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=1172691.3333333333, ans=15.0 2023-10-12 20:33:13,882 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 20:33:18,224 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.36 vs. limit=6.0 2023-10-12 20:33:25,323 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.395e+02 1.702e+02 1.922e+02 2.285e+02 3.676e+02, threshold=3.843e+02, percent-clipped=0.0 2023-10-12 20:33:33,837 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1172784.6666666667, ans=0.125 2023-10-12 20:33:34,668 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1172784.6666666667, ans=0.125 2023-10-12 20:33:37,376 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1172831.3333333333, ans=0.0 2023-10-12 20:33:54,149 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1172878.0, ans=0.2 2023-10-12 20:33:54,404 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.79 vs. limit=12.0 2023-10-12 20:33:59,647 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1172924.6666666667, ans=0.125 2023-10-12 20:34:14,331 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.24 vs. limit=15.0 2023-10-12 20:34:55,055 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=1173158.0, ans=15.0 2023-10-12 20:35:13,454 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.12 vs. 
limit=22.5 2023-10-12 20:35:16,027 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.546e+02 1.820e+02 2.005e+02 2.215e+02 3.353e+02, threshold=4.010e+02, percent-clipped=0.0 2023-10-12 20:35:55,458 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1173344.6666666667, ans=0.0 2023-10-12 20:36:14,186 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1173438.0, ans=0.0 2023-10-12 20:36:15,133 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1173438.0, ans=0.1 2023-10-12 20:36:16,999 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1173438.0, ans=0.0 2023-10-12 20:36:21,447 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1173484.6666666667, ans=0.125 2023-10-12 20:36:21,536 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=1173484.6666666667, ans=10.0 2023-10-12 20:36:45,311 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=8.63 vs. limit=22.5 2023-10-12 20:37:04,213 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1173624.6666666667, ans=10.0 2023-10-12 20:37:15,898 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.411e+02 1.714e+02 1.871e+02 2.033e+02 2.729e+02, threshold=3.743e+02, percent-clipped=0.0 2023-10-12 20:37:21,231 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.17 vs. limit=10.0 2023-10-12 20:37:31,769 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1173764.6666666667, ans=0.0 2023-10-12 20:37:46,069 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.51 vs. limit=15.0 2023-10-12 20:38:02,033 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 20:38:10,712 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1173904.6666666667, ans=0.125 2023-10-12 20:38:16,308 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1173951.3333333333, ans=0.125 2023-10-12 20:38:21,741 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1173951.3333333333, ans=0.2 2023-10-12 20:38:34,531 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=5.79 vs. 
limit=15.0 2023-10-12 20:38:37,794 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1174044.6666666667, ans=0.125 2023-10-12 20:38:38,647 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1174044.6666666667, ans=0.0 2023-10-12 20:38:56,703 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.86 vs. limit=15.0 2023-10-12 20:39:10,871 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1174138.0, ans=0.0 2023-10-12 20:39:12,966 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.469e+02 1.781e+02 2.002e+02 2.350e+02 3.222e+02, threshold=4.003e+02, percent-clipped=0.0 2023-10-12 20:40:19,343 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1174418.0, ans=0.2 2023-10-12 20:40:55,926 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1174558.0, ans=0.1 2023-10-12 20:41:16,247 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1174604.6666666667, ans=0.125 2023-10-12 20:41:19,003 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.414e+02 1.793e+02 1.964e+02 2.220e+02 2.804e+02, threshold=3.929e+02, percent-clipped=0.0 2023-10-12 20:41:30,499 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.68 vs. limit=15.0 2023-10-12 20:41:40,237 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1174698.0, ans=0.0 2023-10-12 20:41:41,943 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1174698.0, ans=0.1 2023-10-12 20:42:03,559 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.38 vs. limit=15.0 2023-10-12 20:42:04,089 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1174791.3333333333, ans=0.2 2023-10-12 20:42:07,964 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.96 vs. limit=22.5 2023-10-12 20:42:11,076 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1174838.0, ans=0.2 2023-10-12 20:42:12,963 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1174838.0, ans=0.125 2023-10-12 20:42:55,484 INFO [train.py:1031] (3/4) Epoch 19, batch 6000, loss[loss=0.202, simple_loss=0.2914, pruned_loss=0.05635, over 16579.00 frames. ], tot_loss[loss=0.192, simple_loss=0.2827, pruned_loss=0.05068, over 31191508.32 frames. 
], batch size: 219, lr: 1.83e-03, grad_scale: 16.0 2023-10-12 20:43:01,326 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1175024.6666666667, ans=0.125 2023-10-12 20:43:13,883 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.89 vs. limit=15.0 2023-10-12 20:43:17,876 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.min_positive, batch_count=1175071.3333333333, ans=0.025 2023-10-12 20:43:20,094 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.490e+02 1.830e+02 2.020e+02 2.266e+02 4.107e+02, threshold=4.040e+02, percent-clipped=2.0 2023-10-12 20:43:32,322 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.12 vs. limit=15.0 2023-10-12 20:43:32,999 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1175164.6666666667, ans=0.2 2023-10-12 20:43:35,108 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1175164.6666666667, ans=0.5 2023-10-12 20:43:40,866 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.85 vs. limit=15.0 2023-10-12 20:43:50,629 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1175211.3333333333, ans=0.1 2023-10-12 20:44:04,304 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1175258.0, ans=0.2 2023-10-12 20:44:09,714 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1175304.6666666667, ans=0.0 2023-10-12 20:44:37,784 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=5.01 vs. 
limit=15.0 2023-10-12 20:44:48,109 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1175444.6666666667, ans=0.125 2023-10-12 20:45:13,273 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.389e+02 1.719e+02 1.859e+02 2.078e+02 3.561e+02, threshold=3.718e+02, percent-clipped=0.0 2023-10-12 20:45:29,948 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1175631.3333333333, ans=0.125 2023-10-12 20:45:35,263 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1175631.3333333333, ans=0.1 2023-10-12 20:46:00,288 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1175771.3333333333, ans=0.125 2023-10-12 20:46:06,921 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1175771.3333333333, ans=0.125 2023-10-12 20:46:57,560 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1176004.6666666667, ans=0.125 2023-10-12 20:47:03,896 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1176004.6666666667, ans=0.125 2023-10-12 20:47:08,853 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.529e+02 1.842e+02 2.003e+02 2.220e+02 3.919e+02, threshold=4.007e+02, percent-clipped=1.0 2023-10-12 20:47:17,479 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.25 vs. limit=15.0 2023-10-12 20:47:25,430 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.32 vs. limit=10.0 2023-10-12 20:47:40,122 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1176144.6666666667, ans=0.125 2023-10-12 20:47:44,617 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1176144.6666666667, ans=0.1 2023-10-12 20:47:48,403 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1176191.3333333333, ans=0.1 2023-10-12 20:48:02,685 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.87 vs. limit=15.0 2023-10-12 20:48:15,073 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.27 vs. limit=15.0 2023-10-12 20:48:33,078 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.41 vs. 
limit=15.0 2023-10-12 20:48:42,815 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1176378.0, ans=0.0 2023-10-12 20:48:46,861 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1176424.6666666667, ans=0.125 2023-10-12 20:48:48,581 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1176424.6666666667, ans=0.125 2023-10-12 20:48:57,306 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.92 vs. limit=15.0 2023-10-12 20:48:58,980 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=1176471.3333333333, ans=15.0 2023-10-12 20:49:08,889 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.493e+02 1.867e+02 2.076e+02 2.307e+02 3.120e+02, threshold=4.153e+02, percent-clipped=0.0 2023-10-12 20:49:10,336 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.34 vs. limit=15.0 2023-10-12 20:49:37,563 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1176611.3333333333, ans=0.1 2023-10-12 20:50:34,176 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.40 vs. limit=15.0 2023-10-12 20:51:14,815 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.425e+02 1.642e+02 1.821e+02 2.068e+02 2.710e+02, threshold=3.642e+02, percent-clipped=0.0 2023-10-12 20:51:20,555 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1176984.6666666667, ans=0.125 2023-10-12 20:51:30,489 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1177031.3333333333, ans=0.0 2023-10-12 20:51:32,936 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=13.52 vs. limit=15.0 2023-10-12 20:51:33,413 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1177031.3333333333, ans=0.0 2023-10-12 20:51:36,623 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1177078.0, ans=0.09899494936611666 2023-10-12 20:51:43,246 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1177078.0, ans=0.125 2023-10-12 20:51:53,058 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1177124.6666666667, ans=0.125 2023-10-12 20:52:06,249 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1177171.3333333333, ans=0.2 2023-10-12 20:52:44,414 INFO [train.py:1031] (3/4) Epoch 19, batch 6500, loss[loss=0.2673, simple_loss=0.3328, pruned_loss=0.1009, over 15573.00 frames. ], tot_loss[loss=0.1923, simple_loss=0.2831, pruned_loss=0.05076, over 31544589.97 frames. 
], batch size: 350, lr: 1.83e-03, grad_scale: 32.0 2023-10-12 20:52:44,645 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1177358.0, ans=0.2 2023-10-12 20:52:44,649 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1177358.0, ans=0.0 2023-10-12 20:52:47,768 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.86 vs. limit=15.0 2023-10-12 20:53:10,616 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.490e+02 1.818e+02 1.976e+02 2.310e+02 3.595e+02, threshold=3.952e+02, percent-clipped=0.0 2023-10-12 20:53:10,875 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1177451.3333333333, ans=0.125 2023-10-12 20:53:16,170 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 20:53:45,657 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1177544.6666666667, ans=0.5 2023-10-12 20:54:00,556 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1177591.3333333333, ans=0.2 2023-10-12 20:54:03,526 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=6.96 vs. limit=15.0 2023-10-12 20:54:08,358 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.67 vs. limit=15.0 2023-10-12 20:54:15,491 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1177684.6666666667, ans=0.1 2023-10-12 20:54:18,093 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1177684.6666666667, ans=0.0 2023-10-12 20:54:27,144 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1177731.3333333333, ans=0.1 2023-10-12 20:54:31,977 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 20:54:35,846 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.75 vs. limit=15.0 2023-10-12 20:54:38,955 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.60 vs. 
limit=6.0 2023-10-12 20:54:44,441 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1177778.0, ans=0.1 2023-10-12 20:54:48,777 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1177824.6666666667, ans=0.1 2023-10-12 20:54:58,049 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 20:55:00,194 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1177871.3333333333, ans=0.125 2023-10-12 20:55:08,822 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.408e+02 1.788e+02 1.992e+02 2.194e+02 2.769e+02, threshold=3.984e+02, percent-clipped=0.0 2023-10-12 20:55:16,563 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1177918.0, ans=0.95 2023-10-12 20:55:18,507 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=6.55 vs. limit=15.0 2023-10-12 20:55:24,509 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1177964.6666666667, ans=0.125 2023-10-12 20:55:25,484 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=1177964.6666666667, ans=0.05 2023-10-12 20:55:26,347 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 20:55:26,403 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1177964.6666666667, ans=0.0 2023-10-12 20:55:47,861 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.52 vs. limit=22.5 2023-10-12 20:55:49,818 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.62 vs. limit=15.0 2023-10-12 20:56:02,778 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1178151.3333333333, ans=0.125 2023-10-12 20:56:03,853 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1178151.3333333333, ans=0.125 2023-10-12 20:56:13,400 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.48 vs. limit=15.0 2023-10-12 20:56:15,000 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1178198.0, ans=0.125 2023-10-12 20:56:15,253 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.26 vs. limit=6.0 2023-10-12 20:56:30,046 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1178244.6666666667, ans=0.0 2023-10-12 20:56:42,558 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.01 vs. 
limit=22.5 2023-10-12 20:56:55,620 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.509e+02 1.751e+02 1.958e+02 2.215e+02 3.116e+02, threshold=3.917e+02, percent-clipped=0.0 2023-10-12 20:56:58,592 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1178384.6666666667, ans=0.0 2023-10-12 20:57:04,135 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1178384.6666666667, ans=0.125 2023-10-12 20:58:02,912 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1178618.0, ans=0.125 2023-10-12 20:58:28,063 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1178711.3333333333, ans=0.0 2023-10-12 20:58:32,862 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.38 vs. limit=15.0 2023-10-12 20:58:46,350 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1178758.0, ans=0.1 2023-10-12 20:58:46,392 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1178758.0, ans=0.0 2023-10-12 20:58:54,805 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=1178804.6666666667, ans=15.0 2023-10-12 20:58:56,775 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1178804.6666666667, ans=0.125 2023-10-12 20:58:58,263 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1178804.6666666667, ans=0.125 2023-10-12 20:58:58,298 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1178804.6666666667, ans=0.2 2023-10-12 20:59:01,399 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1178804.6666666667, ans=0.125 2023-10-12 20:59:01,478 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1178804.6666666667, ans=0.125 2023-10-12 20:59:06,435 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.404e+02 1.698e+02 1.899e+02 2.103e+02 3.320e+02, threshold=3.798e+02, percent-clipped=0.0 2023-10-12 20:59:13,287 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1178851.3333333333, ans=0.0 2023-10-12 20:59:34,394 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1178944.6666666667, ans=0.125 2023-10-12 20:59:41,945 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1178991.3333333333, ans=0.09899494936611666 2023-10-12 20:59:50,266 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1178991.3333333333, ans=0.125 2023-10-12 21:00:07,108 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1179084.6666666667, ans=0.1 2023-10-12 
21:00:08,053 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1179084.6666666667, ans=0.0 2023-10-12 21:00:08,155 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1179084.6666666667, ans=0.125 2023-10-12 21:00:13,564 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1179084.6666666667, ans=0.0 2023-10-12 21:00:14,619 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_ff3.min_abs, batch_count=1179084.6666666667, ans=0.2 2023-10-12 21:00:36,705 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1179178.0, ans=0.1 2023-10-12 21:00:43,197 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1179224.6666666667, ans=0.04949747468305833 2023-10-12 21:00:47,345 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.67 vs. limit=22.5 2023-10-12 21:01:01,980 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1179271.3333333333, ans=0.0 2023-10-12 21:01:05,995 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.409e+02 1.664e+02 1.844e+02 1.997e+02 2.539e+02, threshold=3.688e+02, percent-clipped=0.0 2023-10-12 21:01:13,057 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1179318.0, ans=0.2 2023-10-12 21:01:17,332 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.30 vs. limit=22.5 2023-10-12 21:01:30,657 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1179411.3333333333, ans=0.0 2023-10-12 21:01:34,693 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.96 vs. limit=15.0 2023-10-12 21:01:39,872 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1179458.0, ans=0.0 2023-10-12 21:01:42,883 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=5.63 vs. 
limit=15.0 2023-10-12 21:01:47,134 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1179458.0, ans=0.125 2023-10-12 21:01:55,043 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1179504.6666666667, ans=0.1 2023-10-12 21:01:57,027 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1179504.6666666667, ans=0.0 2023-10-12 21:02:01,185 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1179551.3333333333, ans=0.0 2023-10-12 21:02:06,125 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1179551.3333333333, ans=0.125 2023-10-12 21:02:11,114 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1179598.0, ans=0.1 2023-10-12 21:02:15,099 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1179598.0, ans=0.1 2023-10-12 21:02:17,163 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1179598.0, ans=0.0 2023-10-12 21:02:20,934 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1179644.6666666667, ans=0.2 2023-10-12 21:02:20,953 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1179644.6666666667, ans=0.0 2023-10-12 21:02:33,849 INFO [train.py:1031] (3/4) Epoch 19, batch 7000, loss[loss=0.2013, simple_loss=0.2651, pruned_loss=0.06872, over 12297.00 frames. ], tot_loss[loss=0.1925, simple_loss=0.2836, pruned_loss=0.05068, over 31832682.61 frames. ], batch size: 440, lr: 1.83e-03, grad_scale: 16.0 2023-10-12 21:02:37,962 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1179691.3333333333, ans=0.125 2023-10-12 21:02:41,887 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.01 vs. 
limit=15.0 2023-10-12 21:03:04,832 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.445e+02 1.792e+02 1.888e+02 2.090e+02 4.225e+02, threshold=3.776e+02, percent-clipped=1.0 2023-10-12 21:03:06,038 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1179784.6666666667, ans=0.125 2023-10-12 21:03:16,935 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1179831.3333333333, ans=0.04949747468305833 2023-10-12 21:03:23,092 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1179878.0, ans=0.0 2023-10-12 21:03:27,605 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1179878.0, ans=0.125 2023-10-12 21:03:31,060 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1179878.0, ans=0.0 2023-10-12 21:03:43,986 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1179924.6666666667, ans=0.0 2023-10-12 21:04:05,589 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1180018.0, ans=0.125 2023-10-12 21:04:07,226 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1180018.0, ans=0.125 2023-10-12 21:04:30,727 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1180111.3333333333, ans=0.125 2023-10-12 21:04:38,305 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1180158.0, ans=0.125 2023-10-12 21:04:52,371 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1180204.6666666667, ans=0.0 2023-10-12 21:04:53,782 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1180251.3333333333, ans=0.125 2023-10-12 21:04:56,306 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.544e+02 1.799e+02 1.960e+02 2.183e+02 2.964e+02, threshold=3.920e+02, percent-clipped=0.0 2023-10-12 21:05:04,265 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1180251.3333333333, ans=0.125 2023-10-12 21:05:04,289 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1180251.3333333333, ans=0.125 2023-10-12 21:05:07,662 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1180298.0, ans=0.2 2023-10-12 21:05:15,425 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1180298.0, ans=0.1 2023-10-12 21:05:16,444 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.06 vs. 
limit=15.0 2023-10-12 21:05:43,025 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1180438.0, ans=0.0 2023-10-12 21:05:54,088 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1180484.6666666667, ans=0.125 2023-10-12 21:06:05,536 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.17 vs. limit=15.0 2023-10-12 21:06:12,456 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 21:06:14,944 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1180578.0, ans=0.125 2023-10-12 21:06:35,062 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1180624.6666666667, ans=0.0 2023-10-12 21:06:35,123 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1180624.6666666667, ans=0.125 2023-10-12 21:06:42,319 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1180671.3333333333, ans=0.125 2023-10-12 21:06:43,951 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 21:07:02,591 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.465e+02 1.768e+02 1.907e+02 2.122e+02 2.783e+02, threshold=3.814e+02, percent-clipped=0.0 2023-10-12 21:07:53,317 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1180904.6666666667, ans=0.0 2023-10-12 21:08:03,395 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.01 vs. limit=12.0 2023-10-12 21:08:03,946 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1180951.3333333333, ans=0.0 2023-10-12 21:08:09,916 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.97 vs. limit=6.0 2023-10-12 21:08:21,763 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.min_positive, batch_count=1180998.0, ans=0.025 2023-10-12 21:08:22,172 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.10 vs. 
limit=15.0 2023-10-12 21:08:32,052 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1181044.6666666667, ans=0.0 2023-10-12 21:09:08,475 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.483e+02 1.740e+02 1.906e+02 2.170e+02 3.636e+02, threshold=3.811e+02, percent-clipped=0.0 2023-10-12 21:09:13,421 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1181184.6666666667, ans=0.125 2023-10-12 21:09:21,419 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1181231.3333333333, ans=0.2 2023-10-12 21:09:47,734 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.22 vs. limit=12.0 2023-10-12 21:09:56,052 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1181371.3333333333, ans=0.125 2023-10-12 21:10:10,971 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1181418.0, ans=0.1 2023-10-12 21:10:15,676 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1181464.6666666667, ans=0.1 2023-10-12 21:10:19,333 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1181464.6666666667, ans=0.0 2023-10-12 21:10:20,966 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1181464.6666666667, ans=0.125 2023-10-12 21:10:27,688 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1181511.3333333333, ans=0.0 2023-10-12 21:10:52,993 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1181604.6666666667, ans=0.125 2023-10-12 21:10:58,805 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.488e+02 1.810e+02 2.042e+02 2.391e+02 4.638e+02, threshold=4.083e+02, percent-clipped=1.0 2023-10-12 21:11:00,081 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1181651.3333333333, ans=0.0 2023-10-12 21:11:04,687 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1181651.3333333333, ans=0.0 2023-10-12 21:11:07,510 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1181698.0, ans=0.1 2023-10-12 21:11:13,371 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1181698.0, ans=0.0 2023-10-12 21:11:14,324 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1181698.0, ans=0.1 2023-10-12 21:11:17,442 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=7.48 vs. limit=22.5 2023-10-12 21:11:25,752 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.09 vs. 
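The ScheduledFloat records that dominate this log print the current value (ans) of a batch-count-dependent hyperparameter: by batch_count around 1.18e+06 the skip rates (attention_skip_rate, conv_skip_rate, ff*_skip_rate) have annealed to 0.0, the dropout_p values sit at 0.1, and the balancer probs at 0.125. A piecewise-linear schedule over batch_count reproduces this behaviour; the sketch below is an assumed mechanism for illustration, not the scaling.py code.

```python
import bisect


class ScheduledFloatSketch:
    """A float piecewise-linearly interpolated over batch_count,
    e.g. a skip rate that decays to 0.0 early in training and stays
    there. Illustrative only."""

    def __init__(self, *points):
        # points: (batch_count, value) pairs, sorted by batch_count
        self.xs = [float(x) for x, _ in points]
        self.ys = [float(y) for _, y in points]

    def value(self, batch_count: float) -> float:
        if batch_count <= self.xs[0]:
            return self.ys[0]
        if batch_count >= self.xs[-1]:
            return self.ys[-1]
        i = bisect.bisect_right(self.xs, batch_count)
        x0, x1 = self.xs[i - 1], self.xs[i]
        y0, y1 = self.ys[i - 1], self.ys[i]
        t = (batch_count - x0) / (x1 - x0)
        return y0 + t * (y1 - y0)


# A skip rate fully annealed long before the batch counts logged here
# (the breakpoints are made up for the example):
skip = ScheduledFloatSketch((0.0, 0.5), (4000.0, 0.05), (50000.0, 0.0))
assert skip.value(1180018.0) == 0.0
```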
limit=12.0 2023-10-12 21:11:37,353 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1181791.3333333333, ans=0.125 2023-10-12 21:11:37,563 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.53 vs. limit=15.0 2023-10-12 21:11:42,169 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1181838.0, ans=0.2 2023-10-12 21:11:55,149 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1181884.6666666667, ans=0.1 2023-10-12 21:12:11,560 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1181978.0, ans=0.125 2023-10-12 21:12:22,865 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.16 vs. limit=15.0 2023-10-12 21:12:24,595 INFO [train.py:1031] (3/4) Epoch 19, batch 7500, loss[loss=0.1978, simple_loss=0.2891, pruned_loss=0.05328, over 16874.00 frames. ], tot_loss[loss=0.1924, simple_loss=0.2834, pruned_loss=0.05074, over 32019571.73 frames. ], batch size: 130, lr: 1.82e-03, grad_scale: 16.0 2023-10-12 21:12:32,118 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1182024.6666666667, ans=0.95 2023-10-12 21:12:50,319 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.476e+02 1.780e+02 1.931e+02 2.112e+02 4.330e+02, threshold=3.863e+02, percent-clipped=1.0 2023-10-12 21:13:01,788 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.86 vs. limit=22.5 2023-10-12 21:13:17,117 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1182211.3333333333, ans=0.0 2023-10-12 21:13:20,844 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1182258.0, ans=0.2 2023-10-12 21:13:41,368 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1182304.6666666667, ans=0.0 2023-10-12 21:13:41,598 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.30 vs. 
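The train.py batch records carry two loss groups: the current batch's losses and tot_loss, a frame-weighted running aggregate, which is why the frame counts climb monotonically (32.0M, 32.2M, 32.3M, ... frames) across the records in this span. The component losses are consistent with a weighted sum loss = 0.5 * simple_loss + pruned_loss (batch 7500 above: 0.5 * 0.2891 + 0.05328 = 0.1978; batch 8000: 0.5 * 0.2675 + 0.03953 = 0.1733). A sketch of that bookkeeping follows; the 0.5 scale is read off the logged numbers, not taken from the recipe's options.

```python
class FrameWeightedLoss:
    """Running frame-weighted aggregate matching the
    `tot_loss=... over N frames` pattern in the train.py records."""

    def __init__(self):
        self.loss_sum = 0.0
        self.frames = 0.0

    def update(self, batch_loss: float, num_frames: float) -> None:
        self.loss_sum += batch_loss * num_frames
        self.frames += num_frames

    @property
    def average(self) -> float:
        return self.loss_sum / self.frames


def combine(simple_loss: float, pruned_loss: float,
            simple_scale: float = 0.5) -> float:
    # Weighted sum inferred from the logged values above.
    return simple_scale * simple_loss + pruned_loss


tot = FrameWeightedLoss()
tot.update(combine(0.2891, 0.05328), 16874.0)  # batch 7500's numbers
print(f"tot_loss={tot.average:.4f} over {tot.frames} frames")
```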
limit=15.0 2023-10-12 21:14:00,428 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1182398.0, ans=0.125 2023-10-12 21:14:02,832 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1182398.0, ans=0.05 2023-10-12 21:14:06,370 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1182444.6666666667, ans=0.0 2023-10-12 21:14:06,432 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1182444.6666666667, ans=0.0 2023-10-12 21:14:27,420 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1182491.3333333333, ans=0.125 2023-10-12 21:14:28,695 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1182538.0, ans=0.1 2023-10-12 21:14:44,095 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.404e+02 1.649e+02 1.835e+02 2.025e+02 2.612e+02, threshold=3.669e+02, percent-clipped=0.0 2023-10-12 21:15:12,913 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1182678.0, ans=0.125 2023-10-12 21:15:17,235 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1182678.0, ans=0.125 2023-10-12 21:15:17,529 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.32 vs. limit=15.0 2023-10-12 21:15:18,877 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.32 vs. 
limit=10.0 2023-10-12 21:15:31,473 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1182724.6666666667, ans=0.0 2023-10-12 21:15:38,970 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1182771.3333333333, ans=0.0 2023-10-12 21:15:41,768 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1182771.3333333333, ans=0.0 2023-10-12 21:16:06,712 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1182864.6666666667, ans=0.0 2023-10-12 21:16:54,153 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.399e+02 1.730e+02 1.909e+02 2.159e+02 3.204e+02, threshold=3.817e+02, percent-clipped=0.0 2023-10-12 21:16:55,881 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1183051.3333333333, ans=0.125 2023-10-12 21:17:05,751 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1183098.0, ans=0.0 2023-10-12 21:17:22,113 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1183144.6666666667, ans=0.2 2023-10-12 21:17:29,416 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1183191.3333333333, ans=0.1 2023-10-12 21:17:31,735 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.59 vs. limit=15.0 2023-10-12 21:18:00,794 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1183331.3333333333, ans=0.125 2023-10-12 21:18:26,827 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.18 vs. limit=15.0 2023-10-12 21:18:42,700 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1183471.3333333333, ans=0.1 2023-10-12 21:18:49,327 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.459e+02 1.877e+02 2.039e+02 2.225e+02 2.841e+02, threshold=4.078e+02, percent-clipped=0.0 2023-10-12 21:18:56,133 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.92 vs. limit=15.0 2023-10-12 21:19:20,887 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1183611.3333333333, ans=0.0 2023-10-12 21:19:53,670 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.39 vs. 
limit=15.0 2023-10-12 21:19:57,902 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1183751.3333333333, ans=0.1 2023-10-12 21:20:02,442 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1183798.0, ans=0.125 2023-10-12 21:20:07,960 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1183798.0, ans=0.0 2023-10-12 21:20:12,766 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.70 vs. limit=15.0 2023-10-12 21:20:20,979 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1183844.6666666667, ans=0.125 2023-10-12 21:20:27,687 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1183891.3333333333, ans=0.09899494936611666 2023-10-12 21:20:35,931 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1183938.0, ans=0.125 2023-10-12 21:20:49,710 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.444e+02 1.648e+02 1.800e+02 1.959e+02 2.635e+02, threshold=3.600e+02, percent-clipped=0.0 2023-10-12 21:20:59,265 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.94 vs. limit=6.0 2023-10-12 21:21:15,410 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1184078.0, ans=0.1 2023-10-12 21:21:20,793 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1184078.0, ans=0.1 2023-10-12 21:21:21,836 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.17 vs. limit=22.5 2023-10-12 21:21:48,752 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1184218.0, ans=0.125 2023-10-12 21:21:59,699 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1184264.6666666667, ans=0.125 2023-10-12 21:22:05,505 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1184264.6666666667, ans=0.125 2023-10-12 21:22:09,393 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1184264.6666666667, ans=0.1 2023-10-12 21:22:21,530 INFO [train.py:1031] (3/4) Epoch 19, batch 8000, loss[loss=0.1733, simple_loss=0.2675, pruned_loss=0.03953, over 16279.00 frames. ], tot_loss[loss=0.1916, simple_loss=0.2828, pruned_loss=0.05023, over 32196702.89 frames. ], batch size: 50, lr: 1.82e-03, grad_scale: 32.0 2023-10-12 21:22:38,700 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.69 vs. 
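The Whitening records compare a measured statistic of some activation against a limit (the limit can itself be scheduled; see the whitening_limit entries elsewhere in the log). Most records in this span sit well under their limit, with occasional excursions above it (e.g. metric=16.47 vs. limit=15.0 further down). One plausible reading of the metric, assumed here purely for illustration, is a covariance-spread ratio mean(eig^2) / mean(eig)^2 over the eigenvalues of the per-group feature covariance: it equals 1.0 for perfectly whitened features and grows as variance concentrates in a few directions.

```python
import torch


def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    """Covariance-spread ratio mean(eig^2) / mean(eig)^2, computed per
    channel group and averaged. An illustrative stand-in for the
    statistic printed in the logs, not the scaling.py definition."""
    num_frames, num_channels = x.shape
    assert num_channels % num_groups == 0
    x = x.reshape(num_frames, num_groups, num_channels // num_groups)
    x = x - x.mean(dim=0, keepdim=True)            # zero-mean per channel
    # covariance per group, shape (g, c, c)
    cov = torch.einsum("ngi,ngj->gij", x, x) / num_frames
    d = cov.shape[-1]
    trace = cov.diagonal(dim1=-2, dim2=-1).sum(-1)  # sum of eigenvalues
    trace_sq = (cov * cov).sum(dim=(-2, -1))        # sum of eigenvalues^2
    metric = (trace_sq * d) / (trace * trace)       # mean(l^2)/mean(l)^2
    return metric.mean().item()


# White noise scores close to 1.0, far below the logged limits:
print(whitening_metric(torch.randn(4000, 384), num_groups=1))
```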
limit=10.0 2023-10-12 21:22:46,278 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.333e+02 1.624e+02 1.821e+02 2.030e+02 3.131e+02, threshold=3.642e+02, percent-clipped=0.0 2023-10-12 21:23:05,976 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1184544.6666666667, ans=0.125 2023-10-12 21:23:21,509 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.44 vs. limit=15.0 2023-10-12 21:23:24,924 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1184591.3333333333, ans=0.125 2023-10-12 21:23:27,873 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1184638.0, ans=0.2 2023-10-12 21:23:36,030 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.63 vs. limit=15.0 2023-10-12 21:23:44,326 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1184684.6666666667, ans=0.2 2023-10-12 21:23:46,229 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1184731.3333333333, ans=0.125 2023-10-12 21:24:01,901 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1184778.0, ans=0.0 2023-10-12 21:24:25,703 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.43 vs. limit=22.5 2023-10-12 21:24:27,865 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1184918.0, ans=0.2 2023-10-12 21:24:27,889 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1184918.0, ans=0.05 2023-10-12 21:24:28,900 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1184918.0, ans=0.1 2023-10-12 21:24:32,796 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.404e+02 1.700e+02 1.983e+02 2.371e+02 3.151e+02, threshold=3.965e+02, percent-clipped=0.0 2023-10-12 21:24:39,124 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1184964.6666666667, ans=0.125 2023-10-12 21:24:40,156 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1184964.6666666667, ans=0.0 2023-10-12 21:25:04,625 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1185011.3333333333, ans=0.0 2023-10-12 21:25:19,153 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1185058.0, ans=0.125 2023-10-12 21:25:45,606 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.94 vs. 
limit=15.0 2023-10-12 21:26:15,031 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 21:26:17,915 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.96 vs. limit=15.0 2023-10-12 21:26:24,948 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1185291.3333333333, ans=0.0 2023-10-12 21:26:47,006 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.481e+02 1.666e+02 1.873e+02 2.016e+02 2.756e+02, threshold=3.747e+02, percent-clipped=0.0 2023-10-12 21:26:54,040 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.59 vs. limit=15.0 2023-10-12 21:26:56,533 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1185431.3333333333, ans=0.2 2023-10-12 21:26:56,582 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1185431.3333333333, ans=0.125 2023-10-12 21:26:59,533 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.57 vs. limit=22.5 2023-10-12 21:27:08,250 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1185478.0, ans=0.125 2023-10-12 21:27:34,375 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 21:27:57,981 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1185664.6666666667, ans=0.125 2023-10-12 21:27:58,978 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff2.min_abs, batch_count=1185664.6666666667, ans=0.1 2023-10-12 21:28:08,460 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1185711.3333333333, ans=0.125 2023-10-12 21:28:11,928 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1185711.3333333333, ans=0.125 2023-10-12 21:28:19,740 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1185758.0, ans=0.125 2023-10-12 21:28:34,957 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten.whitening_limit, batch_count=1185804.6666666667, ans=15.0 2023-10-12 21:28:46,377 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.391e+02 1.740e+02 1.900e+02 2.254e+02 3.034e+02, threshold=3.800e+02, percent-clipped=0.0 2023-10-12 21:28:47,683 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer_ff3.min_abs, batch_count=1185851.3333333333, ans=0.2 2023-10-12 21:28:49,441 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1185851.3333333333, ans=0.2 2023-10-12 21:29:01,396 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.82 vs. 
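The WithLoss records track an auxiliary penalty attached to the attention weights and report its accumulated value; loss-sum=0.000e+00 throughout this span suggests the penalized condition is simply not being triggered. A generic pattern for attaching such a penalty without changing the forward value is sketched below, assumed for illustration only (not the scaling.py implementation): the penalty's gradient is injected during backward.

```python
import torch


class WithLossSketch(torch.autograd.Function):
    """Pass x through unchanged in forward; add the gradient of an
    auxiliary penalty on x during backward. Generic illustration of
    the pattern behind the WithLoss records."""

    @staticmethod
    def forward(ctx, x, aux_loss_fn):
        ctx.save_for_backward(x)
        ctx.aux_loss_fn = aux_loss_fn
        return x

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        with torch.enable_grad():
            x = x.detach().requires_grad_(True)
            aux = ctx.aux_loss_fn(x)  # 0.0 when weights are in range
            (aux_grad,) = torch.autograd.grad(aux, x)
        return grad_out + aux_grad, None


# Penalize magnitudes above 1.0; with in-range inputs the auxiliary
# loss sums to 0, matching the loss-sum=0.000e+00 records:
x = torch.randn(8, 16, requires_grad=True)
y = WithLossSketch.apply(x, lambda t: (t.abs() - 1.0).clamp(min=0).sum())
y.sum().backward()
```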
limit=15.0 2023-10-12 21:29:16,458 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1185991.3333333333, ans=0.125 2023-10-12 21:29:22,465 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1185991.3333333333, ans=0.0 2023-10-12 21:29:24,308 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1185991.3333333333, ans=0.1 2023-10-12 21:29:55,017 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1186131.3333333333, ans=0.0 2023-10-12 21:30:28,906 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1186271.3333333333, ans=0.5 2023-10-12 21:30:33,115 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.01 vs. limit=15.0 2023-10-12 21:30:41,740 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.481e+02 1.739e+02 1.926e+02 2.125e+02 2.852e+02, threshold=3.852e+02, percent-clipped=0.0 2023-10-12 21:31:04,480 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1186411.3333333333, ans=0.125 2023-10-12 21:31:06,481 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1186411.3333333333, ans=0.125 2023-10-12 21:31:13,489 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1186458.0, ans=0.2 2023-10-12 21:31:37,309 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1186551.3333333333, ans=0.09899494936611666 2023-10-12 21:31:46,084 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1186598.0, ans=0.0 2023-10-12 21:32:02,074 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.49 vs. limit=8.0 2023-10-12 21:32:12,910 INFO [train.py:1031] (3/4) Epoch 19, batch 8500, loss[loss=0.1892, simple_loss=0.279, pruned_loss=0.04968, over 16170.00 frames. ], tot_loss[loss=0.1918, simple_loss=0.2831, pruned_loss=0.05023, over 32325287.72 frames. ], batch size: 43, lr: 1.82e-03, grad_scale: 16.0 2023-10-12 21:32:17,319 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1186691.3333333333, ans=0.125 2023-10-12 21:32:20,839 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1186691.3333333333, ans=0.0 2023-10-12 21:32:38,436 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.478e+02 1.814e+02 1.957e+02 2.179e+02 2.910e+02, threshold=3.913e+02, percent-clipped=0.0 2023-10-12 21:32:39,017 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1186784.6666666667, ans=0.125 2023-10-12 21:32:50,660 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.62 vs. 
limit=15.0 2023-10-12 21:32:56,749 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1186878.0, ans=0.125 2023-10-12 21:32:58,909 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1186878.0, ans=0.125 2023-10-12 21:33:02,719 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1186878.0, ans=0.0 2023-10-12 21:33:03,731 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1186878.0, ans=0.0 2023-10-12 21:33:08,387 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.80 vs. limit=15.0 2023-10-12 21:33:43,675 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.76 vs. limit=15.0 2023-10-12 21:33:52,956 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.47 vs. limit=15.0 2023-10-12 21:34:15,892 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1187158.0, ans=0.0 2023-10-12 21:34:18,761 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1187158.0, ans=0.125 2023-10-12 21:34:30,291 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1187204.6666666667, ans=0.0 2023-10-12 21:34:32,805 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1187204.6666666667, ans=0.125 2023-10-12 21:34:43,630 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.394e+02 1.771e+02 1.970e+02 2.465e+02 3.662e+02, threshold=3.941e+02, percent-clipped=0.0 2023-10-12 21:34:43,977 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1187251.3333333333, ans=0.125 2023-10-12 21:34:48,856 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.64 vs. limit=15.0 2023-10-12 21:34:50,961 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.84 vs. limit=15.0 2023-10-12 21:35:14,292 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1187391.3333333333, ans=0.125 2023-10-12 21:35:17,745 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1187391.3333333333, ans=0.0 2023-10-12 21:35:39,474 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1187484.6666666667, ans=0.125 2023-10-12 21:35:47,212 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.67 vs. 
limit=15.0 2023-10-12 21:35:55,986 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1187531.3333333333, ans=0.125 2023-10-12 21:35:59,751 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.21 vs. limit=15.0 2023-10-12 21:36:00,898 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.97 vs. limit=15.0 2023-10-12 21:36:10,343 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.84 vs. limit=15.0 2023-10-12 21:36:26,677 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.76 vs. limit=15.0 2023-10-12 21:36:37,495 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 21:36:39,126 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.38 vs. limit=15.0 2023-10-12 21:36:47,092 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.420e+02 1.704e+02 1.883e+02 2.095e+02 2.945e+02, threshold=3.767e+02, percent-clipped=0.0 2023-10-12 21:36:58,365 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1187764.6666666667, ans=0.0 2023-10-12 21:37:15,980 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1187858.0, ans=0.0 2023-10-12 21:37:46,429 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1187951.3333333333, ans=0.125 2023-10-12 21:38:07,347 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1188044.6666666667, ans=0.0 2023-10-12 21:38:11,933 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1188044.6666666667, ans=0.0 2023-10-12 21:38:21,453 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1188091.3333333333, ans=0.125 2023-10-12 21:38:34,045 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1188138.0, ans=0.09899494936611666 2023-10-12 21:38:42,787 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1188184.6666666667, ans=0.125 2023-10-12 21:38:45,535 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.492e+02 1.685e+02 1.965e+02 2.209e+02 3.271e+02, threshold=3.930e+02, percent-clipped=0.0 2023-10-12 21:38:45,747 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1188184.6666666667, ans=0.125 2023-10-12 21:38:52,913 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1188231.3333333333, ans=0.0 2023-10-12 21:38:55,662 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1188231.3333333333, ans=0.125 2023-10-12 21:38:56,588 INFO 
[scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1188231.3333333333, ans=0.1 2023-10-12 21:39:02,985 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1188278.0, ans=0.125 2023-10-12 21:39:04,694 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1188278.0, ans=0.125 2023-10-12 21:39:23,024 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.64 vs. limit=15.0 2023-10-12 21:39:25,340 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1188371.3333333333, ans=0.2 2023-10-12 21:39:31,238 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.85 vs. limit=15.0 2023-10-12 21:39:33,995 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1188418.0, ans=0.125 2023-10-12 21:39:46,095 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=16.47 vs. limit=15.0 2023-10-12 21:39:50,993 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1188464.6666666667, ans=0.125 2023-10-12 21:40:21,304 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1188604.6666666667, ans=0.125 2023-10-12 21:40:36,055 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.512e+02 1.747e+02 1.952e+02 2.143e+02 2.989e+02, threshold=3.905e+02, percent-clipped=0.0 2023-10-12 21:41:00,961 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1188744.6666666667, ans=0.2 2023-10-12 21:41:30,963 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1188884.6666666667, ans=0.1 2023-10-12 21:41:40,823 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.18 vs. limit=15.0 2023-10-12 21:41:45,976 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.36 vs. limit=15.0 2023-10-12 21:41:51,488 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1188931.3333333333, ans=0.125 2023-10-12 21:42:03,379 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.45 vs. limit=15.0 2023-10-12 21:42:05,734 INFO [train.py:1031] (3/4) Epoch 19, batch 9000, loss[loss=0.2041, simple_loss=0.3046, pruned_loss=0.05179, over 16494.00 frames. ], tot_loss[loss=0.1912, simple_loss=0.2824, pruned_loss=0.05004, over 32426505.47 frames. 
], batch size: 266, lr: 1.82e-03, grad_scale: 8.0 2023-10-12 21:42:10,207 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff3.min_abs, batch_count=1189024.6666666667, ans=0.2 2023-10-12 21:42:14,242 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1189024.6666666667, ans=0.2 2023-10-12 21:42:19,526 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1189071.3333333333, ans=0.0 2023-10-12 21:42:33,960 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.448e+02 1.814e+02 1.993e+02 2.342e+02 3.365e+02, threshold=3.986e+02, percent-clipped=0.0 2023-10-12 21:42:53,111 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1189211.3333333333, ans=0.125 2023-10-12 21:43:03,168 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1189258.0, ans=0.125 2023-10-12 21:43:14,230 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1189304.6666666667, ans=0.0 2023-10-12 21:43:30,808 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1189351.3333333333, ans=0.0 2023-10-12 21:43:31,105 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=6.98 vs. limit=15.0 2023-10-12 21:43:35,778 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=1189398.0, ans=10.0 2023-10-12 21:43:41,181 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.21 vs. limit=15.0 2023-10-12 21:43:46,684 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1189444.6666666667, ans=0.0 2023-10-12 21:43:52,609 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1189444.6666666667, ans=0.0 2023-10-12 21:43:53,617 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.10 vs. 
limit=15.0 2023-10-12 21:44:03,138 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1189491.3333333333, ans=0.125 2023-10-12 21:44:14,631 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1189584.6666666667, ans=0.1 2023-10-12 21:44:17,237 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1189584.6666666667, ans=0.0 2023-10-12 21:44:21,592 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.403e+02 1.734e+02 1.874e+02 2.121e+02 2.518e+02, threshold=3.749e+02, percent-clipped=0.0 2023-10-12 21:44:54,543 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1189724.6666666667, ans=0.09899494936611666 2023-10-12 21:45:17,680 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1189818.0, ans=0.125 2023-10-12 21:45:18,571 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1189818.0, ans=0.125 2023-10-12 21:45:20,795 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.35 vs. limit=6.0 2023-10-12 21:45:48,718 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1189958.0, ans=0.1 2023-10-12 21:45:48,792 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1189958.0, ans=0.125 2023-10-12 21:45:49,621 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1189958.0, ans=0.125 2023-10-12 21:46:05,747 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1190051.3333333333, ans=0.0 2023-10-12 21:46:06,127 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.57 vs. limit=15.0 2023-10-12 21:46:10,376 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=3.64 vs. limit=15.0 2023-10-12 21:46:10,575 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.528e+02 1.782e+02 1.982e+02 2.351e+02 3.767e+02, threshold=3.965e+02, percent-clipped=1.0 2023-10-12 21:46:12,386 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1190051.3333333333, ans=0.2 2023-10-12 21:46:14,321 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1190098.0, ans=0.07 2023-10-12 21:46:22,868 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1190098.0, ans=0.125 2023-10-12 21:46:34,225 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.09 vs. 
limit=15.0 2023-10-12 21:46:38,952 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1190191.3333333333, ans=0.0 2023-10-12 21:47:17,280 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1190331.3333333333, ans=0.0 2023-10-12 21:47:19,369 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1190378.0, ans=0.125 2023-10-12 21:47:21,034 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1190378.0, ans=0.0 2023-10-12 21:47:39,987 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1190471.3333333333, ans=0.0 2023-10-12 21:47:47,565 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1190471.3333333333, ans=0.5 2023-10-12 21:47:58,965 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.504e+02 1.803e+02 1.973e+02 2.178e+02 3.103e+02, threshold=3.946e+02, percent-clipped=0.0 2023-10-12 21:48:12,233 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1190564.6666666667, ans=0.125 2023-10-12 21:48:18,903 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=11.68 vs. limit=22.5 2023-10-12 21:48:32,440 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1190658.0, ans=0.125 2023-10-12 21:48:58,245 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1190704.6666666667, ans=0.0 2023-10-12 21:49:05,815 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1190751.3333333333, ans=0.125 2023-10-12 21:49:22,269 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1190798.0, ans=0.125 2023-10-12 21:49:23,200 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1190844.6666666667, ans=0.07 2023-10-12 21:49:30,827 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.57 vs. 
limit=15.0 2023-10-12 21:49:51,163 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-12 21:49:58,460 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1190938.0, ans=0.2 2023-10-12 21:50:00,307 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1190984.6666666667, ans=0.1 2023-10-12 21:50:08,354 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.524e+02 1.786e+02 1.969e+02 2.151e+02 3.515e+02, threshold=3.938e+02, percent-clipped=0.0 2023-10-12 21:50:14,031 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1191031.3333333333, ans=0.125 2023-10-12 21:50:21,136 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.64 vs. limit=15.0 2023-10-12 21:50:34,724 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1191078.0, ans=0.1 2023-10-12 21:50:35,027 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.42 vs. limit=10.0 2023-10-12 21:50:36,131 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.76 vs. limit=6.0 2023-10-12 21:50:39,067 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1191124.6666666667, ans=0.0 2023-10-12 21:50:43,742 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1191124.6666666667, ans=0.125 2023-10-12 21:51:01,042 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1191218.0, ans=0.125 2023-10-12 21:51:06,751 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1191218.0, ans=0.0 2023-10-12 21:51:27,296 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1191311.3333333333, ans=0.2 2023-10-12 21:51:31,593 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.88 vs. limit=15.0 2023-10-12 21:51:33,270 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.82 vs. limit=15.0 2023-10-12 21:51:34,869 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 21:51:39,903 INFO [train.py:1031] (3/4) Epoch 19, batch 9500, loss[loss=0.189, simple_loss=0.2872, pruned_loss=0.0454, over 16815.00 frames. ], tot_loss[loss=0.192, simple_loss=0.2832, pruned_loss=0.05044, over 32476573.55 frames. 
], batch size: 98, lr: 1.82e-03, grad_scale: 8.0 2023-10-12 21:52:07,832 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1191451.3333333333, ans=0.1 2023-10-12 21:52:09,463 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.561e+02 1.745e+02 1.884e+02 2.005e+02 2.558e+02, threshold=3.768e+02, percent-clipped=0.0 2023-10-12 21:52:30,182 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1191544.6666666667, ans=0.125 2023-10-12 21:52:41,497 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1191591.3333333333, ans=0.125 2023-10-12 21:52:53,049 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1191638.0, ans=0.125 2023-10-12 21:52:53,214 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1191638.0, ans=0.125 2023-10-12 21:52:57,672 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1191684.6666666667, ans=0.0 2023-10-12 21:53:50,962 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1191871.3333333333, ans=0.0 2023-10-12 21:53:58,835 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1191918.0, ans=0.125 2023-10-12 21:54:07,275 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.543e+02 1.812e+02 1.991e+02 2.284e+02 3.020e+02, threshold=3.982e+02, percent-clipped=0.0 2023-10-12 21:54:11,541 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.39 vs. limit=15.0 2023-10-12 21:54:27,623 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1192011.3333333333, ans=0.0 2023-10-12 21:54:28,778 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1192011.3333333333, ans=0.125 2023-10-12 21:54:32,460 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=4.16 vs. limit=12.0 2023-10-12 21:54:35,186 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.06 vs. 
limit=22.5 2023-10-12 21:54:52,912 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff2.min_abs, batch_count=1192104.6666666667, ans=0.1 2023-10-12 21:55:09,137 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1192198.0, ans=0.035 2023-10-12 21:55:16,839 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1192198.0, ans=0.125 2023-10-12 21:55:24,839 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1192244.6666666667, ans=0.125 2023-10-12 21:55:33,456 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1192291.3333333333, ans=0.0 2023-10-12 21:55:40,061 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1192291.3333333333, ans=0.07 2023-10-12 21:55:46,154 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1192291.3333333333, ans=0.1 2023-10-12 21:55:58,588 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1192384.6666666667, ans=0.5 2023-10-12 21:55:59,357 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-12 21:56:05,135 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.376e+02 1.832e+02 1.943e+02 2.178e+02 2.882e+02, threshold=3.886e+02, percent-clipped=0.0 2023-10-12 21:56:05,382 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1192384.6666666667, ans=0.0 2023-10-12 21:56:13,225 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1192431.3333333333, ans=0.125 2023-10-12 21:56:16,438 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1192431.3333333333, ans=0.1 2023-10-12 21:56:59,503 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1192618.0, ans=0.1 2023-10-12 21:57:07,402 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1192664.6666666667, ans=0.2 2023-10-12 21:57:13,174 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1192664.6666666667, ans=0.125 2023-10-12 21:57:22,652 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1192711.3333333333, ans=0.07 2023-10-12 21:57:34,993 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1192758.0, ans=0.125 2023-10-12 21:58:03,554 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.418e+02 1.759e+02 1.949e+02 2.211e+02 3.322e+02, threshold=3.899e+02, percent-clipped=0.0 2023-10-12 21:58:10,120 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1192898.0, ans=0.125 2023-10-12 21:58:12,011 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, 
batch_count=1192898.0, ans=0.0 2023-10-12 21:58:12,056 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1192898.0, ans=0.1 2023-10-12 21:59:06,080 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1193084.6666666667, ans=0.125 2023-10-12 22:00:02,611 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1193318.0, ans=0.125 2023-10-12 22:00:03,374 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.563e+02 1.787e+02 2.024e+02 2.272e+02 3.095e+02, threshold=4.047e+02, percent-clipped=0.0 2023-10-12 22:00:06,762 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1193364.6666666667, ans=0.0 2023-10-12 22:00:07,905 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=11.84 vs. limit=22.5 2023-10-12 22:00:17,517 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.88 vs. limit=15.0 2023-10-12 22:00:24,882 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1193411.3333333333, ans=0.2 2023-10-12 22:00:36,118 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1193458.0, ans=0.1 2023-10-12 22:00:43,869 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.68 vs. limit=15.0 2023-10-12 22:00:48,308 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1193504.6666666667, ans=0.125 2023-10-12 22:01:04,093 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.77 vs. limit=22.5 2023-10-12 22:01:17,473 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1193644.6666666667, ans=0.125 2023-10-12 22:01:18,603 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1193644.6666666667, ans=0.125 2023-10-12 22:01:23,709 INFO [train.py:1031] (3/4) Epoch 19, batch 10000, loss[loss=0.2304, simple_loss=0.3058, pruned_loss=0.07753, over 16049.00 frames. ], tot_loss[loss=0.1913, simple_loss=0.2823, pruned_loss=0.05012, over 32531557.90 frames. ], batch size: 297, lr: 1.81e-03, grad_scale: 32.0 2023-10-12 22:01:36,185 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.08 vs. 
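The grad_scale field of the batch records moves between 8.0 and 32.0 over this span (16.0 at batch 7500, 32.0 at 8000, 16.0 at 8500, 8.0 at 9000 and 9500, 32.0 in the batch 10000 record above), the signature of dynamic fp16 loss scaling: the scale is halved when scaled gradients overflow and grown again after a run of finite steps. The standard PyTorch mechanism is sketched below with a made-up model and optimizer (assumes a CUDA device); this is the general AMP pattern, not the recipe's training loop.

```python
import torch

model = torch.nn.Linear(80, 500).cuda()
opt = torch.optim.SGD(model.parameters(), lr=0.045)
scaler = torch.cuda.amp.GradScaler()  # maintains the dynamic grad_scale

for _ in range(10):
    x = torch.randn(16, 80, device="cuda")
    with torch.cuda.amp.autocast():
        loss = model(x).square().mean()
    opt.zero_grad()
    scaler.scale(loss).backward()  # backward on loss * grad_scale
    scaler.step(opt)               # unscales; skips step on inf/nan grads
    scaler.update()                # halves scale on overflow, grows it later
    print("grad_scale:", scaler.get_scale())
```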
limit=15.0 2023-10-12 22:01:49,126 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1193784.6666666667, ans=0.125 2023-10-12 22:01:53,553 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1193784.6666666667, ans=0.125 2023-10-12 22:01:57,690 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.468e+02 1.714e+02 1.863e+02 2.060e+02 2.907e+02, threshold=3.726e+02, percent-clipped=0.0 2023-10-12 22:02:03,540 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1193831.3333333333, ans=0.2 2023-10-12 22:02:16,204 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1193878.0, ans=0.1 2023-10-12 22:02:20,053 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1193878.0, ans=0.2 2023-10-12 22:02:26,395 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1193924.6666666667, ans=0.0 2023-10-12 22:02:28,320 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1193924.6666666667, ans=0.0 2023-10-12 22:02:33,221 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.35 vs. limit=15.0 2023-10-12 22:03:06,980 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1194064.6666666667, ans=0.0 2023-10-12 22:03:19,066 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 22:04:04,322 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1194251.3333333333, ans=0.125 2023-10-12 22:04:05,069 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1194251.3333333333, ans=0.0 2023-10-12 22:04:05,776 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.498e+02 1.730e+02 1.894e+02 2.125e+02 4.373e+02, threshold=3.787e+02, percent-clipped=1.0 2023-10-12 22:04:24,804 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1194344.6666666667, ans=0.2 2023-10-12 22:04:24,849 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1194344.6666666667, ans=0.1 2023-10-12 22:04:40,534 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1194438.0, ans=0.1 2023-10-12 22:04:41,464 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1194438.0, ans=0.125 2023-10-12 22:05:11,821 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1194531.3333333333, ans=0.0 2023-10-12 22:05:12,290 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.43 vs. 
2023-10-12 22:05:12,290 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.43 vs. limit=6.0 2023-10-12 22:05:15,923 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1194578.0, ans=0.0 2023-10-12 22:05:48,788 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1194671.3333333333, ans=0.0 2023-10-12 22:05:49,936 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1194671.3333333333, ans=0.0 2023-10-12 22:05:58,483 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1194671.3333333333, ans=0.0 2023-10-12 22:06:12,494 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.473e+02 1.750e+02 1.894e+02 2.090e+02 2.837e+02, threshold=3.788e+02, percent-clipped=0.0 2023-10-12 22:06:14,361 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1194764.6666666667, ans=0.125 2023-10-12 22:06:21,325 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.28 vs. limit=15.0 2023-10-12 22:06:27,174 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1194811.3333333333, ans=0.2 2023-10-12 22:07:03,324 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.30 vs. limit=15.0 2023-10-12 22:07:04,795 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1194904.6666666667, ans=0.125 2023-10-12 22:07:05,687 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1194951.3333333333, ans=0.125 2023-10-12 22:07:17,422 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1194951.3333333333, ans=0.0 2023-10-12 22:07:23,281 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.33 vs. limit=15.0 2023-10-12 22:08:17,497 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=1195184.6666666667, ans=0.5 2023-10-12 22:08:20,388 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1195184.6666666667, ans=0.0 2023-10-12 22:08:20,963 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.490e+02 1.772e+02 1.928e+02 2.175e+02 2.736e+02, threshold=3.855e+02, percent-clipped=0.0 2023-10-12 22:08:25,348 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1195231.3333333333, ans=0.125 2023-10-12 22:08:32,948 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1195278.0, ans=0.125
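The [scaling.py:199] entries print the current value (ans=...) of named ScheduledFloat hyper-parameters as a function of batch_count; in icefall these are piecewise-linear schedules. A hedged sketch of that behaviour; the class name and the breakpoints in the example are illustrative, not the recipe's actual schedules:

    class PiecewiseSchedule:
        # ScheduledFloat-style value: linear between (batch_count, value) breakpoints,
        # clamped to the end values outside the breakpoint range.
        def __init__(self, *points):
            self.points = sorted(points)

        def __call__(self, batch_count: float) -> float:
            pts = self.points
            if batch_count <= pts[0][0]:
                return pts[0][1]
            if batch_count >= pts[-1][0]:
                return pts[-1][1]
            for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
                if batch_count <= x1:
                    return y0 + (batch_count - x0) / (x1 - x0) * (y1 - y0)

    # By batch_count ~1.19e6 every schedule has long since reached its final value,
    # which is why the ans= fields above are static (0.0, 0.125, 0.2, 0.5, ...):
    dropout_p = PiecewiseSchedule((0.0, 0.3), (20000.0, 0.1))  # illustrative breakpoints
    assert dropout_p(1194578.0) == 0.1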
2023-10-12 22:08:35,801 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.14 vs. limit=12.0 2023-10-12 22:08:41,679 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1195278.0, ans=0.125 2023-10-12 22:08:51,402 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1195324.6666666667, ans=0.125 2023-10-12 22:09:03,580 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1195371.3333333333, ans=0.0 2023-10-12 22:09:04,604 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1195371.3333333333, ans=0.1 2023-10-12 22:09:05,504 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1195371.3333333333, ans=0.125 2023-10-12 22:09:14,682 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 22:09:26,863 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1195464.6666666667, ans=0.125 2023-10-12 22:09:31,647 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.03 vs. limit=22.5 2023-10-12 22:09:50,735 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1195558.0, ans=0.09899494936611666 2023-10-12 22:10:06,028 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=1195604.6666666667, ans=10.0 2023-10-12 22:10:07,029 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1195604.6666666667, ans=0.0 2023-10-12 22:10:08,936 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1195604.6666666667, ans=0.125 2023-10-12 22:10:23,840 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.503e+02 1.759e+02 1.875e+02 2.077e+02 2.636e+02, threshold=3.750e+02, percent-clipped=0.0 2023-10-12 22:10:26,677 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1195698.0, ans=0.05 2023-10-12 22:10:38,699 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1195744.6666666667, ans=0.0 2023-10-12 22:10:40,982 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1195744.6666666667, ans=0.125 2023-10-12 22:10:42,059 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1195744.6666666667, ans=0.125 2023-10-12 22:11:05,586 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.55 vs. limit=10.0
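The [scaling.py:979] entries are diagnostics from the Whiten modules: a whitening metric measured on a module's output is compared against a scheduled limit (e.g. metric=14.03 vs. limit=22.5 above), and a corrective penalty is only applied when the metric exceeds the limit. The exact metric lives in scaling.py and is not reproduced here; the version below is an assumption, reconstructing it as the spread of the per-group covariance spectrum (1.0 for perfectly white features, larger when variance concentrates in few directions):

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
        # x: (num_frames, num_channels); num_channels must divide evenly by num_groups,
        # matching the num_groups=/num_channels= fields printed in the log.
        num_frames, num_channels = x.shape
        g = num_channels // num_groups
        xg = x.reshape(num_frames, num_groups, g).transpose(0, 1)  # (groups, frames, g)
        cov = xg.transpose(1, 2) @ xg / num_frames                 # per-group covariance
        eigs = torch.linalg.eigvalsh(cov)                          # (groups, g)
        # mean(eig^2) / mean(eig)^2 is 1.0 iff all eigenvalues are equal (white input).
        return (eigs.pow(2).mean(dim=1) / eigs.mean(dim=1).pow(2)).mean().item()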
2023-10-12 22:11:08,641 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.93 vs. limit=15.0 2023-10-12 22:11:34,665 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1195931.3333333333, ans=0.2 2023-10-12 22:11:43,314 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1195978.0, ans=0.125 2023-10-12 22:11:48,874 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.57 vs. limit=10.0 2023-10-12 22:11:52,313 INFO [train.py:1031] (3/4) Epoch 19, batch 10500, loss[loss=0.2176, simple_loss=0.293, pruned_loss=0.07113, over 15550.00 frames. ], tot_loss[loss=0.1917, simple_loss=0.283, pruned_loss=0.05024, over 32613605.99 frames. ], batch size: 350, lr: 1.81e-03, grad_scale: 16.0 2023-10-12 22:11:52,563 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1196024.6666666667, ans=0.125 2023-10-12 22:12:02,267 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.21 vs. limit=15.0 2023-10-12 22:12:33,966 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1196118.0, ans=0.2 2023-10-12 22:12:37,058 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.507e+02 1.776e+02 1.918e+02 2.187e+02 2.697e+02, threshold=3.836e+02, percent-clipped=0.0 2023-10-12 22:12:54,177 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1196211.3333333333, ans=0.0 2023-10-12 22:12:55,033 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1196211.3333333333, ans=0.125 2023-10-12 22:13:27,001 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1196304.6666666667, ans=0.125 2023-10-12 22:13:51,755 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1196398.0, ans=0.125 2023-10-12 22:13:53,762 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1196398.0, ans=0.125 2023-10-12 22:14:02,183 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.48 vs. limit=15.0 2023-10-12 22:14:06,527 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1196444.6666666667, ans=0.125 2023-10-12 22:14:22,422 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1196491.3333333333, ans=0.125 2023-10-12 22:14:29,455 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.62 vs. limit=15.0
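Each [train.py:1031] entry, like the Epoch 19, batch 10500 one above, reports the current batch's losses, a running tot_loss averaged over the frames seen so far (~32.6M here), the batch size in cuts, the learning rate, and the gradient-scaler value used for mixed-precision training (grad_scale). A small, hedged helper for pulling these summaries out of a saved log; the regex is written against the exact format shown in this file and the function name is ours, not icefall's:

    import re

    TRAIN_RE = re.compile(
        r"Epoch (?P<epoch>\d+), batch (?P<batch>\d+), .*?"
        r"tot_loss\[loss=(?P<loss>[\d.]+), simple_loss=(?P<simple_loss>[\d.]+), "
        r"pruned_loss=(?P<pruned_loss>[\d.]+), over (?P<frames>[\d.]+) frames\. \], "
        r"batch size: (?P<batch_size>\d+), lr: (?P<lr>[\d.e-]+)"
    )

    def parse_train_summaries(path: str):
        # Yields one dict per "Epoch ..., batch ..." summary in the log.
        with open(path) as f:
            for line in f:
                m = TRAIN_RE.search(line)
                if m:
                    yield {k: float(v) for k, v in m.groupdict().items()}

    # For the entry above this yields: epoch=19, batch=10500, loss=0.1917, lr=0.00181, ...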
2023-10-12 22:14:46,407 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.03 vs. limit=6.0 2023-10-12 22:14:51,875 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.391e+02 1.787e+02 1.984e+02 2.107e+02 3.024e+02, threshold=3.968e+02, percent-clipped=0.0 2023-10-12 22:14:53,316 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1196631.3333333333, ans=0.0 2023-10-12 22:15:18,797 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1196724.6666666667, ans=0.0 2023-10-12 22:15:33,487 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.34 vs. limit=12.0 2023-10-12 22:15:38,199 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-12 22:15:50,540 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1196818.0, ans=0.07 2023-10-12 22:15:53,326 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1196864.6666666667, ans=0.125 2023-10-12 22:16:14,724 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1196911.3333333333, ans=0.125 2023-10-12 22:16:21,462 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.30 vs. limit=10.0 2023-10-12 22:16:32,727 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1197004.6666666667, ans=0.125 2023-10-12 22:16:53,811 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.531e+02 1.888e+02 2.130e+02 2.470e+02 3.787e+02, threshold=4.261e+02, percent-clipped=0.0 2023-10-12 22:17:31,703 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff2.min_abs, batch_count=1197238.0, ans=0.1 2023-10-12 22:17:52,462 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1197284.6666666667, ans=0.0 2023-10-12 22:18:12,604 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1197378.0, ans=0.2 2023-10-12 22:18:13,786 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1197378.0, ans=0.125 2023-10-12 22:18:22,919 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.27 vs.
limit=10.0 2023-10-12 22:18:49,784 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.474e+02 1.842e+02 1.998e+02 2.167e+02 2.763e+02, threshold=3.996e+02, percent-clipped=0.0 2023-10-12 22:18:56,624 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1197564.6666666667, ans=0.0 2023-10-12 22:18:57,458 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1197564.6666666667, ans=0.125 2023-10-12 22:19:14,076 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1197658.0, ans=0.125 2023-10-12 22:19:22,969 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1197704.6666666667, ans=0.0 2023-10-12 22:19:42,296 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 22:19:43,450 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 22:19:49,537 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1197751.3333333333, ans=0.1 2023-10-12 22:19:53,739 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1197751.3333333333, ans=0.2 2023-10-12 22:20:19,023 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1197891.3333333333, ans=0.0 2023-10-12 22:20:24,848 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1197891.3333333333, ans=0.125 2023-10-12 22:20:45,538 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.max_abs, batch_count=1197984.6666666667, ans=10.0 2023-10-12 22:20:45,894 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.20 vs. limit=22.5 2023-10-12 22:20:50,458 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1197984.6666666667, ans=0.09899494936611666 2023-10-12 22:20:52,170 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1197984.6666666667, ans=0.1 2023-10-12 22:20:54,675 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.426e+02 1.639e+02 1.787e+02 1.929e+02 2.792e+02, threshold=3.573e+02, percent-clipped=0.0 2023-10-12 22:20:57,310 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.58 vs. limit=10.0 2023-10-12 22:20:59,982 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1198031.3333333333, ans=0.125 2023-10-12 22:22:07,976 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.33 vs. limit=22.5 2023-10-12 22:22:15,110 INFO [train.py:1031] (3/4) Epoch 19, batch 11000, loss[loss=0.1669, simple_loss=0.2682, pruned_loss=0.03282, over 16804.00 frames. ], tot_loss[loss=0.1917, simple_loss=0.283, pruned_loss=0.05018, over 32650917.50 frames. 
], batch size: 87, lr: 1.81e-03, grad_scale: 16.0 2023-10-12 22:22:15,490 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1198358.0, ans=0.0 2023-10-12 22:22:23,138 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1198358.0, ans=0.125 2023-10-12 22:22:28,879 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1198404.6666666667, ans=0.0 2023-10-12 22:22:30,056 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.18 vs. limit=15.0 2023-10-12 22:22:36,787 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1198451.3333333333, ans=0.125 2023-10-12 22:22:42,329 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1198451.3333333333, ans=0.0 2023-10-12 22:22:50,419 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.416e+02 1.817e+02 1.986e+02 2.225e+02 2.965e+02, threshold=3.971e+02, percent-clipped=0.0 2023-10-12 22:23:29,277 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1198638.0, ans=0.1 2023-10-12 22:23:41,928 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1198684.6666666667, ans=0.5 2023-10-12 22:24:06,141 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1198778.0, ans=0.125 2023-10-12 22:24:35,991 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1198871.3333333333, ans=0.1 2023-10-12 22:24:37,331 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.85 vs. limit=15.0 2023-10-12 22:24:42,064 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.22 vs. limit=15.0 2023-10-12 22:24:58,148 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.444e+02 1.670e+02 1.835e+02 2.100e+02 2.698e+02, threshold=3.670e+02, percent-clipped=0.0 2023-10-12 22:25:04,854 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1198964.6666666667, ans=0.125 2023-10-12 22:25:08,193 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 22:25:20,497 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.33 vs. 
limit=15.0 2023-10-12 22:25:38,331 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1199058.0, ans=0.1 2023-10-12 22:25:40,091 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1199058.0, ans=0.125 2023-10-12 22:25:51,098 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1199104.6666666667, ans=0.07 2023-10-12 22:26:48,237 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.23 vs. limit=15.0 2023-10-12 22:27:21,916 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1199338.0, ans=0.0 2023-10-12 22:27:29,733 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.92 vs. limit=15.0 2023-10-12 22:27:38,289 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.419e+02 1.738e+02 1.859e+02 1.998e+02 2.829e+02, threshold=3.717e+02, percent-clipped=0.0 2023-10-12 22:27:48,020 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.86 vs. limit=10.0 2023-10-12 22:27:51,411 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.96 vs. limit=15.0 2023-10-12 22:28:16,045 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1199571.3333333333, ans=0.0 2023-10-12 22:28:24,508 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.06 vs. 
limit=22.5 2023-10-12 22:28:41,042 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1199664.6666666667, ans=0.1 2023-10-12 22:28:57,987 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1199711.3333333333, ans=0.1 2023-10-12 22:29:25,937 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1199758.0, ans=0.125 2023-10-12 22:29:41,717 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1199758.0, ans=0.125 2023-10-12 22:30:00,536 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1199851.3333333333, ans=0.125 2023-10-12 22:30:04,606 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1199851.3333333333, ans=0.125 2023-10-12 22:30:07,829 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1199851.3333333333, ans=0.125 2023-10-12 22:30:09,828 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1199851.3333333333, ans=0.0 2023-10-12 22:30:15,194 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.455e+02 1.680e+02 1.903e+02 2.210e+02 3.202e+02, threshold=3.807e+02, percent-clipped=0.0 2023-10-12 22:30:17,731 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.17 vs. limit=15.0 2023-10-12 22:30:27,614 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1199944.6666666667, ans=0.125 2023-10-12 22:30:29,509 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1199944.6666666667, ans=0.125 2023-10-12 22:30:35,125 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.09 vs. 
limit=22.5 2023-10-12 22:30:40,549 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff2.min_abs, batch_count=1199991.3333333333, ans=0.1 2023-10-12 22:30:40,570 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1199991.3333333333, ans=0.125 2023-10-12 22:30:44,225 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 22:30:55,353 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1200038.0, ans=0.0 2023-10-12 22:30:56,359 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1200038.0, ans=0.125 2023-10-12 22:31:01,919 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1200038.0, ans=0.0 2023-10-12 22:31:24,459 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1200131.3333333333, ans=10.0 2023-10-12 22:31:45,657 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1200224.6666666667, ans=0.1 2023-10-12 22:31:49,409 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1200224.6666666667, ans=0.1 2023-10-12 22:31:51,184 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1200224.6666666667, ans=0.0 2023-10-12 22:31:56,748 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1200271.3333333333, ans=0.125 2023-10-12 22:32:04,597 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1200271.3333333333, ans=0.125 2023-10-12 22:32:21,720 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1200364.6666666667, ans=0.0 2023-10-12 22:32:22,373 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.531e+02 1.847e+02 2.052e+02 2.155e+02 3.101e+02, threshold=4.103e+02, percent-clipped=0.0 2023-10-12 22:32:37,652 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1200411.3333333333, ans=0.125 2023-10-12 22:33:17,118 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1200551.3333333333, ans=0.125 2023-10-12 22:33:20,233 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1200551.3333333333, ans=0.125 2023-10-12 22:33:35,703 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1200644.6666666667, ans=0.1 2023-10-12 22:33:40,876 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.37 vs. limit=15.0 2023-10-12 22:33:45,887 INFO [train.py:1031] (3/4) Epoch 19, batch 11500, loss[loss=0.1824, simple_loss=0.2848, pruned_loss=0.04004, over 16877.00 frames. 
], tot_loss[loss=0.1912, simple_loss=0.2825, pruned_loss=0.04993, over 32661680.74 frames. ], batch size: 104, lr: 1.81e-03, grad_scale: 16.0 2023-10-12 22:33:57,302 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=1200738.0, ans=0.07 2023-10-12 22:34:12,404 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1200784.6666666667, ans=0.125 2023-10-12 22:34:20,177 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.486e+02 1.794e+02 1.951e+02 2.213e+02 4.157e+02, threshold=3.902e+02, percent-clipped=1.0 2023-10-12 22:34:24,627 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.52 vs. limit=15.0 2023-10-12 22:34:39,717 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1200878.0, ans=0.0 2023-10-12 22:34:39,784 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 22:35:05,063 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1200971.3333333333, ans=0.125 2023-10-12 22:35:08,292 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1201018.0, ans=0.1 2023-10-12 22:35:13,330 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.53 vs. limit=22.5 2023-10-12 22:35:32,594 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1201064.6666666667, ans=0.2 2023-10-12 22:35:43,780 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.00 vs. limit=15.0 2023-10-12 22:35:52,817 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1201158.0, ans=0.125 2023-10-12 22:35:54,928 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1201158.0, ans=10.0 2023-10-12 22:35:57,680 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1201158.0, ans=0.1 2023-10-12 22:36:01,264 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.87 vs. limit=15.0 2023-10-12 22:36:17,784 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.21 vs. 
limit=15.0 2023-10-12 22:36:24,891 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.357e+02 1.738e+02 1.841e+02 2.038e+02 2.691e+02, threshold=3.681e+02, percent-clipped=0.0 2023-10-12 22:36:25,216 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1201298.0, ans=0.0 2023-10-12 22:36:37,781 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1201344.6666666667, ans=0.125 2023-10-12 22:36:39,311 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1201344.6666666667, ans=0.1 2023-10-12 22:36:49,965 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1201391.3333333333, ans=0.125 2023-10-12 22:37:00,029 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 22:37:24,313 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1201531.3333333333, ans=0.125 2023-10-12 22:37:31,069 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=1201578.0, ans=10.0 2023-10-12 22:37:34,048 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1201578.0, ans=0.125 2023-10-12 22:37:50,641 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1201624.6666666667, ans=0.1 2023-10-12 22:38:05,743 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1201718.0, ans=0.125 2023-10-12 22:38:16,246 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.469e+02 1.717e+02 1.884e+02 2.129e+02 3.304e+02, threshold=3.769e+02, percent-clipped=0.0 2023-10-12 22:38:34,240 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1201811.3333333333, ans=0.125 2023-10-12 22:38:54,851 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1201858.0, ans=0.2 2023-10-12 22:38:57,581 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.35 vs. limit=15.0 2023-10-12 22:39:08,020 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1201904.6666666667, ans=0.125 2023-10-12 22:39:09,077 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1201904.6666666667, ans=0.125 2023-10-12 22:39:19,486 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=1201951.3333333333, ans=15.0 2023-10-12 22:39:24,052 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.93 vs. limit=15.0 2023-10-12 22:39:38,144 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.72 vs. 
limit=22.5 2023-10-12 22:39:49,283 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1202044.6666666667, ans=0.07 2023-10-12 22:39:59,206 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1202091.3333333333, ans=0.0 2023-10-12 22:40:01,010 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1202091.3333333333, ans=0.0 2023-10-12 22:40:32,942 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.432e+02 1.694e+02 1.841e+02 2.030e+02 2.551e+02, threshold=3.681e+02, percent-clipped=0.0 2023-10-12 22:40:34,718 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1202231.3333333333, ans=0.2 2023-10-12 22:41:01,291 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.94 vs. limit=15.0 2023-10-12 22:41:05,684 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.97 vs. limit=15.0 2023-10-12 22:41:08,023 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.50 vs. limit=15.0 2023-10-12 22:41:09,604 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1202371.3333333333, ans=0.2 2023-10-12 22:41:11,988 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1202371.3333333333, ans=0.1 2023-10-12 22:41:18,643 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1202418.0, ans=0.0 2023-10-12 22:41:22,352 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1202418.0, ans=0.2 2023-10-12 22:41:30,928 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1202464.6666666667, ans=0.1 2023-10-12 22:41:36,442 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=1202464.6666666667, ans=10.0 2023-10-12 22:41:36,640 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.21 vs. limit=15.0 2023-10-12 22:41:47,539 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.12 vs. 
limit=10.0 2023-10-12 22:42:17,056 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1202604.6666666667, ans=0.0 2023-10-12 22:42:17,096 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1202604.6666666667, ans=0.0 2023-10-12 22:42:33,718 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.471e+02 1.759e+02 1.919e+02 2.116e+02 3.023e+02, threshold=3.838e+02, percent-clipped=0.0 2023-10-12 22:42:34,064 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1202698.0, ans=0.125 2023-10-12 22:42:51,329 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1202744.6666666667, ans=10.0 2023-10-12 22:43:01,408 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1202791.3333333333, ans=0.125 2023-10-12 22:43:22,092 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1202884.6666666667, ans=0.1 2023-10-12 22:43:38,733 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1202931.3333333333, ans=0.125 2023-10-12 22:43:39,812 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1202931.3333333333, ans=0.07 2023-10-12 22:43:43,715 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1202978.0, ans=0.1 2023-10-12 22:43:43,764 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=1202978.0, ans=0.02 2023-10-12 22:43:55,754 INFO [train.py:1031] (3/4) Epoch 19, batch 12000, loss[loss=0.1736, simple_loss=0.2685, pruned_loss=0.03936, over 16760.00 frames. ], tot_loss[loss=0.1909, simple_loss=0.2825, pruned_loss=0.04962, over 32727563.25 frames. ], batch size: 56, lr: 1.81e-03, grad_scale: 32.0 2023-10-12 22:44:03,778 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.25 vs. 
limit=15.0 2023-10-12 22:44:15,741 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1203071.3333333333, ans=0.1 2023-10-12 22:44:34,256 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.380e+02 1.783e+02 1.956e+02 2.155e+02 3.074e+02, threshold=3.911e+02, percent-clipped=0.0 2023-10-12 22:44:35,485 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1203164.6666666667, ans=0.125 2023-10-12 22:44:50,121 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1203211.3333333333, ans=0.125 2023-10-12 22:45:38,859 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_positive, batch_count=1203398.0, ans=0.05 2023-10-12 22:45:40,673 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1203444.6666666667, ans=0.0 2023-10-12 22:45:51,642 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1203491.3333333333, ans=0.0 2023-10-12 22:45:53,464 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1203491.3333333333, ans=0.125 2023-10-12 22:46:01,690 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1203538.0, ans=0.125 2023-10-12 22:46:09,733 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.53 vs. limit=10.0 2023-10-12 22:46:26,515 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.435e+02 1.678e+02 1.808e+02 2.090e+02 3.027e+02, threshold=3.617e+02, percent-clipped=0.0 2023-10-12 22:46:27,045 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=14.82 vs. limit=22.5 2023-10-12 22:46:30,737 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.83 vs. limit=6.0 2023-10-12 22:46:37,710 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1203678.0, ans=0.125 2023-10-12 22:46:40,512 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1203678.0, ans=0.125 2023-10-12 22:46:45,959 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.30 vs. 
limit=12.0 2023-10-12 22:46:46,518 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1203724.6666666667, ans=0.125 2023-10-12 22:46:53,457 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1203724.6666666667, ans=0.0 2023-10-12 22:47:27,712 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1203864.6666666667, ans=0.0 2023-10-12 22:47:30,146 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1203864.6666666667, ans=0.035 2023-10-12 22:47:47,272 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1203911.3333333333, ans=0.125 2023-10-12 22:47:47,503 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.12 vs. limit=15.0 2023-10-12 22:48:04,423 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1204004.6666666667, ans=0.125 2023-10-12 22:48:13,616 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1204051.3333333333, ans=0.0 2023-10-12 22:48:14,811 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1204051.3333333333, ans=0.0 2023-10-12 22:48:16,117 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1204051.3333333333, ans=0.1 2023-10-12 22:48:25,052 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1204098.0, ans=0.2 2023-10-12 22:48:26,816 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1204098.0, ans=0.125 2023-10-12 22:48:27,518 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.409e+02 1.806e+02 1.996e+02 2.205e+02 3.319e+02, threshold=3.992e+02, percent-clipped=0.0 2023-10-12 22:49:14,008 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 22:49:17,647 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1204284.6666666667, ans=0.125 2023-10-12 22:49:35,800 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1204378.0, ans=0.04949747468305833 2023-10-12 22:50:11,408 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1204518.0, ans=0.0 2023-10-12 22:50:13,320 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1204518.0, ans=0.1 2023-10-12 22:50:21,096 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1204564.6666666667, ans=0.0 2023-10-12 22:50:24,438 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.475e+02 1.748e+02 1.987e+02 2.168e+02 2.757e+02, threshold=3.974e+02, percent-clipped=0.0 2023-10-12 22:50:28,202 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, 
batch_count=1204564.6666666667, ans=0.2 2023-10-12 22:50:38,214 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1204611.3333333333, ans=0.0 2023-10-12 22:50:44,635 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1204658.0, ans=0.0 2023-10-12 22:51:13,033 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1204751.3333333333, ans=0.0 2023-10-12 22:51:24,068 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.84 vs. limit=15.0 2023-10-12 22:52:19,831 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1204984.6666666667, ans=0.125 2023-10-12 22:52:24,043 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1205031.3333333333, ans=0.125 2023-10-12 22:52:28,876 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.567e+02 1.765e+02 1.914e+02 2.120e+02 2.910e+02, threshold=3.828e+02, percent-clipped=0.0 2023-10-12 22:52:32,217 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=16.17 vs. limit=15.0 2023-10-12 22:52:33,137 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.45 vs. limit=15.0 2023-10-12 22:52:37,527 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1205078.0, ans=0.125 2023-10-12 22:52:53,823 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.58 vs. limit=15.0 2023-10-12 22:53:13,035 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1205171.3333333333, ans=0.1 2023-10-12 22:53:25,649 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1205218.0, ans=0.125 2023-10-12 22:53:28,911 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.80 vs. limit=6.0 2023-10-12 22:53:29,505 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1205264.6666666667, ans=0.125 2023-10-12 22:53:40,855 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.07 vs. limit=22.5 2023-10-12 22:53:47,438 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.67 vs. limit=15.0 2023-10-12 22:53:50,842 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1205311.3333333333, ans=0.1 2023-10-12 22:53:53,345 INFO [train.py:1031] (3/4) Epoch 19, batch 12500, loss[loss=0.2008, simple_loss=0.2957, pruned_loss=0.05299, over 16861.00 frames. ], tot_loss[loss=0.1908, simple_loss=0.2823, pruned_loss=0.04966, over 32742976.89 frames. 
], batch size: 188, lr: 1.81e-03, grad_scale: 32.0 2023-10-12 22:54:32,724 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.447e+02 1.730e+02 1.873e+02 2.080e+02 3.168e+02, threshold=3.745e+02, percent-clipped=0.0 2023-10-12 22:54:37,327 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1205498.0, ans=0.0 2023-10-12 22:54:41,472 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=9.65 vs. limit=15.0 2023-10-12 22:54:47,327 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1205544.6666666667, ans=0.125 2023-10-12 22:54:56,812 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.60 vs. limit=22.5 2023-10-12 22:54:57,611 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=11.01 vs. limit=15.0 2023-10-12 22:55:00,257 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.17 vs. limit=10.0 2023-10-12 22:55:08,897 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1205638.0, ans=0.125 2023-10-12 22:55:13,346 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1205638.0, ans=0.0 2023-10-12 22:55:17,790 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1205684.6666666667, ans=0.125 2023-10-12 22:55:24,767 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1205684.6666666667, ans=0.125 2023-10-12 22:56:04,247 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1205871.3333333333, ans=0.125 2023-10-12 22:56:14,937 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1205918.0, ans=0.125 2023-10-12 22:56:20,204 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1205918.0, ans=0.2 2023-10-12 22:56:31,651 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1205964.6666666667, ans=0.125 2023-10-12 22:56:32,225 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.488e+02 1.755e+02 1.874e+02 2.175e+02 2.894e+02, threshold=3.749e+02, percent-clipped=0.0 2023-10-12 22:56:34,329 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-12 22:56:36,362 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1205964.6666666667, ans=0.1 2023-10-12 22:56:42,934 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1206011.3333333333, ans=0.0 2023-10-12 22:56:56,008 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.59 vs. 
limit=15.0 2023-10-12 22:57:21,537 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.10 vs. limit=12.0 2023-10-12 22:57:23,601 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1206151.3333333333, ans=0.125 2023-10-12 22:57:37,884 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1206198.0, ans=0.125 2023-10-12 22:57:59,713 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.72 vs. limit=15.0 2023-10-12 22:58:03,451 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1206291.3333333333, ans=0.2 2023-10-12 22:58:14,936 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1206338.0, ans=0.0 2023-10-12 22:58:18,863 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.45 vs. limit=15.0 2023-10-12 22:58:32,463 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.476e+02 1.751e+02 1.923e+02 2.097e+02 3.428e+02, threshold=3.845e+02, percent-clipped=0.0 2023-10-12 22:58:33,977 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1206431.3333333333, ans=0.0 2023-10-12 22:59:06,075 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1206571.3333333333, ans=0.05 2023-10-12 22:59:09,029 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1206571.3333333333, ans=0.0 2023-10-12 22:59:17,139 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1206618.0, ans=0.125 2023-10-12 22:59:21,249 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.76 vs. limit=15.0 2023-10-12 22:59:23,067 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1206618.0, ans=0.1 2023-10-12 22:59:23,408 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.33 vs. limit=15.0 2023-10-12 22:59:32,710 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.72 vs. limit=22.5 2023-10-12 22:59:42,286 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.59 vs. 
limit=12.0 2023-10-12 22:59:49,698 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1206758.0, ans=0.035 2023-10-12 23:00:04,909 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1206804.6666666667, ans=0.125 2023-10-12 23:00:15,507 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1206851.3333333333, ans=0.0 2023-10-12 23:00:16,631 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1206851.3333333333, ans=0.2 2023-10-12 23:00:27,607 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.465e+02 1.893e+02 2.185e+02 2.410e+02 3.248e+02, threshold=4.370e+02, percent-clipped=0.0 2023-10-12 23:00:47,745 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.64 vs. limit=15.0 2023-10-12 23:00:56,183 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.26 vs. limit=15.0 2023-10-12 23:01:05,315 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1207038.0, ans=0.125 2023-10-12 23:02:01,668 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1207271.3333333333, ans=0.0 2023-10-12 23:02:16,244 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1207318.0, ans=0.1 2023-10-12 23:02:27,314 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.387e+02 1.652e+02 1.853e+02 2.126e+02 3.167e+02, threshold=3.705e+02, percent-clipped=0.0 2023-10-12 23:02:40,629 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1207411.3333333333, ans=0.125 2023-10-12 23:02:43,448 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.04 vs. limit=10.0 2023-10-12 23:02:47,828 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1207458.0, ans=0.1 2023-10-12 23:03:05,969 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.02 vs. limit=6.0 2023-10-12 23:03:16,673 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.84 vs. limit=12.0 2023-10-12 23:03:43,407 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1207644.6666666667, ans=0.125 2023-10-12 23:03:47,170 INFO [train.py:1031] (3/4) Epoch 19, batch 13000, loss[loss=0.2058, simple_loss=0.292, pruned_loss=0.05981, over 16899.00 frames. ], tot_loss[loss=0.1909, simple_loss=0.2826, pruned_loss=0.04956, over 32745487.11 frames. 
], batch size: 130, lr: 1.80e-03, grad_scale: 32.0 2023-10-12 23:03:47,357 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1207691.3333333333, ans=0.0 2023-10-12 23:04:22,603 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1207784.6666666667, ans=0.1 2023-10-12 23:04:34,745 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.40 vs. limit=15.0 2023-10-12 23:04:34,940 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.401e+02 1.765e+02 1.951e+02 2.198e+02 2.914e+02, threshold=3.902e+02, percent-clipped=0.0 2023-10-12 23:04:41,281 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1207831.3333333333, ans=0.07 2023-10-12 23:04:57,520 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1207924.6666666667, ans=0.125 2023-10-12 23:05:01,392 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_ff3.min_abs, batch_count=1207924.6666666667, ans=0.2 2023-10-12 23:05:13,998 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.95 vs. limit=22.5 2023-10-12 23:05:19,073 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1207971.3333333333, ans=0.125 2023-10-12 23:05:23,582 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1208018.0, ans=0.125 2023-10-12 23:05:36,939 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1208064.6666666667, ans=0.125 2023-10-12 23:05:36,963 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1208064.6666666667, ans=0.1 2023-10-12 23:05:43,196 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1208064.6666666667, ans=0.0 2023-10-12 23:05:56,667 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=9.69 vs. 
limit=22.5 2023-10-12 23:06:00,431 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1208158.0, ans=0.125 2023-10-12 23:06:02,444 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1208158.0, ans=0.0 2023-10-12 23:06:37,812 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1208298.0, ans=0.125 2023-10-12 23:06:41,877 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.390e+02 1.614e+02 1.778e+02 1.975e+02 2.768e+02, threshold=3.557e+02, percent-clipped=0.0 2023-10-12 23:07:03,266 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1208391.3333333333, ans=0.125 2023-10-12 23:07:18,188 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1208438.0, ans=0.0 2023-10-12 23:07:38,534 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.33 vs. limit=15.0 2023-10-12 23:07:39,423 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1208484.6666666667, ans=0.125 2023-10-12 23:08:02,064 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1208578.0, ans=0.125 2023-10-12 23:08:15,686 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1208624.6666666667, ans=0.1 2023-10-12 23:08:18,174 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=18.64 vs. limit=22.5 2023-10-12 23:08:22,346 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1208671.3333333333, ans=0.125 2023-10-12 23:08:26,711 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1208671.3333333333, ans=0.2 2023-10-12 23:08:30,502 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.87 vs. limit=15.0 2023-10-12 23:08:36,370 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1208718.0, ans=0.2 2023-10-12 23:08:48,260 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.440e+02 1.698e+02 1.854e+02 2.054e+02 2.727e+02, threshold=3.708e+02, percent-clipped=0.0 2023-10-12 23:09:01,445 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1208811.3333333333, ans=0.1 2023-10-12 23:09:02,635 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.16 vs. 
limit=15.0 2023-10-12 23:09:16,363 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1208858.0, ans=0.2 2023-10-12 23:09:17,299 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1208858.0, ans=0.1 2023-10-12 23:10:00,973 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1209044.6666666667, ans=0.1 2023-10-12 23:10:12,562 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1209091.3333333333, ans=0.2 2023-10-12 23:10:14,520 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1209091.3333333333, ans=0.125 2023-10-12 23:10:19,777 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1209091.3333333333, ans=0.125 2023-10-12 23:10:20,863 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1209091.3333333333, ans=0.125 2023-10-12 23:10:20,918 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1209091.3333333333, ans=0.5 2023-10-12 23:10:21,801 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1209091.3333333333, ans=0.0 2023-10-12 23:10:26,697 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.08 vs. limit=10.0 2023-10-12 23:10:27,346 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1209138.0, ans=0.125 2023-10-12 23:10:49,461 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.482e+02 1.751e+02 1.882e+02 2.089e+02 2.854e+02, threshold=3.763e+02, percent-clipped=0.0 2023-10-12 23:11:10,146 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 23:11:29,764 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1209418.0, ans=0.0 2023-10-12 23:11:34,339 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1209418.0, ans=0.0 2023-10-12 23:11:46,222 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1209464.6666666667, ans=0.0 2023-10-12 23:11:47,407 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1209464.6666666667, ans=0.0 2023-10-12 23:11:55,088 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 23:11:55,268 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.82 vs. limit=6.0
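
The Whitening records above come from a regularizer in zipformer's scaling.py: each named module reports a metric against a limit, and (on the usual reading of these logs) whitening pressure is applied only when the metric exceeds the limit. A plausible metric of this kind is the eigenvalue ratio sketched below, which equals 1.0 when each channel group's covariance is proportional to the identity and grows toward the group size as the features become degenerate. This is a minimal sketch under that assumption, not the exact scaling.py implementation:

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> torch.Tensor:
        # x: (num_frames, num_channels). Returns a scalar in [1, channels_per_group]
        # that is 1.0 exactly when each group's covariance is a multiple of I.
        num_frames, num_channels = x.shape
        c = num_channels // num_groups
        x = x.reshape(num_frames, num_groups, c).transpose(0, 1)  # (groups, frames, c)
        covar = torch.matmul(x.transpose(1, 2), x) / num_frames   # (groups, c, c)
        trace = covar.diagonal(dim1=1, dim2=2).sum(-1)            # sum of eigenvalues
        frob_sq = (covar * covar).sum(dim=(1, 2))                 # sum of squared eigenvalues
        return (frob_sq * c / (trace * trace)).mean()

Read this way, the whiten_keys record just above (num_groups=4, num_channels=128, metric=1.82 vs. limit=6.0) sits comfortably under its limit, so no penalty would be applied there.
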
2023-10-12 23:11:59,234 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1209511.3333333333, ans=0.0 2023-10-12 23:12:09,019 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1209558.0, ans=0.035 2023-10-12 23:12:09,069 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1209558.0, ans=0.2 2023-10-12 23:12:34,595 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1209651.3333333333, ans=0.125 2023-10-12 23:12:34,941 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.34 vs. limit=12.0 2023-10-12 23:12:35,741 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1209651.3333333333, ans=0.0 2023-10-12 23:12:36,738 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1209651.3333333333, ans=0.0 2023-10-12 23:12:42,260 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1209698.0, ans=0.2 2023-10-12 23:12:47,162 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.406e+02 1.764e+02 1.919e+02 2.102e+02 2.912e+02, threshold=3.837e+02, percent-clipped=0.0 2023-10-12 23:12:48,495 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1209698.0, ans=0.125 2023-10-12 23:13:00,655 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.97 vs. limit=6.0 2023-10-12 23:13:14,640 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1209791.3333333333, ans=0.0 2023-10-12 23:13:31,903 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1209884.6666666667, ans=0.125 2023-10-12 23:13:44,110 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1209931.3333333333, ans=0.125 2023-10-12 23:14:01,985 INFO [train.py:1031] (3/4) Epoch 19, batch 13500, loss[loss=0.1801, simple_loss=0.2737, pruned_loss=0.04324, over 15939.00 frames. ], tot_loss[loss=0.1903, simple_loss=0.2819, pruned_loss=0.0493, over 32782043.75 frames. ], batch size: 43, lr: 1.80e-03, grad_scale: 16.0 2023-10-12 23:14:43,053 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1210164.6666666667, ans=0.1 2023-10-12 23:14:43,719 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.392e+02 1.770e+02 1.932e+02 2.121e+02 2.751e+02, threshold=3.864e+02, percent-clipped=0.0 2023-10-12 23:14:44,183 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1210164.6666666667, ans=0.04949747468305833 2023-10-12 23:15:02,722 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.36 vs. limit=5.0
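
The optim.py Clipping_scale records above are self-consistent under a simple rule: the five values are the min/25%/median/75%/max of recently observed per-batch gradient norms, and the clipping threshold is clipping_scale times the median (2.0 x 1.919e+02 ~ 3.837e+02 in the first record above, and 2.0 x 1.932e+02 = 3.864e+02 in the second). The helper below is an illustrative reconstruction of that bookkeeping, not icefall's actual optimizer code:

    import torch

    def clipping_stats(grad_norms: torch.Tensor, clipping_scale: float = 2.0):
        # grad_norms: 1-D tensor of recent per-batch gradient norms.
        q = torch.quantile(grad_norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = clipping_scale * q[2]  # clipping_scale times the median
        percent_clipped = 100.0 * (grad_norms > threshold).float().mean()
        return q, threshold, percent_clipped

    # With norms shaped like the log above, nothing exceeds twice the median,
    # which matches the repeated percent-clipped=0.0.
    print(clipping_stats(torch.tensor([140.6, 176.4, 191.9, 210.2, 291.2])))
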
2023-10-12 23:15:04,031 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.80 vs. limit=12.0 2023-10-12 23:15:42,123 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.75 vs. limit=15.0 2023-10-12 23:16:21,286 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1210538.0, ans=0.2 2023-10-12 23:16:36,356 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1210631.3333333333, ans=0.125 2023-10-12 23:16:36,526 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1210631.3333333333, ans=0.125 2023-10-12 23:16:37,061 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.418e+02 1.772e+02 1.955e+02 2.136e+02 2.905e+02, threshold=3.909e+02, percent-clipped=0.0 2023-10-12 23:16:41,003 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1210631.3333333333, ans=0.05 2023-10-12 23:16:55,291 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1210724.6666666667, ans=0.125 2023-10-12 23:16:55,664 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.26 vs. limit=22.5 2023-10-12 23:17:41,200 INFO [train.py:1031] (3/4) Epoch 20, batch 0, loss[loss=0.1621, simple_loss=0.2556, pruned_loss=0.03428, over 15999.00 frames. ], tot_loss[loss=0.1621, simple_loss=0.2556, pruned_loss=0.03428, over 15999.00 frames. ], batch size: 43, lr: 1.75e-03, grad_scale: 32.0 2023-10-12 23:17:41,201 INFO [train.py:1054] (3/4) Computing validation loss 2023-10-12 23:17:50,667 INFO [train.py:1063] (3/4) Epoch 20, validation: loss=0.2148, simple_loss=0.3012, pruned_loss=0.06418, over 1020973.00 frames. 2023-10-12 23:17:50,667 INFO [train.py:1064] (3/4) Maximum memory allocated so far is 16891MB 2023-10-12 23:18:05,121 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1210794.6666666667, ans=0.0 2023-10-12 23:18:07,826 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1210794.6666666667, ans=0.125 2023-10-12 23:18:33,451 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1210888.0, ans=0.125 2023-10-12 23:18:45,210 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1210934.6666666667, ans=0.1 2023-10-12 23:19:03,794 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1210981.3333333333, ans=0.2 2023-10-12 23:19:24,242 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.04 vs. limit=22.5
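
Most of the scaling.py:199 records above are ScheduledFloat values: hyperparameters such as dropout_p, skip rates, balancer probs, and bypass scale_min whose current value (the logged ans) is a function of batch_count, so the model is heavily regularized early in training and the constraints relax as batch_count grows. A minimal sketch of piecewise-linear scheduling in that spirit follows; the breakpoints in the example are hypothetical, chosen only to reproduce the many ans=0.0 skip rates seen at batch_count around 1.21e+06:

    def scheduled_float(batch_count: float, *points: tuple) -> float:
        # points: (batch_count, value) pairs in increasing batch_count order;
        # the end values are held constant outside the covered range.
        x0, y0 = points[0]
        if batch_count <= x0:
            return y0
        for x1, y1 in points[1:]:
            if batch_count <= x1:
                return y0 + (batch_count - x0) * (y1 - y0) / (x1 - x0)
            x0, y0 = x1, y1
        return y0

    # Hypothetical skip-rate schedule: 0.5 at the start, 0.0 from batch 20000 on.
    print(scheduled_float(1210794.0, (0.0, 0.5), (20000.0, 0.0)))  # -> 0.0
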
2023-10-12 23:19:30,303 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 23:19:32,049 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1211074.6666666667, ans=0.1 2023-10-12 23:19:32,541 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.394e+02 1.696e+02 1.854e+02 2.053e+02 3.223e+02, threshold=3.708e+02, percent-clipped=0.0 2023-10-12 23:19:48,286 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.01 vs. limit=15.0 2023-10-12 23:20:00,427 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.94 vs. limit=15.0 2023-10-12 23:20:21,433 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1211308.0, ans=0.125 2023-10-12 23:20:29,359 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1211308.0, ans=0.2 2023-10-12 23:20:37,743 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.91 vs. limit=15.0 2023-10-12 23:20:48,489 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1211401.3333333333, ans=0.125 2023-10-12 23:21:05,914 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1211448.0, ans=0.1 2023-10-12 23:21:06,798 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1211448.0, ans=0.125 2023-10-12 23:21:09,485 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.03 vs. limit=12.0 2023-10-12 23:21:31,196 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.443e+02 1.700e+02 1.867e+02 2.075e+02 2.646e+02, threshold=3.733e+02, percent-clipped=0.0 2023-10-12 23:22:24,577 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1211821.3333333333, ans=0.125 2023-10-12 23:22:26,700 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.61 vs.
limit=12.0 2023-10-12 23:22:33,454 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1211821.3333333333, ans=0.09899494936611666 2023-10-12 23:22:35,859 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1211868.0, ans=0.2 2023-10-12 23:22:41,957 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1211868.0, ans=0.125 2023-10-12 23:22:56,065 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1211914.6666666667, ans=0.1 2023-10-12 23:23:20,006 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1212008.0, ans=0.0 2023-10-12 23:23:24,245 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-12 23:23:28,536 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.426e+02 1.729e+02 1.884e+02 2.111e+02 2.882e+02, threshold=3.767e+02, percent-clipped=0.0 2023-10-12 23:23:35,298 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1212054.6666666667, ans=0.0 2023-10-12 23:23:38,152 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1212054.6666666667, ans=0.0 2023-10-12 23:23:42,084 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1212101.3333333333, ans=0.0 2023-10-12 23:23:56,884 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.60 vs. limit=22.5 2023-10-12 23:24:03,468 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.59 vs. limit=12.0 2023-10-12 23:24:24,095 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1212241.3333333333, ans=0.09899494936611666 2023-10-12 23:24:28,032 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.13 vs. 
limit=15.0 2023-10-12 23:25:10,434 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1212474.6666666667, ans=0.07 2023-10-12 23:25:23,708 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.454e+02 1.770e+02 1.959e+02 2.288e+02 3.198e+02, threshold=3.918e+02, percent-clipped=0.0 2023-10-12 23:25:55,017 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1212614.6666666667, ans=0.025 2023-10-12 23:26:08,528 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1212661.3333333333, ans=0.125 2023-10-12 23:26:23,311 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1212708.0, ans=0.0 2023-10-12 23:26:29,267 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1212754.6666666667, ans=0.125 2023-10-12 23:26:29,269 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1212754.6666666667, ans=0.2 2023-10-12 23:26:39,813 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1212801.3333333333, ans=0.125 2023-10-12 23:26:50,619 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1212848.0, ans=0.125 2023-10-12 23:27:04,572 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 23:27:12,721 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1212941.3333333333, ans=0.2 2023-10-12 23:27:17,034 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.79 vs. limit=22.5 2023-10-12 23:27:23,923 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1212941.3333333333, ans=0.0 2023-10-12 23:27:26,095 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.480e+02 1.779e+02 1.950e+02 2.164e+02 2.885e+02, threshold=3.899e+02, percent-clipped=0.0 2023-10-12 23:27:31,084 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1212988.0, ans=0.1 2023-10-12 23:27:53,928 INFO [train.py:1031] (3/4) Epoch 20, batch 500, loss[loss=0.1881, simple_loss=0.2735, pruned_loss=0.05136, over 15623.00 frames. ], tot_loss[loss=0.1919, simple_loss=0.2831, pruned_loss=0.05035, over 7293668.04 frames. ], batch size: 35, lr: 1.75e-03, grad_scale: 16.0 2023-10-12 23:27:54,443 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1213081.3333333333, ans=0.125 2023-10-12 23:28:00,620 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1213081.3333333333, ans=0.025 2023-10-12 23:28:08,793 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.31 vs. 
limit=22.5 2023-10-12 23:28:50,252 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.88 vs. limit=15.0 2023-10-12 23:29:21,010 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 23:29:52,492 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.526e+02 1.838e+02 1.952e+02 2.214e+02 2.730e+02, threshold=3.905e+02, percent-clipped=0.0 2023-10-12 23:29:55,482 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff2.min_abs, batch_count=1213454.6666666667, ans=0.1 2023-10-12 23:30:25,236 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.20 vs. limit=6.0 2023-10-12 23:30:27,620 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1213594.6666666667, ans=0.2 2023-10-12 23:30:28,899 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1213594.6666666667, ans=0.1 2023-10-12 23:30:29,855 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1213594.6666666667, ans=0.0 2023-10-12 23:30:49,003 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1213688.0, ans=0.5 2023-10-12 23:31:09,632 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1213781.3333333333, ans=0.125 2023-10-12 23:31:20,199 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1213828.0, ans=0.125 2023-10-12 23:31:20,622 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.50 vs. 
limit=15.0 2023-10-12 23:31:21,948 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1213828.0, ans=0.1 2023-10-12 23:31:25,514 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1213828.0, ans=0.125 2023-10-12 23:31:49,454 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.523e+02 1.858e+02 2.054e+02 2.299e+02 3.420e+02, threshold=4.108e+02, percent-clipped=0.0 2023-10-12 23:32:14,123 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1214014.6666666667, ans=0.125 2023-10-12 23:32:19,635 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1214014.6666666667, ans=0.2 2023-10-12 23:32:26,062 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1214014.6666666667, ans=0.125 2023-10-12 23:32:26,847 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 23:33:03,120 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1214154.6666666667, ans=0.125 2023-10-12 23:33:06,778 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=3.91 vs. limit=10.0 2023-10-12 23:33:18,139 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1214201.3333333333, ans=0.0 2023-10-12 23:33:18,143 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1214201.3333333333, ans=0.125 2023-10-12 23:33:23,698 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1214248.0, ans=0.2 2023-10-12 23:33:30,790 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 23:33:37,637 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1214294.6666666667, ans=0.0 2023-10-12 23:33:40,957 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1214294.6666666667, ans=0.125 2023-10-12 23:33:47,713 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1214341.3333333333, ans=0.2 2023-10-12 23:34:03,132 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.434e+02 1.690e+02 1.910e+02 2.143e+02 2.694e+02, threshold=3.819e+02, percent-clipped=0.0 2023-10-12 23:34:05,802 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.28 vs. 
limit=15.0 2023-10-12 23:34:07,698 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1214388.0, ans=0.0 2023-10-12 23:34:11,285 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1214388.0, ans=0.2 2023-10-12 23:34:23,035 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1214434.6666666667, ans=0.0 2023-10-12 23:34:29,703 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1214481.3333333333, ans=0.035 2023-10-12 23:34:40,370 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1214528.0, ans=0.125 2023-10-12 23:35:24,354 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1214668.0, ans=0.125 2023-10-12 23:36:07,205 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1214808.0, ans=0.2 2023-10-12 23:36:12,436 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.502e+02 1.781e+02 2.034e+02 2.321e+02 3.164e+02, threshold=4.069e+02, percent-clipped=0.0 2023-10-12 23:36:17,783 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1214854.6666666667, ans=0.125 2023-10-12 23:36:25,830 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1214901.3333333333, ans=0.0 2023-10-12 23:36:31,639 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1214901.3333333333, ans=0.0 2023-10-12 23:36:34,012 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer_na.min_abs, batch_count=1214901.3333333333, ans=0.02 2023-10-12 23:36:42,878 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.22 vs. limit=22.5 2023-10-12 23:36:46,919 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.56 vs. limit=6.0 2023-10-12 23:36:50,991 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1214994.6666666667, ans=0.0 2023-10-12 23:37:35,289 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1215134.6666666667, ans=0.125 2023-10-12 23:37:41,421 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1215181.3333333333, ans=0.0 2023-10-12 23:38:02,015 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.67 vs. 
limit=15.0 2023-10-12 23:38:06,211 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1215274.6666666667, ans=0.125 2023-10-12 23:38:18,137 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.424e+02 1.729e+02 1.930e+02 2.077e+02 2.899e+02, threshold=3.860e+02, percent-clipped=0.0 2023-10-12 23:38:18,469 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1215321.3333333333, ans=0.0 2023-10-12 23:38:36,397 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1215368.0, ans=0.1 2023-10-12 23:38:39,509 INFO [train.py:1031] (3/4) Epoch 20, batch 1000, loss[loss=0.17, simple_loss=0.27, pruned_loss=0.03497, over 16949.00 frames. ], tot_loss[loss=0.1923, simple_loss=0.2837, pruned_loss=0.05045, over 12957347.59 frames. ], batch size: 82, lr: 1.75e-03, grad_scale: 16.0 2023-10-12 23:38:45,001 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1215414.6666666667, ans=0.125 2023-10-12 23:38:45,906 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1215414.6666666667, ans=0.125 2023-10-12 23:38:51,000 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1215461.3333333333, ans=0.2 2023-10-12 23:39:02,678 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1215508.0, ans=0.125 2023-10-12 23:39:07,741 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1215508.0, ans=0.015 2023-10-12 23:39:07,841 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1215508.0, ans=0.125 2023-10-12 23:39:33,999 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1215601.3333333333, ans=0.0 2023-10-12 23:39:50,121 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1215694.6666666667, ans=0.05 2023-10-12 23:40:08,854 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=5.93 vs. limit=15.0 2023-10-12 23:40:14,584 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.453e+02 1.766e+02 1.904e+02 2.150e+02 3.324e+02, threshold=3.808e+02, percent-clipped=0.0 2023-10-12 23:40:53,570 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1215928.0, ans=0.125 2023-10-12 23:41:29,457 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.99 vs. limit=22.5 2023-10-12 23:41:44,879 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1216114.6666666667, ans=0.035 2023-10-12 23:41:59,109 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.23 vs. 
limit=15.0 2023-10-12 23:42:23,745 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.96 vs. limit=22.5 2023-10-12 23:42:24,154 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.515e+02 1.693e+02 1.838e+02 2.115e+02 3.092e+02, threshold=3.676e+02, percent-clipped=0.0 2023-10-12 23:42:37,502 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1216301.3333333333, ans=0.1 2023-10-12 23:42:37,748 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.29 vs. limit=22.5 2023-10-12 23:42:39,849 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.42 vs. limit=22.5 2023-10-12 23:43:04,901 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.52 vs. limit=15.0 2023-10-12 23:43:17,832 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1216441.3333333333, ans=0.125 2023-10-12 23:43:30,001 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1216488.0, ans=0.2 2023-10-12 23:43:38,176 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1216534.6666666667, ans=0.125 2023-10-12 23:43:44,547 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1216534.6666666667, ans=0.1 2023-10-12 23:43:48,537 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1216581.3333333333, ans=0.125 2023-10-12 23:44:08,975 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.54 vs. limit=10.0 2023-10-12 23:44:15,058 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1216674.6666666667, ans=0.125 2023-10-12 23:44:24,727 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.406e+02 1.652e+02 1.826e+02 2.090e+02 2.868e+02, threshold=3.652e+02, percent-clipped=0.0 2023-10-12 23:44:36,548 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1216768.0, ans=0.1 2023-10-12 23:44:47,463 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.02 vs. 
limit=15.0 2023-10-12 23:44:59,838 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1216861.3333333333, ans=0.125 2023-10-12 23:45:02,802 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1216861.3333333333, ans=0.2 2023-10-12 23:45:10,366 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1216908.0, ans=0.125 2023-10-12 23:45:15,275 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 23:45:18,006 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.47 vs. limit=22.5 2023-10-12 23:45:37,698 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.27 vs. limit=15.0 2023-10-12 23:45:41,830 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1217048.0, ans=0.1 2023-10-12 23:45:42,125 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=7.69 vs. limit=22.5 2023-10-12 23:46:00,528 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1217141.3333333333, ans=0.025 2023-10-12 23:46:02,270 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.72 vs. limit=15.0 2023-10-12 23:46:08,995 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_na.min_abs, batch_count=1217141.3333333333, ans=0.02 2023-10-12 23:46:09,042 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1217141.3333333333, ans=0.2 2023-10-12 23:46:15,786 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.376e+02 1.690e+02 1.904e+02 2.129e+02 3.001e+02, threshold=3.809e+02, percent-clipped=0.0 2023-10-12 23:46:21,484 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.56 vs. 
limit=12.0 2023-10-12 23:46:52,564 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1217328.0, ans=0.0 2023-10-12 23:46:55,134 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1217328.0, ans=0.125 2023-10-12 23:47:08,588 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1217374.6666666667, ans=0.2 2023-10-12 23:47:18,472 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1217421.3333333333, ans=0.0 2023-10-12 23:47:23,163 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1217421.3333333333, ans=0.2 2023-10-12 23:47:40,144 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1217514.6666666667, ans=0.125 2023-10-12 23:47:47,470 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1217561.3333333333, ans=0.125 2023-10-12 23:47:48,418 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1217561.3333333333, ans=0.0 2023-10-12 23:47:53,158 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1217561.3333333333, ans=0.0 2023-10-12 23:47:58,042 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1217561.3333333333, ans=0.125 2023-10-12 23:48:13,261 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1217608.0, ans=0.1 2023-10-12 23:48:25,560 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.443e+02 1.753e+02 1.975e+02 2.148e+02 3.605e+02, threshold=3.949e+02, percent-clipped=0.0 2023-10-12 23:48:28,053 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1217654.6666666667, ans=0.2 2023-10-12 23:48:46,684 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.79 vs. limit=15.0 2023-10-12 23:48:49,141 INFO [train.py:1031] (3/4) Epoch 20, batch 1500, loss[loss=0.1669, simple_loss=0.2368, pruned_loss=0.04851, over 12627.00 frames. ], tot_loss[loss=0.1902, simple_loss=0.2816, pruned_loss=0.0494, over 17357845.41 frames. ], batch size: 440, lr: 1.75e-03, grad_scale: 32.0 2023-10-12 23:49:06,942 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.23 vs. limit=15.0 2023-10-12 23:49:12,221 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.53 vs. 
limit=22.5 2023-10-12 23:49:24,208 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1217888.0, ans=0.125 2023-10-12 23:49:29,664 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1217888.0, ans=0.1 2023-10-12 23:49:43,271 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1217934.6666666667, ans=0.125 2023-10-12 23:49:50,794 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1217934.6666666667, ans=0.125 2023-10-12 23:49:53,623 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1217981.3333333333, ans=0.125 2023-10-12 23:49:58,119 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.46 vs. limit=15.0 2023-10-12 23:50:20,853 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.42 vs. limit=15.0 2023-10-12 23:50:20,945 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.79 vs. limit=15.0 2023-10-12 23:50:29,762 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1218121.3333333333, ans=0.0 2023-10-12 23:50:33,360 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.439e+02 1.785e+02 1.961e+02 2.168e+02 3.154e+02, threshold=3.922e+02, percent-clipped=0.0 2023-10-12 23:50:47,723 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1218168.0, ans=0.2 2023-10-12 23:51:05,062 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.25 vs. limit=15.0 2023-10-12 23:51:12,596 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.68 vs. limit=15.0 2023-10-12 23:51:19,405 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.91 vs. limit=10.0 2023-10-12 23:51:34,045 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1218354.6666666667, ans=0.0 2023-10-12 23:51:47,619 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1218401.3333333333, ans=0.125 2023-10-12 23:51:50,263 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1218401.3333333333, ans=0.1 2023-10-12 23:52:03,327 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.31 vs. limit=15.0 2023-10-12 23:52:08,242 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.17 vs. 
limit=10.0 2023-10-12 23:52:17,252 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1218494.6666666667, ans=0.125 2023-10-12 23:52:20,579 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1218494.6666666667, ans=0.0 2023-10-12 23:52:37,109 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1218541.3333333333, ans=0.125 2023-10-12 23:52:41,709 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1218588.0, ans=0.0 2023-10-12 23:52:43,683 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.380e+02 1.720e+02 1.856e+02 2.026e+02 3.023e+02, threshold=3.712e+02, percent-clipped=0.0 2023-10-12 23:53:18,980 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1218728.0, ans=0.125 2023-10-12 23:53:20,858 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1218728.0, ans=0.125 2023-10-12 23:53:29,332 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1218774.6666666667, ans=0.05 2023-10-12 23:53:36,926 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1218821.3333333333, ans=0.1 2023-10-12 23:53:52,082 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1218868.0, ans=0.125 2023-10-12 23:54:12,458 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.85 vs. 
limit=6.0 2023-10-12 23:54:30,628 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1219008.0, ans=10.0 2023-10-12 23:54:41,900 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1219054.6666666667, ans=0.125 2023-10-12 23:54:43,219 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1219054.6666666667, ans=0.1 2023-10-12 23:54:43,280 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1219054.6666666667, ans=0.125 2023-10-12 23:54:45,566 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.491e+02 1.785e+02 1.968e+02 2.240e+02 2.942e+02, threshold=3.936e+02, percent-clipped=0.0 2023-10-12 23:54:53,829 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1219101.3333333333, ans=0.95 2023-10-12 23:55:02,726 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1219101.3333333333, ans=0.1 2023-10-12 23:55:06,608 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1219101.3333333333, ans=0.0 2023-10-12 23:55:13,289 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1219148.0, ans=0.09899494936611666 2023-10-12 23:55:18,020 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1219148.0, ans=0.1 2023-10-12 23:55:22,059 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1219194.6666666667, ans=0.125 2023-10-12 23:55:27,739 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1219194.6666666667, ans=0.0 2023-10-12 23:55:31,963 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1219241.3333333333, ans=0.125 2023-10-12 23:55:32,461 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.74 vs. 
limit=15.0 2023-10-12 23:55:35,365 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1219241.3333333333, ans=0.125 2023-10-12 23:55:39,188 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1219241.3333333333, ans=0.1 2023-10-12 23:55:41,359 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1219241.3333333333, ans=0.0 2023-10-12 23:55:46,855 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1219288.0, ans=0.125 2023-10-12 23:56:00,847 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1219334.6666666667, ans=0.1 2023-10-12 23:56:22,823 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1219428.0, ans=0.125 2023-10-12 23:56:33,830 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1219474.6666666667, ans=0.1 2023-10-12 23:56:37,655 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1219474.6666666667, ans=0.125 2023-10-12 23:56:46,569 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1219521.3333333333, ans=0.0 2023-10-12 23:56:47,315 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1219521.3333333333, ans=0.1 2023-10-12 23:56:50,310 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1219521.3333333333, ans=0.0 2023-10-12 23:56:50,891 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.434e+02 1.770e+02 1.967e+02 2.269e+02 2.840e+02, threshold=3.935e+02, percent-clipped=0.0 2023-10-12 23:56:51,988 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1219521.3333333333, ans=0.125 2023-10-12 23:57:04,497 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1219568.0, ans=0.125 2023-10-12 23:57:27,649 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1219661.3333333333, ans=0.125 2023-10-12 23:57:39,615 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1219708.0, ans=0.0 2023-10-12 23:57:59,180 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1219754.6666666667, ans=0.0 2023-10-12 23:58:14,173 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1219801.3333333333, ans=0.0 2023-10-12 23:58:40,932 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 23:58:41,484 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.66 vs. 
limit=5.0 2023-10-12 23:58:44,369 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1219894.6666666667, ans=0.0 2023-10-12 23:58:44,430 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1219894.6666666667, ans=0.1 2023-10-12 23:58:58,293 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.10 vs. limit=15.0 2023-10-12 23:59:02,580 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1219988.0, ans=0.0 2023-10-12 23:59:05,607 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.455e+02 1.765e+02 1.933e+02 2.173e+02 2.941e+02, threshold=3.865e+02, percent-clipped=0.0 2023-10-12 23:59:26,567 INFO [train.py:1031] (3/4) Epoch 20, batch 2000, loss[loss=0.1851, simple_loss=0.285, pruned_loss=0.04264, over 16896.00 frames. ], tot_loss[loss=0.1903, simple_loss=0.2819, pruned_loss=0.0494, over 20743711.28 frames. ], batch size: 87, lr: 1.75e-03, grad_scale: 16.0 2023-10-12 23:59:29,006 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1220081.3333333333, ans=0.1 2023-10-12 23:59:29,417 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.96 vs. limit=10.0 2023-10-12 23:59:31,881 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1220081.3333333333, ans=0.0 2023-10-12 23:59:49,140 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 23:59:49,539 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.49 vs. 
2023-10-13 00:00:05,849 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1220221.3333333333, ans=0.125
2023-10-13 00:00:29,983 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1220268.0, ans=0.125
2023-10-13 00:00:50,092 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1220361.3333333333, ans=0.125
2023-10-13 00:00:59,739 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1220408.0, ans=0.025
2023-10-13 00:01:06,486 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1220408.0, ans=0.125
2023-10-13 00:01:07,433 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1220408.0, ans=0.1
2023-10-13 00:01:13,866 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1220454.6666666667, ans=0.0
2023-10-13 00:01:15,047 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1220454.6666666667, ans=0.0
2023-10-13 00:01:17,216 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.365e+02 1.682e+02 1.903e+02 2.210e+02 2.934e+02, threshold=3.805e+02, percent-clipped=0.0
2023-10-13 00:01:20,284 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1220454.6666666667, ans=0.125
2023-10-13 00:01:24,222 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1220501.3333333333, ans=0.0
2023-10-13 00:02:36,134 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1220641.3333333333, ans=0.125
2023-10-13 00:02:37,910 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.17 vs. limit=6.0
2023-10-13 00:02:45,372 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1220688.0, ans=0.125
2023-10-13 00:03:26,042 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.57 vs. limit=10.0
2023-10-13 00:04:01,546 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1220921.3333333333, ans=0.2
2023-10-13 00:04:07,033 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.487e+02 1.800e+02 1.996e+02 2.161e+02 3.768e+02, threshold=3.992e+02, percent-clipped=0.0
2023-10-13 00:04:44,659 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1221061.3333333333, ans=0.125
2023-10-13 00:04:59,249 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1221154.6666666667, ans=0.125
2023-10-13 00:04:59,545 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.67 vs. limit=15.0
2023-10-13 00:05:08,964 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1221154.6666666667, ans=0.125
2023-10-13 00:05:23,191 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1221248.0, ans=0.1
2023-10-13 00:05:30,982 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1221248.0, ans=0.0
2023-10-13 00:05:31,088 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.45 vs. limit=22.5
2023-10-13 00:05:34,932 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.17 vs. limit=10.0
2023-10-13 00:05:58,004 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1221388.0, ans=0.2
2023-10-13 00:06:04,730 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.469e+02 1.864e+02 2.050e+02 2.346e+02 4.191e+02, threshold=4.100e+02, percent-clipped=1.0
2023-10-13 00:06:19,547 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1221434.6666666667, ans=0.125
2023-10-13 00:06:21,677 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1221434.6666666667, ans=0.0
2023-10-13 00:06:22,607 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1221481.3333333333, ans=0.125
2023-10-13 00:06:22,673 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1221481.3333333333, ans=0.2
2023-10-13 00:06:23,736 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1221481.3333333333, ans=0.2
2023-10-13 00:06:44,727 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1221528.0, ans=0.1
2023-10-13 00:06:45,788 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1221528.0, ans=0.0
2023-10-13 00:06:52,115 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.56 vs. limit=5.0
2023-10-13 00:07:57,527 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1221808.0, ans=0.125
2023-10-13 00:07:59,990 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1221808.0, ans=0.2
2023-10-13 00:08:05,093 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1221854.6666666667, ans=0.125
2023-10-13 00:08:06,241 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1221854.6666666667, ans=0.125
2023-10-13 00:08:08,392 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.493e+02 1.797e+02 1.937e+02 2.103e+02 2.617e+02, threshold=3.874e+02, percent-clipped=0.0
2023-10-13 00:08:15,259 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.70 vs. limit=15.0
2023-10-13 00:08:17,759 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1221901.3333333333, ans=0.2
2023-10-13 00:08:18,765 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1221901.3333333333, ans=0.07
2023-10-13 00:08:23,195 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.14 vs. limit=15.0
2023-10-13 00:09:00,492 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1222088.0, ans=0.125
2023-10-13 00:09:02,305 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.88 vs. limit=15.0
2023-10-13 00:09:17,277 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1222134.6666666667, ans=0.125
2023-10-13 00:09:26,115 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1222181.3333333333, ans=0.0
2023-10-13 00:09:58,922 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1222321.3333333333, ans=0.5
2023-10-13 00:10:01,435 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.511e+02 1.721e+02 1.949e+02 2.146e+02 2.819e+02, threshold=3.897e+02, percent-clipped=0.0
2023-10-13 00:10:05,399 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.16 vs. limit=12.0
2023-10-13 00:10:13,511 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1222368.0, ans=0.0
2023-10-13 00:10:17,633 INFO [train.py:1031] (3/4) Epoch 20, batch 2500, loss[loss=0.1862, simple_loss=0.2809, pruned_loss=0.04573, over 16866.00 frames. ], tot_loss[loss=0.1909, simple_loss=0.2824, pruned_loss=0.04973, over 23445968.37 frames. ], batch size: 188, lr: 1.75e-03, grad_scale: 16.0
2023-10-13 00:10:20,753 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1222414.6666666667, ans=0.125
2023-10-13 00:10:26,958 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1222414.6666666667, ans=0.1
2023-10-13 00:10:43,634 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1222508.0, ans=0.125
2023-10-13 00:10:45,734 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1222508.0, ans=0.5
2023-10-13 00:10:46,732 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1222508.0, ans=0.125
2023-10-13 00:11:10,758 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1222601.3333333333, ans=0.0
2023-10-13 00:11:18,148 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1222648.0, ans=0.2
2023-10-13 00:11:29,402 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1222694.6666666667, ans=0.0
2023-10-13 00:11:52,342 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1222788.0, ans=0.125
2023-10-13 00:11:55,341 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1222788.0, ans=0.0
2023-10-13 00:11:58,746 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.521e+02 1.816e+02 2.001e+02 2.170e+02 3.087e+02, threshold=4.003e+02, percent-clipped=0.0
2023-10-13 00:12:03,077 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1222834.6666666667, ans=0.125
2023-10-13 00:12:03,278 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.93 vs. limit=15.0
2023-10-13 00:12:24,449 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1222881.3333333333, ans=0.0
2023-10-13 00:12:55,406 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1223021.3333333333, ans=0.125
2023-10-13 00:13:02,658 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1223021.3333333333, ans=0.125
2023-10-13 00:13:03,647 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1223068.0, ans=0.125
2023-10-13 00:13:05,478 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1223068.0, ans=0.0
2023-10-13 00:13:11,469 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1223068.0, ans=0.0
2023-10-13 00:13:23,942 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.09 vs. limit=15.0
2023-10-13 00:13:24,707 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1223114.6666666667, ans=0.125
2023-10-13 00:13:29,177 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1223161.3333333333, ans=0.125
2023-10-13 00:13:32,879 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1223161.3333333333, ans=0.0
2023-10-13 00:13:32,889 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1223161.3333333333, ans=0.125
2023-10-13 00:13:37,939 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1223208.0, ans=0.125
2023-10-13 00:13:57,857 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.514e+02 1.784e+02 1.942e+02 2.299e+02 3.428e+02, threshold=3.883e+02, percent-clipped=0.0
2023-10-13 00:14:14,844 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1223348.0, ans=0.2
2023-10-13 00:14:20,963 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1223348.0, ans=0.125
2023-10-13 00:14:41,771 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1223441.3333333333, ans=0.125
2023-10-13 00:14:53,015 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1223488.0, ans=0.125
2023-10-13 00:15:24,662 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-13 00:15:33,922 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1223581.3333333333, ans=0.125
2023-10-13 00:15:41,446 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1223628.0, ans=0.0
2023-10-13 00:15:45,658 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.80 vs. limit=15.0
2023-10-13 00:15:49,958 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1223674.6666666667, ans=0.0
2023-10-13 00:16:01,921 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.39 vs. limit=12.0
2023-10-13 00:16:09,889 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.442e+02 1.721e+02 1.878e+02 2.077e+02 2.891e+02, threshold=3.757e+02, percent-clipped=0.0
2023-10-13 00:16:11,096 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.44 vs. limit=22.5
2023-10-13 00:16:32,515 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1223814.6666666667, ans=0.125
2023-10-13 00:16:34,200 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1223814.6666666667, ans=0.125
2023-10-13 00:16:43,827 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1223861.3333333333, ans=0.0
2023-10-13 00:16:47,044 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1223861.3333333333, ans=0.07
2023-10-13 00:16:55,021 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1223908.0, ans=0.1
2023-10-13 00:17:44,579 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1224048.0, ans=0.125
2023-10-13 00:17:54,579 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.47 vs. limit=15.0
2023-10-13 00:18:03,748 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1224094.6666666667, ans=0.2
2023-10-13 00:18:14,845 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1224141.3333333333, ans=0.125
2023-10-13 00:18:27,744 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1224188.0, ans=0.0
2023-10-13 00:18:30,926 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=11.90 vs. limit=22.5
2023-10-13 00:18:32,940 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.468e+02 1.789e+02 1.988e+02 2.204e+02 2.957e+02, threshold=3.976e+02, percent-clipped=0.0
2023-10-13 00:18:36,482 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1224188.0, ans=0.1
2023-10-13 00:18:42,846 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1224234.6666666667, ans=0.125
2023-10-13 00:19:11,041 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1224328.0, ans=0.025
2023-10-13 00:19:13,052 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1224328.0, ans=0.125
2023-10-13 00:19:59,805 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1224514.6666666667, ans=0.125
2023-10-13 00:20:10,905 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1224561.3333333333, ans=0.1
2023-10-13 00:20:10,946 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1224561.3333333333, ans=0.0
2023-10-13 00:20:11,757 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=1224561.3333333333, ans=0.05
2023-10-13 00:20:23,613 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.27 vs. limit=15.0
2023-10-13 00:20:27,212 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.81 vs. limit=15.0
2023-10-13 00:20:40,481 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1224654.6666666667, ans=0.1
2023-10-13 00:20:44,180 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.426e+02 1.759e+02 1.874e+02 2.097e+02 2.796e+02, threshold=3.748e+02, percent-clipped=0.0
2023-10-13 00:20:58,019 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1224701.3333333333, ans=0.0
2023-10-13 00:21:01,713 INFO [train.py:1031] (3/4) Epoch 20, batch 3000, loss[loss=0.1984, simple_loss=0.2945, pruned_loss=0.05121, over 16857.00 frames. ], tot_loss[loss=0.1905, simple_loss=0.2816, pruned_loss=0.04973, over 25502099.21 frames. ], batch size: 188, lr: 1.74e-03, grad_scale: 16.0
2023-10-13 00:21:06,103 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1224748.0, ans=0.0
2023-10-13 00:21:11,273 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1224748.0, ans=0.125
2023-10-13 00:21:15,489 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1224794.6666666667, ans=0.0
2023-10-13 00:21:15,520 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1224794.6666666667, ans=0.1
2023-10-13 00:21:55,460 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.76 vs. limit=10.0
2023-10-13 00:21:56,664 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.72 vs. limit=15.0
2023-10-13 00:22:48,860 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.569e+02 1.798e+02 1.998e+02 2.305e+02 3.052e+02, threshold=3.995e+02, percent-clipped=0.0
2023-10-13 00:22:51,166 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.61 vs. limit=15.0
2023-10-13 00:23:09,620 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1225214.6666666667, ans=0.125
2023-10-13 00:23:20,949 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_na.min_abs, batch_count=1225261.3333333333, ans=0.02
2023-10-13 00:23:37,086 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1225308.0, ans=0.0
2023-10-13 00:23:42,074 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1225308.0, ans=0.125
2023-10-13 00:24:15,264 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.47 vs. limit=22.5
2023-10-13 00:24:15,886 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1225448.0, ans=0.125
2023-10-13 00:24:23,882 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=1225494.6666666667, ans=0.5
2023-10-13 00:24:26,657 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.87 vs. limit=6.0
2023-10-13 00:24:43,350 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1225588.0, ans=0.0
2023-10-13 00:24:45,545 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1225588.0, ans=0.0
2023-10-13 00:24:48,021 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1225588.0, ans=0.0
2023-10-13 00:24:49,794 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.434e+02 1.745e+02 1.899e+02 2.145e+02 2.873e+02, threshold=3.797e+02, percent-clipped=0.0
2023-10-13 00:24:50,462 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.65 vs. limit=15.0
2023-10-13 00:24:51,114 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1225588.0, ans=0.125
2023-10-13 00:24:53,029 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1225634.6666666667, ans=0.05
2023-10-13 00:25:15,484 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1225728.0, ans=0.125
2023-10-13 00:25:16,280 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1225728.0, ans=0.125
2023-10-13 00:25:18,680 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1225728.0, ans=0.125
2023-10-13 00:25:19,915 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1225728.0, ans=0.125
2023-10-13 00:25:38,041 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1225774.6666666667, ans=0.125
2023-10-13 00:25:50,124 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1225821.3333333333, ans=0.09899494936611666
2023-10-13 00:25:50,165 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1225821.3333333333, ans=0.125
2023-10-13 00:26:06,239 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1225868.0, ans=0.125
2023-10-13 00:26:40,232 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1225961.3333333333, ans=0.0
2023-10-13 00:26:51,671 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1226008.0, ans=0.1
2023-10-13 00:26:57,116 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1226054.6666666667, ans=0.125
2023-10-13 00:26:58,581 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1226054.6666666667, ans=0.2
2023-10-13 00:27:04,857 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.585e+02 1.835e+02 1.998e+02 2.141e+02 3.029e+02, threshold=3.995e+02, percent-clipped=0.0
2023-10-13 00:27:24,553 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1226148.0, ans=0.1
2023-10-13 00:27:30,404 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1226148.0, ans=0.125
2023-10-13 00:27:31,454 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1226148.0, ans=0.0
2023-10-13 00:27:46,128 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1226194.6666666667, ans=0.2
2023-10-13 00:27:48,140 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1226241.3333333333, ans=0.125
2023-10-13 00:28:13,806 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.15 vs. limit=22.5
2023-10-13 00:28:23,028 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.81 vs. limit=12.0
2023-10-13 00:28:41,998 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1226428.0, ans=0.125
2023-10-13 00:28:46,573 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.74 vs. limit=6.0
2023-10-13 00:29:13,018 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.373e+02 1.772e+02 1.913e+02 2.123e+02 2.620e+02, threshold=3.827e+02, percent-clipped=0.0
2023-10-13 00:29:15,736 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1226568.0, ans=0.04949747468305833
2023-10-13 00:29:27,973 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1226614.6666666667, ans=0.0
2023-10-13 00:29:38,659 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1226661.3333333333, ans=0.1
2023-10-13 00:29:39,601 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1226661.3333333333, ans=0.1
2023-10-13 00:29:42,464 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1226661.3333333333, ans=0.1
2023-10-13 00:29:48,431 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1226661.3333333333, ans=0.1
2023-10-13 00:29:53,090 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1226708.0, ans=0.2
2023-10-13 00:30:04,673 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1226754.6666666667, ans=0.125
2023-10-13 00:30:10,568 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1226754.6666666667, ans=0.125
2023-10-13 00:30:10,913 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.15 vs. limit=22.5
2023-10-13 00:30:25,919 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1226848.0, ans=0.0
2023-10-13 00:30:45,755 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.88 vs. limit=10.0
2023-10-13 00:30:47,284 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1226894.6666666667, ans=0.0
2023-10-13 00:30:57,600 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.67 vs. limit=15.0
2023-10-13 00:31:11,972 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1226988.0, ans=0.2
2023-10-13 00:31:14,626 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.541e+02 1.816e+02 1.966e+02 2.170e+02 2.740e+02, threshold=3.932e+02, percent-clipped=0.0
2023-10-13 00:31:17,245 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.50 vs. limit=10.0
2023-10-13 00:31:18,167 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1227034.6666666667, ans=0.0
2023-10-13 00:31:29,574 INFO [train.py:1031] (3/4) Epoch 20, batch 3500, loss[loss=0.187, simple_loss=0.2553, pruned_loss=0.05936, over 12599.00 frames. ], tot_loss[loss=0.1903, simple_loss=0.2812, pruned_loss=0.04967, over 27084300.28 frames. ], batch size: 440, lr: 1.74e-03, grad_scale: 16.0
2023-10-13 00:31:32,264 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.94 vs. limit=15.0
2023-10-13 00:31:38,598 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1227081.3333333333, ans=0.0
2023-10-13 00:31:38,650 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1227081.3333333333, ans=0.05
2023-10-13 00:31:41,194 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.05 vs. limit=10.0
2023-10-13 00:31:41,857 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1227081.3333333333, ans=0.1
2023-10-13 00:31:42,912 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1227081.3333333333, ans=0.1
2023-10-13 00:31:55,717 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1227128.0, ans=0.0
2023-10-13 00:31:58,285 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1227174.6666666667, ans=0.2
2023-10-13 00:32:00,463 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1227174.6666666667, ans=0.125
2023-10-13 00:32:17,675 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1227221.3333333333, ans=0.125
2023-10-13 00:32:36,911 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1227314.6666666667, ans=0.09899494936611666
2023-10-13 00:33:20,723 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1227408.0, ans=0.0
2023-10-13 00:33:23,286 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1227408.0, ans=0.125
2023-10-13 00:33:34,842 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.819e+02 2.018e+02 2.440e+02 3.265e+02, threshold=4.037e+02, percent-clipped=0.0
2023-10-13 00:33:53,416 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=6.38 vs. limit=15.0
2023-10-13 00:34:30,589 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1227688.0, ans=0.0
2023-10-13 00:34:46,413 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1227734.6666666667, ans=0.125
2023-10-13 00:34:47,499 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1227734.6666666667, ans=0.125
2023-10-13 00:34:49,608 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1227734.6666666667, ans=0.0
2023-10-13 00:34:59,158 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1227781.3333333333, ans=0.0
2023-10-13 00:35:35,409 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-13 00:35:47,984 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.424e+02 1.726e+02 1.854e+02 2.049e+02 3.424e+02, threshold=3.708e+02, percent-clipped=0.0
2023-10-13 00:35:54,437 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1227968.0, ans=0.0
2023-10-13 00:35:58,272 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1227968.0, ans=0.125
2023-10-13 00:36:05,665 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1228014.6666666667, ans=0.125
2023-10-13 00:36:18,263 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1228061.3333333333, ans=0.125
2023-10-13 00:36:56,928 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1228154.6666666667, ans=0.0
2023-10-13 00:37:16,017 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.24 vs. limit=15.0
2023-10-13 00:37:35,213 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=1228294.6666666667, ans=0.95
2023-10-13 00:38:06,194 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1228388.0, ans=0.2
2023-10-13 00:38:06,822 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.498e+02 1.680e+02 1.875e+02 2.042e+02 2.976e+02, threshold=3.751e+02, percent-clipped=0.0
2023-10-13 00:38:09,936 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-13 00:38:10,857 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1228434.6666666667, ans=0.125
2023-10-13 00:38:35,982 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1228528.0, ans=0.0
2023-10-13 00:38:51,736 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1228574.6666666667, ans=0.125
2023-10-13 00:39:11,249 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1228668.0, ans=0.125
2023-10-13 00:39:15,789 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1228668.0, ans=0.125
2023-10-13 00:39:38,625 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1228761.3333333333, ans=0.0
2023-10-13 00:39:44,149 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1228761.3333333333, ans=0.125
2023-10-13 00:39:58,475 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1228854.6666666667, ans=0.125
2023-10-13 00:40:06,233 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-13 00:40:08,989 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.463e+02 1.717e+02 1.889e+02 2.084e+02 3.098e+02, threshold=3.779e+02, percent-clipped=0.0
2023-10-13 00:40:24,230 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.17 vs. limit=15.0
2023-10-13 00:40:24,953 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1228948.0, ans=0.2
2023-10-13 00:40:30,736 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1228948.0, ans=0.04949747468305833
2023-10-13 00:40:30,818 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.70 vs. limit=15.0
2023-10-13 00:40:38,772 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.56 vs. limit=15.0
2023-10-13 00:41:07,911 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.80 vs. limit=15.0
2023-10-13 00:41:19,702 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1229134.6666666667, ans=0.125
2023-10-13 00:41:40,893 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1229228.0, ans=0.04949747468305833
2023-10-13 00:41:43,888 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1229228.0, ans=0.125
2023-10-13 00:42:02,563 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.99 vs. limit=22.5
2023-10-13 00:42:13,289 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.357e+02 1.746e+02 1.899e+02 2.168e+02 2.940e+02, threshold=3.798e+02, percent-clipped=0.0
2023-10-13 00:42:27,202 INFO [train.py:1031] (3/4) Epoch 20, batch 4000, loss[loss=0.2133, simple_loss=0.3034, pruned_loss=0.06159, over 16908.00 frames. ], tot_loss[loss=0.1904, simple_loss=0.281, pruned_loss=0.04993, over 28336303.87 frames. ], batch size: 110, lr: 1.74e-03, grad_scale: 32.0
2023-10-13 00:42:46,680 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1229461.3333333333, ans=0.0
2023-10-13 00:43:05,878 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1229508.0, ans=0.0
2023-10-13 00:43:10,542 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1229554.6666666667, ans=0.0
2023-10-13 00:43:12,368 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1229554.6666666667, ans=0.125
2023-10-13 00:43:23,832 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=9.90 vs. limit=22.5
2023-10-13 00:43:25,127 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1229601.3333333333, ans=0.2
2023-10-13 00:43:29,048 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1229648.0, ans=0.125
2023-10-13 00:43:34,232 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1229648.0, ans=0.125
2023-10-13 00:43:57,528 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1229741.3333333333, ans=0.0
2023-10-13 00:44:01,164 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.79 vs. limit=8.0
2023-10-13 00:44:14,760 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.510e+02 1.746e+02 1.878e+02 2.082e+02 2.617e+02, threshold=3.755e+02, percent-clipped=0.0
2023-10-13 00:44:30,671 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1229881.3333333333, ans=0.1
2023-10-13 00:44:30,961 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.30 vs. limit=12.0
2023-10-13 00:44:38,822 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.77 vs. limit=10.0
2023-10-13 00:44:43,766 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1229928.0, ans=0.07
2023-10-13 00:45:01,731 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1230021.3333333333, ans=0.0
2023-10-13 00:45:03,008 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1230021.3333333333, ans=0.0
2023-10-13 00:45:24,191 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.34 vs. limit=15.0
2023-10-13 00:46:18,198 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1230254.6666666667, ans=0.1
2023-10-13 00:46:25,866 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.528e+02 1.800e+02 2.017e+02 2.267e+02 3.667e+02, threshold=4.034e+02, percent-clipped=0.0
2023-10-13 00:46:40,098 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1230348.0, ans=0.1
2023-10-13 00:46:44,706 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1230348.0, ans=0.0
2023-10-13 00:47:12,436 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1230441.3333333333, ans=0.0
2023-10-13 00:47:15,164 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1230488.0, ans=0.1
2023-10-13 00:47:21,626 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=5.36 vs. limit=15.0
2023-10-13 00:47:28,415 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1230534.6666666667, ans=0.1
2023-10-13 00:47:53,173 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.85 vs. limit=12.0
2023-10-13 00:48:18,151 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1230721.3333333333, ans=0.0
2023-10-13 00:48:25,006 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1230721.3333333333, ans=0.1
2023-10-13 00:48:25,106 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1230721.3333333333, ans=0.0
2023-10-13 00:48:25,636 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.504e+02 1.770e+02 1.937e+02 2.264e+02 3.243e+02, threshold=3.873e+02, percent-clipped=0.0
2023-10-13 00:48:35,313 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1230768.0, ans=0.0
2023-10-13 00:48:37,992 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.24 vs. limit=22.5
2023-10-13 00:48:44,748 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.73 vs. limit=6.0
2023-10-13 00:48:45,516 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1230814.6666666667, ans=0.125
2023-10-13 00:48:50,061 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-13 00:48:59,820 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1230861.3333333333, ans=0.125
2023-10-13 00:49:19,137 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2023-10-13 00:49:30,293 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1231001.3333333333, ans=0.1
2023-10-13 00:49:44,254 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1231048.0, ans=0.0
2023-10-13 00:49:55,220 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1231094.6666666667, ans=0.0
2023-10-13 00:50:09,161 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.52 vs. limit=15.0
2023-10-13 00:50:27,472 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.537e+02 1.864e+02 2.027e+02 2.305e+02 3.344e+02, threshold=4.055e+02, percent-clipped=0.0
2023-10-13 00:50:30,104 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=1231234.6666666667, ans=0.07
2023-10-13 00:50:34,781 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1231234.6666666667, ans=0.125
2023-10-13 00:50:38,863 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1231234.6666666667, ans=0.0
2023-10-13 00:50:53,165 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1231281.3333333333, ans=0.125
2023-10-13 00:51:13,857 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.69 vs. limit=5.0
2023-10-13 00:51:41,781 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.03 vs. limit=15.0
2023-10-13 00:52:02,624 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.40 vs. limit=15.0
2023-10-13 00:52:29,800 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1231561.3333333333, ans=0.125
2023-10-13 00:52:29,864 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1231561.3333333333, ans=0.1
2023-10-13 00:52:52,430 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1231654.6666666667, ans=0.0
2023-10-13 00:53:03,374 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.432e+02 1.789e+02 1.973e+02 2.179e+02 3.350e+02, threshold=3.945e+02, percent-clipped=0.0
2023-10-13 00:53:14,115 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1231701.3333333333, ans=0.0
2023-10-13 00:53:17,653 INFO [train.py:1031] (3/4) Epoch 20, batch 4500, loss[loss=0.1819, simple_loss=0.2821, pruned_loss=0.04079, over 16813.00 frames. ], tot_loss[loss=0.1905, simple_loss=0.2815, pruned_loss=0.04976, over 29349516.87 frames. ], batch size: 98, lr: 1.74e-03, grad_scale: 32.0
2023-10-13 00:53:23,229 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1231748.0, ans=0.125
2023-10-13 00:53:25,464 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1231748.0, ans=0.125
2023-10-13 00:53:37,323 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1231794.6666666667, ans=0.125
2023-10-13 00:53:43,617 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1231841.3333333333, ans=0.2
2023-10-13 00:53:49,314 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1231841.3333333333, ans=0.0
2023-10-13 00:53:59,570 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1231888.0, ans=0.125
2023-10-13 00:54:02,111 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1231888.0, ans=0.2
2023-10-13 00:54:22,145 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1231981.3333333333, ans=0.125
2023-10-13 00:54:52,404 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1232074.6666666667, ans=0.125
2023-10-13 00:55:13,146 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.503e+02 1.807e+02 1.952e+02 2.319e+02 3.341e+02, threshold=3.905e+02, percent-clipped=0.0
2023-10-13 00:55:17,852 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.05 vs. limit=6.0
2023-10-13 00:55:19,414 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-13 00:55:35,017 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1232261.3333333333, ans=0.0
2023-10-13 00:55:47,387 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1232308.0, ans=0.0
2023-10-13 00:55:53,258 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1232308.0, ans=0.2
2023-10-13 00:56:09,377 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1232401.3333333333, ans=0.0
2023-10-13 00:56:13,915 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2023-10-13 00:56:14,036 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1232401.3333333333, ans=0.0
2023-10-13 00:56:17,805 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=3.85 vs. limit=10.0
2023-10-13 00:56:20,114 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1232448.0, ans=0.125
2023-10-13 00:56:20,296 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1232448.0, ans=0.125
2023-10-13 00:56:24,406 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.51 vs. limit=6.0
2023-10-13 00:56:26,847 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1232448.0, ans=0.125
2023-10-13 00:56:33,773 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1232494.6666666667, ans=0.1
2023-10-13 00:56:44,115 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1232541.3333333333, ans=0.2
2023-10-13 00:56:48,513 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1232541.3333333333, ans=0.0
2023-10-13 00:56:53,788 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1232541.3333333333, ans=0.1
2023-10-13 00:57:03,419 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1232588.0, ans=0.125
2023-10-13 00:57:06,644 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.499e+02 1.797e+02 1.996e+02 2.225e+02 3.605e+02, threshold=3.992e+02, percent-clipped=0.0
2023-10-13 00:57:16,855 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1232634.6666666667, ans=0.1
2023-10-13 00:57:35,703 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=1232681.3333333333, ans=0.95
2023-10-13 00:57:38,752 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1232728.0, ans=0.125
2023-10-13 00:57:54,478 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1232728.0, ans=0.125
2023-10-13 00:58:00,315 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1232774.6666666667, ans=0.0
2023-10-13 00:58:00,450 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1232774.6666666667, ans=0.2
2023-10-13 00:58:18,812 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1232821.3333333333, ans=0.1
2023-10-13 00:58:41,568 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1232914.6666666667, ans=0.0
2023-10-13 00:58:59,340 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1233008.0, ans=0.0
2023-10-13 00:59:07,053 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1233054.6666666667, ans=0.0
2023-10-13 00:59:11,423 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1233054.6666666667, ans=0.125
2023-10-13 00:59:19,137 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.504e+02 1.805e+02 1.954e+02 2.241e+02 3.244e+02, threshold=3.907e+02, percent-clipped=0.0
2023-10-13 00:59:22,505 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1233101.3333333333, ans=0.0
2023-10-13 00:59:41,539 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1233148.0, ans=0.2
2023-10-13 00:59:51,332 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
00:59:51,332 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 01:00:06,580 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1233241.3333333333, ans=0.2 2023-10-13 01:00:20,079 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1233288.0, ans=0.125 2023-10-13 01:00:36,258 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1233381.3333333333, ans=0.1 2023-10-13 01:00:40,141 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.62 vs. limit=6.0 2023-10-13 01:00:44,133 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.57 vs. limit=15.0 2023-10-13 01:00:44,656 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1233381.3333333333, ans=0.125 2023-10-13 01:00:46,950 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1233428.0, ans=0.2 2023-10-13 01:01:19,580 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.455e+02 1.670e+02 1.834e+02 2.022e+02 2.686e+02, threshold=3.668e+02, percent-clipped=0.0 2023-10-13 01:01:38,443 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1233614.6666666667, ans=0.0 2023-10-13 01:01:50,521 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1233661.3333333333, ans=0.125 2023-10-13 01:02:17,315 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten.whitening_limit, batch_count=1233754.6666666667, ans=15.0 2023-10-13 01:02:33,856 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.49 vs. limit=22.5 2023-10-13 01:02:49,438 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1233848.0, ans=0.0 2023-10-13 01:02:51,674 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.22 vs. limit=15.0 2023-10-13 01:03:09,063 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1233941.3333333333, ans=0.0 2023-10-13 01:03:11,455 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1233941.3333333333, ans=0.125 2023-10-13 01:03:15,882 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.44 vs. 
limit=15.0 2023-10-13 01:03:20,023 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 01:03:32,530 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.445e+02 1.760e+02 1.902e+02 2.132e+02 3.724e+02, threshold=3.804e+02, percent-clipped=1.0 2023-10-13 01:03:42,691 INFO [train.py:1031] (3/4) Epoch 20, batch 5000, loss[loss=0.2031, simple_loss=0.2908, pruned_loss=0.05766, over 16625.00 frames. ], tot_loss[loss=0.1905, simple_loss=0.2813, pruned_loss=0.04984, over 30120971.48 frames. ], batch size: 241, lr: 1.74e-03, grad_scale: 16.0 2023-10-13 01:04:18,939 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.80 vs. limit=15.0 2023-10-13 01:04:21,290 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1234221.3333333333, ans=0.1 2023-10-13 01:04:23,273 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1234221.3333333333, ans=0.125 2023-10-13 01:04:32,924 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1234268.0, ans=0.125 2023-10-13 01:04:39,587 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1234268.0, ans=0.2 2023-10-13 01:04:41,135 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.97 vs. limit=12.0 2023-10-13 01:04:57,403 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1234361.3333333333, ans=0.0 2023-10-13 01:05:04,584 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1234361.3333333333, ans=0.125 2023-10-13 01:05:07,484 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.84 vs. limit=15.0 2023-10-13 01:05:11,929 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.min_positive, batch_count=1234408.0, ans=0.025 2023-10-13 01:05:26,343 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1234454.6666666667, ans=0.125 2023-10-13 01:05:32,239 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1234454.6666666667, ans=0.1 2023-10-13 01:05:32,384 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1234454.6666666667, ans=0.07 2023-10-13 01:05:36,351 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.486e+02 1.826e+02 1.999e+02 2.240e+02 3.305e+02, threshold=3.997e+02, percent-clipped=0.0 2023-10-13 01:05:50,903 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.37 vs. 
limit=15.0 2023-10-13 01:05:53,629 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1234548.0, ans=0.0 2023-10-13 01:05:58,943 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1234594.6666666667, ans=0.125 2023-10-13 01:06:04,401 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1234594.6666666667, ans=0.0 2023-10-13 01:06:32,412 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1234688.0, ans=0.0 2023-10-13 01:06:39,006 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1234734.6666666667, ans=0.0 2023-10-13 01:06:54,613 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1234781.3333333333, ans=10.0 2023-10-13 01:07:23,213 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1234921.3333333333, ans=0.125 2023-10-13 01:07:36,557 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 01:07:36,650 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 01:07:37,939 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.422e+02 1.733e+02 1.889e+02 2.151e+02 2.721e+02, threshold=3.779e+02, percent-clipped=0.0 2023-10-13 01:07:39,204 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=1234968.0, ans=0.025 2023-10-13 01:07:58,008 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.12 vs. limit=10.0 2023-10-13 01:07:59,905 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1235061.3333333333, ans=0.0 2023-10-13 01:08:33,805 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1235154.6666666667, ans=0.0 2023-10-13 01:08:50,989 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1235201.3333333333, ans=0.95 2023-10-13 01:08:51,144 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1235201.3333333333, ans=0.125 2023-10-13 01:09:09,355 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1235294.6666666667, ans=0.05 2023-10-13 01:09:11,708 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1235294.6666666667, ans=0.125 2023-10-13 01:09:17,675 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1235294.6666666667, ans=0.125 2023-10-13 01:09:19,943 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.64 vs. 
limit=15.0 2023-10-13 01:09:25,445 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1235341.3333333333, ans=10.0 2023-10-13 01:09:32,018 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1235341.3333333333, ans=0.125 2023-10-13 01:09:32,984 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1235388.0, ans=0.125 2023-10-13 01:09:49,848 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.526e+02 1.816e+02 1.967e+02 2.221e+02 3.663e+02, threshold=3.934e+02, percent-clipped=0.0 2023-10-13 01:10:14,666 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1235528.0, ans=0.0 2023-10-13 01:10:25,238 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1235574.6666666667, ans=0.0 2023-10-13 01:10:39,176 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.27 vs. limit=10.0 2023-10-13 01:10:55,585 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1235668.0, ans=0.0 2023-10-13 01:10:59,041 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.79 vs. limit=15.0 2023-10-13 01:11:19,300 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1235761.3333333333, ans=0.0 2023-10-13 01:11:34,440 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.83 vs. limit=15.0 2023-10-13 01:11:46,865 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1235854.6666666667, ans=0.0 2023-10-13 01:11:49,493 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.85 vs. limit=15.0 2023-10-13 01:11:58,138 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.485e+02 1.724e+02 1.890e+02 2.180e+02 3.031e+02, threshold=3.779e+02, percent-clipped=0.0 2023-10-13 01:12:04,961 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1235901.3333333333, ans=0.1 2023-10-13 01:12:22,906 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.47 vs. limit=15.0 2023-10-13 01:12:34,725 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1236041.3333333333, ans=0.1 2023-10-13 01:12:43,347 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1236088.0, ans=0.0 2023-10-13 01:12:44,607 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.87 vs. 
limit=15.0 2023-10-13 01:13:04,925 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1236134.6666666667, ans=0.125 2023-10-13 01:13:21,610 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.14 vs. limit=22.5 2023-10-13 01:13:48,686 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1236321.3333333333, ans=0.1 2023-10-13 01:13:51,525 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1236321.3333333333, ans=0.1 2023-10-13 01:13:57,840 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.463e+02 1.768e+02 1.908e+02 2.135e+02 2.825e+02, threshold=3.816e+02, percent-clipped=0.0 2023-10-13 01:14:04,716 INFO [train.py:1031] (3/4) Epoch 20, batch 5500, loss[loss=0.1694, simple_loss=0.2676, pruned_loss=0.03559, over 16852.00 frames. ], tot_loss[loss=0.1901, simple_loss=0.281, pruned_loss=0.0496, over 30712671.92 frames. ], batch size: 98, lr: 1.74e-03, grad_scale: 8.0 2023-10-13 01:14:27,522 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1236508.0, ans=0.125 2023-10-13 01:14:32,561 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1236508.0, ans=0.1 2023-10-13 01:14:39,239 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1236554.6666666667, ans=0.1 2023-10-13 01:14:43,716 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1236554.6666666667, ans=0.1 2023-10-13 01:15:08,968 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1236648.0, ans=0.125 2023-10-13 01:15:25,806 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1236741.3333333333, ans=0.125 2023-10-13 01:15:32,082 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1236741.3333333333, ans=0.125 2023-10-13 01:15:44,592 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1236788.0, ans=0.1 2023-10-13 01:15:51,082 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.472e+02 1.726e+02 1.926e+02 2.123e+02 3.044e+02, threshold=3.853e+02, percent-clipped=0.0 2023-10-13 01:15:59,504 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1236881.3333333333, ans=0.0 2023-10-13 01:16:04,856 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1236881.3333333333, ans=0.0 2023-10-13 01:16:05,013 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1236881.3333333333, ans=0.2 2023-10-13 01:16:05,021 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1236881.3333333333, ans=0.2 2023-10-13 01:16:08,230 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1236881.3333333333, ans=0.125 2023-10-13 01:16:21,602 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1236974.6666666667, ans=0.0 2023-10-13 01:16:29,840 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.70 vs. limit=15.0 2023-10-13 01:17:06,288 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 01:17:08,472 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.66 vs. limit=15.0 2023-10-13 01:17:13,360 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1237161.3333333333, ans=0.125 2023-10-13 01:17:14,379 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1237161.3333333333, ans=0.0 2023-10-13 01:17:15,355 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1237161.3333333333, ans=0.125 2023-10-13 01:17:51,735 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.558e+02 1.764e+02 2.022e+02 2.250e+02 3.316e+02, threshold=4.045e+02, percent-clipped=0.0 2023-10-13 01:18:08,000 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1237348.0, ans=0.125 2023-10-13 01:18:16,782 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1237394.6666666667, ans=0.125 2023-10-13 01:18:16,838 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1237394.6666666667, ans=0.125 2023-10-13 01:18:24,963 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=1237441.3333333333, ans=0.025 2023-10-13 01:18:25,123 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1237441.3333333333, ans=0.125 2023-10-13 01:18:53,636 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1237534.6666666667, ans=0.125 2023-10-13 01:18:56,941 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1237534.6666666667, ans=0.1 2023-10-13 01:19:01,144 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1237581.3333333333, ans=0.1 2023-10-13 01:19:04,833 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1237581.3333333333, ans=0.2 2023-10-13 01:19:05,053 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.78 vs. 
limit=22.5 2023-10-13 01:19:08,811 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1237581.3333333333, ans=0.0 2023-10-13 01:19:09,854 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1237581.3333333333, ans=0.125 2023-10-13 01:19:31,497 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1237674.6666666667, ans=10.0 2023-10-13 01:19:39,096 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 01:19:45,989 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1237721.3333333333, ans=0.1 2023-10-13 01:19:49,425 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1237768.0, ans=0.0 2023-10-13 01:19:53,284 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.552e+02 1.789e+02 1.973e+02 2.231e+02 4.908e+02, threshold=3.946e+02, percent-clipped=1.0 2023-10-13 01:19:53,751 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1237768.0, ans=0.2 2023-10-13 01:20:21,687 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1237861.3333333333, ans=0.125 2023-10-13 01:20:27,430 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1237908.0, ans=0.0 2023-10-13 01:20:32,605 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.29 vs. limit=6.0 2023-10-13 01:20:48,183 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1237954.6666666667, ans=0.2 2023-10-13 01:20:57,505 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1238001.3333333333, ans=0.1 2023-10-13 01:21:04,339 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.74 vs. limit=15.0 2023-10-13 01:21:08,493 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.41 vs. 
limit=22.5 2023-10-13 01:21:14,513 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1238094.6666666667, ans=0.125 2023-10-13 01:21:15,530 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1238094.6666666667, ans=0.2 2023-10-13 01:21:22,831 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1238094.6666666667, ans=0.125 2023-10-13 01:21:25,492 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1238141.3333333333, ans=0.05 2023-10-13 01:21:47,516 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.min_positive, batch_count=1238188.0, ans=0.025 2023-10-13 01:21:47,712 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.44 vs. limit=15.0 2023-10-13 01:21:52,543 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.06 vs. limit=6.0 2023-10-13 01:21:54,939 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.486e+02 1.750e+02 1.904e+02 2.096e+02 3.025e+02, threshold=3.808e+02, percent-clipped=0.0 2023-10-13 01:22:24,929 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1238328.0, ans=0.1 2023-10-13 01:22:53,228 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=1238421.3333333333, ans=15.0 2023-10-13 01:23:15,713 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1238561.3333333333, ans=0.0 2023-10-13 01:23:18,890 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.43 vs. limit=15.0 2023-10-13 01:23:28,879 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1238608.0, ans=0.2 2023-10-13 01:23:37,289 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.90 vs. limit=15.0 2023-10-13 01:23:46,755 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1238654.6666666667, ans=0.2 2023-10-13 01:23:49,024 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.87 vs. 
limit=15.0 2023-10-13 01:23:49,936 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1238701.3333333333, ans=0.125 2023-10-13 01:23:54,992 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1238701.3333333333, ans=0.125 2023-10-13 01:23:55,585 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.467e+02 1.758e+02 1.940e+02 2.168e+02 2.907e+02, threshold=3.881e+02, percent-clipped=0.0 2023-10-13 01:24:02,832 INFO [train.py:1031] (3/4) Epoch 20, batch 6000, loss[loss=0.1737, simple_loss=0.2772, pruned_loss=0.0351, over 16872.00 frames. ], tot_loss[loss=0.1905, simple_loss=0.2815, pruned_loss=0.04973, over 31193692.41 frames. ], batch size: 98, lr: 1.73e-03, grad_scale: 16.0 2023-10-13 01:24:19,344 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1238794.6666666667, ans=0.125 2023-10-13 01:24:19,598 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.44 vs. limit=22.5 2023-10-13 01:24:26,890 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1238841.3333333333, ans=0.05 2023-10-13 01:24:29,950 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1238841.3333333333, ans=0.125 2023-10-13 01:24:30,026 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1238841.3333333333, ans=0.0 2023-10-13 01:24:33,100 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.64 vs. limit=12.0 2023-10-13 01:24:56,727 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1238934.6666666667, ans=0.1 2023-10-13 01:25:15,066 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1238981.3333333333, ans=0.125 2023-10-13 01:25:28,025 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.67 vs. limit=6.0 2023-10-13 01:25:44,399 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1239121.3333333333, ans=0.0 2023-10-13 01:25:53,997 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.45 vs. 
limit=12.0 2023-10-13 01:25:59,345 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.493e+02 1.767e+02 1.932e+02 2.129e+02 3.413e+02, threshold=3.864e+02, percent-clipped=0.0 2023-10-13 01:26:00,842 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1239168.0, ans=0.125 2023-10-13 01:26:06,775 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1239214.6666666667, ans=0.1 2023-10-13 01:26:20,061 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1239261.3333333333, ans=0.0 2023-10-13 01:26:33,700 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1239308.0, ans=0.0 2023-10-13 01:26:47,845 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1239354.6666666667, ans=0.2 2023-10-13 01:27:20,814 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1239494.6666666667, ans=0.125 2023-10-13 01:27:22,617 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1239494.6666666667, ans=0.125 2023-10-13 01:27:28,513 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1239541.3333333333, ans=0.5 2023-10-13 01:27:36,989 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1239541.3333333333, ans=0.125 2023-10-13 01:27:41,697 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1239541.3333333333, ans=0.1 2023-10-13 01:27:52,379 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1239588.0, ans=15.0 2023-10-13 01:28:01,955 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.623e+02 1.853e+02 2.011e+02 2.168e+02 2.642e+02, threshold=4.022e+02, percent-clipped=0.0 2023-10-13 01:28:51,917 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.47 vs. limit=22.5 2023-10-13 01:29:26,597 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1239961.3333333333, ans=0.125 2023-10-13 01:29:51,513 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.49 vs. limit=10.0 2023-10-13 01:30:00,614 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=1240101.3333333333, ans=0.05 2023-10-13 01:30:01,280 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.412e+02 1.766e+02 1.901e+02 2.094e+02 3.326e+02, threshold=3.802e+02, percent-clipped=0.0 2023-10-13 01:30:10,566 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1240148.0, ans=0.125 2023-10-13 01:30:41,034 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.18 vs. 
limit=22.5 2023-10-13 01:30:42,124 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1240241.3333333333, ans=0.125 2023-10-13 01:30:53,401 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1240288.0, ans=0.125 2023-10-13 01:31:21,955 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1240381.3333333333, ans=0.125 2023-10-13 01:31:55,143 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1240474.6666666667, ans=0.125 2023-10-13 01:32:22,133 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.419e+02 1.778e+02 2.024e+02 2.290e+02 3.250e+02, threshold=4.048e+02, percent-clipped=0.0 2023-10-13 01:32:28,135 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1240614.6666666667, ans=0.1 2023-10-13 01:33:09,248 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1240754.6666666667, ans=0.0 2023-10-13 01:33:12,900 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.70 vs. limit=15.0 2023-10-13 01:33:12,988 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.54 vs. limit=15.0 2023-10-13 01:33:13,749 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1240754.6666666667, ans=0.125 2023-10-13 01:33:32,592 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1240848.0, ans=0.0 2023-10-13 01:33:35,625 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1240848.0, ans=0.125 2023-10-13 01:33:42,889 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.55 vs. limit=6.0 2023-10-13 01:33:43,542 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1240894.6666666667, ans=0.0 2023-10-13 01:33:57,745 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1240941.3333333333, ans=0.1 2023-10-13 01:34:21,951 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.477e+02 1.704e+02 1.949e+02 2.211e+02 3.432e+02, threshold=3.898e+02, percent-clipped=0.0 2023-10-13 01:34:22,203 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1241034.6666666667, ans=0.1 2023-10-13 01:34:27,062 INFO [train.py:1031] (3/4) Epoch 20, batch 6500, loss[loss=0.1915, simple_loss=0.275, pruned_loss=0.05397, over 15497.00 frames. ], tot_loss[loss=0.191, simple_loss=0.282, pruned_loss=0.04996, over 31541960.88 frames. ], batch size: 35, lr: 1.73e-03, grad_scale: 16.0 2023-10-13 01:34:37,460 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.84 vs. 
limit=6.0 2023-10-13 01:34:37,508 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.49 vs. limit=6.0 2023-10-13 01:34:38,554 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=1241081.3333333333, ans=0.025 2023-10-13 01:34:53,799 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1241128.0, ans=0.0 2023-10-13 01:35:14,373 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1241221.3333333333, ans=0.125 2023-10-13 01:35:23,651 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.28 vs. limit=6.0 2023-10-13 01:35:27,387 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 01:35:27,484 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1241268.0, ans=0.05 2023-10-13 01:35:55,755 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1241361.3333333333, ans=0.125 2023-10-13 01:35:55,853 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1241361.3333333333, ans=0.05 2023-10-13 01:36:02,590 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.42 vs. limit=15.0 2023-10-13 01:36:11,096 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 01:36:13,986 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1241408.0, ans=0.125 2023-10-13 01:36:20,780 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1241454.6666666667, ans=0.125 2023-10-13 01:36:26,179 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.82 vs. limit=15.0 2023-10-13 01:36:31,575 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1241501.3333333333, ans=0.0 2023-10-13 01:36:35,093 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.507e+02 1.776e+02 1.978e+02 2.232e+02 2.664e+02, threshold=3.957e+02, percent-clipped=0.0 2023-10-13 01:36:35,438 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1241501.3333333333, ans=0.0 2023-10-13 01:36:38,232 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.41 vs. 
limit=22.5 2023-10-13 01:36:54,625 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1241594.6666666667, ans=0.0 2023-10-13 01:37:16,098 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1241641.3333333333, ans=0.035 2023-10-13 01:37:44,767 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1241734.6666666667, ans=0.125 2023-10-13 01:37:45,614 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1241781.3333333333, ans=0.125 2023-10-13 01:37:47,973 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1241781.3333333333, ans=0.125 2023-10-13 01:38:13,307 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=9.74 vs. limit=22.5 2023-10-13 01:38:45,856 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.463e+02 1.751e+02 1.928e+02 2.102e+02 3.317e+02, threshold=3.857e+02, percent-clipped=0.0 2023-10-13 01:39:07,337 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.45 vs. limit=15.0 2023-10-13 01:39:15,449 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1242108.0, ans=0.125 2023-10-13 01:39:18,020 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1242108.0, ans=0.125 2023-10-13 01:39:26,977 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1242154.6666666667, ans=0.1 2023-10-13 01:39:41,508 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1242201.3333333333, ans=0.125 2023-10-13 01:40:04,501 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1242294.6666666667, ans=0.125 2023-10-13 01:40:08,346 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1242294.6666666667, ans=0.0 2023-10-13 01:40:09,256 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1242294.6666666667, ans=0.1 2023-10-13 01:40:17,884 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1242341.3333333333, ans=0.125 2023-10-13 01:40:19,089 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1242341.3333333333, ans=0.0 2023-10-13 01:40:23,730 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1242341.3333333333, ans=0.125 2023-10-13 01:40:41,049 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1242388.0, ans=0.125 2023-10-13 01:40:52,088 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.412e+02 1.682e+02 1.832e+02 2.087e+02 3.465e+02, threshold=3.663e+02, percent-clipped=0.0 
2023-10-13 01:40:56,723 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1242434.6666666667, ans=0.125 2023-10-13 01:41:22,981 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.whiten.whitening_limit, batch_count=1242528.0, ans=12.0 2023-10-13 01:41:35,807 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1242574.6666666667, ans=0.0 2023-10-13 01:41:37,643 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1242574.6666666667, ans=0.0 2023-10-13 01:41:59,919 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1242668.0, ans=0.125 2023-10-13 01:42:04,589 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1242668.0, ans=0.0 2023-10-13 01:42:20,680 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 01:42:24,161 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1242714.6666666667, ans=0.2 2023-10-13 01:42:43,106 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1242808.0, ans=0.0 2023-10-13 01:42:51,603 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1242808.0, ans=0.2 2023-10-13 01:43:13,265 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.487e+02 1.709e+02 1.916e+02 2.092e+02 3.062e+02, threshold=3.831e+02, percent-clipped=0.0 2023-10-13 01:43:39,887 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1242994.6666666667, ans=0.2 2023-10-13 01:43:50,865 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1243041.3333333333, ans=0.0 2023-10-13 01:44:25,048 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1243181.3333333333, ans=0.125 2023-10-13 01:44:35,612 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.44 vs. limit=15.0 2023-10-13 01:44:52,603 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1243274.6666666667, ans=10.0 2023-10-13 01:44:56,209 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1243274.6666666667, ans=0.125 2023-10-13 01:45:16,066 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-13 01:45:19,205 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.561e+02 1.860e+02 2.126e+02 2.378e+02 2.890e+02, threshold=4.252e+02, percent-clipped=0.0 2023-10-13 01:45:24,799 INFO [train.py:1031] (3/4) Epoch 20, batch 7000, loss[loss=0.1784, simple_loss=0.2703, pruned_loss=0.04321, over 16820.00 frames. ], tot_loss[loss=0.1911, simple_loss=0.2825, pruned_loss=0.04988, over 31812508.26 frames. 
], batch size: 72, lr: 1.73e-03, grad_scale: 32.0 2023-10-13 01:45:28,040 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1243414.6666666667, ans=0.125 2023-10-13 01:45:47,767 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=1243461.3333333333, ans=0.025 2023-10-13 01:45:48,068 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.80 vs. limit=15.0 2023-10-13 01:46:07,361 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.96 vs. limit=22.5 2023-10-13 01:46:10,834 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.79 vs. limit=15.0 2023-10-13 01:46:17,324 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1243601.3333333333, ans=0.0 2023-10-13 01:46:34,206 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.08 vs. limit=22.5 2023-10-13 01:47:11,706 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1243788.0, ans=0.0 2023-10-13 01:47:14,659 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1243788.0, ans=0.125 2023-10-13 01:47:16,568 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.63 vs. 
limit=15.0 2023-10-13 01:47:22,407 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.439e+02 1.773e+02 1.905e+02 2.110e+02 2.844e+02, threshold=3.810e+02, percent-clipped=0.0 2023-10-13 01:47:44,644 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1243928.0, ans=0.125 2023-10-13 01:48:43,334 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1244161.3333333333, ans=0.05 2023-10-13 01:48:50,227 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1244208.0, ans=0.125 2023-10-13 01:49:22,345 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.365e+02 1.831e+02 2.004e+02 2.193e+02 3.259e+02, threshold=4.009e+02, percent-clipped=0.0 2023-10-13 01:49:40,751 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1244348.0, ans=0.2 2023-10-13 01:49:50,284 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1244394.6666666667, ans=0.0 2023-10-13 01:50:03,940 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1244394.6666666667, ans=0.0 2023-10-13 01:50:19,972 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=1244488.0, ans=22.5 2023-10-13 01:50:22,443 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1244488.0, ans=0.0 2023-10-13 01:50:25,648 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1244488.0, ans=0.015 2023-10-13 01:50:44,013 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1244534.6666666667, ans=0.125 2023-10-13 01:50:46,073 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 01:51:05,526 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1244628.0, ans=0.0 2023-10-13 01:51:38,611 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 01:51:40,623 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1244768.0, ans=0.04949747468305833 2023-10-13 01:51:46,057 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.351e+02 1.699e+02 1.864e+02 2.091e+02 3.274e+02, threshold=3.729e+02, percent-clipped=0.0 2023-10-13 01:51:46,297 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1244768.0, ans=0.0 2023-10-13 01:51:56,032 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1244814.6666666667, ans=0.2 2023-10-13 01:52:02,262 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1244814.6666666667, ans=0.0 2023-10-13 01:52:11,612 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.58 vs. 
limit=15.0 2023-10-13 01:52:15,059 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1244861.3333333333, ans=0.125 2023-10-13 01:52:49,610 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1245001.3333333333, ans=0.125 2023-10-13 01:52:49,981 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.71 vs. limit=15.0 2023-10-13 01:53:04,087 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1245048.0, ans=0.0 2023-10-13 01:53:04,147 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1245048.0, ans=0.1 2023-10-13 01:53:04,234 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1245048.0, ans=0.125 2023-10-13 01:53:05,122 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1245048.0, ans=0.0 2023-10-13 01:53:27,989 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1245141.3333333333, ans=0.125 2023-10-13 01:53:38,485 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.97 vs. limit=22.5 2023-10-13 01:53:53,651 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.38 vs. limit=15.0 2023-10-13 01:53:57,797 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.432e+02 1.737e+02 1.886e+02 2.129e+02 2.813e+02, threshold=3.771e+02, percent-clipped=0.0 2023-10-13 01:53:58,381 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=1245234.6666666667, ans=10.0 2023-10-13 01:54:12,976 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1245281.3333333333, ans=0.1 2023-10-13 01:54:14,477 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.90 vs. limit=15.0 2023-10-13 01:54:20,713 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1245328.0, ans=0.0 2023-10-13 01:54:35,229 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.01 vs. 
limit=15.0 2023-10-13 01:54:37,019 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1245374.6666666667, ans=0.125 2023-10-13 01:54:38,994 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1245421.3333333333, ans=0.125 2023-10-13 01:54:57,131 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1245468.0, ans=0.2 2023-10-13 01:55:14,605 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1245561.3333333333, ans=0.125 2023-10-13 01:55:14,638 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1245561.3333333333, ans=0.125 2023-10-13 01:55:16,832 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.98 vs. limit=22.5 2023-10-13 01:55:24,181 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1245561.3333333333, ans=0.125 2023-10-13 01:55:27,456 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.69 vs. limit=15.0 2023-10-13 01:55:28,098 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1245608.0, ans=0.125 2023-10-13 01:55:40,163 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1245654.6666666667, ans=0.0 2023-10-13 01:55:49,456 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1245654.6666666667, ans=0.07 2023-10-13 01:55:59,694 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.522e+02 1.791e+02 1.965e+02 2.184e+02 2.907e+02, threshold=3.931e+02, percent-clipped=0.0 2023-10-13 01:56:04,494 INFO [train.py:1031] (3/4) Epoch 20, batch 7500, loss[loss=0.1929, simple_loss=0.2875, pruned_loss=0.0491, over 16823.00 frames. ], tot_loss[loss=0.1916, simple_loss=0.2828, pruned_loss=0.05019, over 32024252.21 frames. ], batch size: 188, lr: 1.73e-03, grad_scale: 16.0 2023-10-13 01:56:05,098 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.57 vs. 
limit=12.0 2023-10-13 01:56:07,961 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1245748.0, ans=0.0 2023-10-13 01:56:12,651 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1245748.0, ans=0.1 2023-10-13 01:56:30,589 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1245841.3333333333, ans=0.125 2023-10-13 01:56:35,944 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1245841.3333333333, ans=0.125 2023-10-13 01:56:57,090 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1245934.6666666667, ans=0.125 2023-10-13 01:58:00,507 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1246168.0, ans=0.125 2023-10-13 01:58:02,136 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.491e+02 1.672e+02 1.852e+02 2.071e+02 2.685e+02, threshold=3.704e+02, percent-clipped=0.0 2023-10-13 01:58:02,628 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1246168.0, ans=0.125 2023-10-13 01:58:15,067 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1246214.6666666667, ans=0.2 2023-10-13 01:58:33,340 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1246308.0, ans=0.125 2023-10-13 01:58:49,585 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1246354.6666666667, ans=0.125 2023-10-13 01:59:13,556 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1246401.3333333333, ans=0.125 2023-10-13 01:59:33,339 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1246494.6666666667, ans=0.2 2023-10-13 01:59:35,829 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1246494.6666666667, ans=0.125 2023-10-13 01:59:50,177 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1246541.3333333333, ans=0.125 2023-10-13 01:59:53,076 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1246588.0, ans=0.5 2023-10-13 01:59:57,849 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1246588.0, ans=0.125 2023-10-13 02:00:14,193 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.529e+02 1.762e+02 1.948e+02 2.201e+02 2.777e+02, threshold=3.895e+02, percent-clipped=0.0 2023-10-13 02:00:16,584 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1246681.3333333333, ans=0.125 2023-10-13 02:00:22,257 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.97 vs. 
limit=15.0 2023-10-13 02:00:48,819 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.50 vs. limit=10.0 2023-10-13 02:00:56,956 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1246821.3333333333, ans=0.125 2023-10-13 02:01:42,836 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1247008.0, ans=0.125 2023-10-13 02:01:50,783 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1247008.0, ans=0.0 2023-10-13 02:02:01,524 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1247054.6666666667, ans=0.0 2023-10-13 02:02:04,069 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1247054.6666666667, ans=0.1 2023-10-13 02:02:14,882 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1247101.3333333333, ans=0.0 2023-10-13 02:02:18,118 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.526e+02 1.818e+02 1.994e+02 2.287e+02 3.379e+02, threshold=3.988e+02, percent-clipped=0.0 2023-10-13 02:02:20,660 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1247148.0, ans=0.125 2023-10-13 02:02:34,512 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.33 vs. limit=15.0 2023-10-13 02:02:45,093 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.49 vs. limit=15.0 2023-10-13 02:02:46,486 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1247194.6666666667, ans=0.0 2023-10-13 02:03:34,579 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1247334.6666666667, ans=0.125 2023-10-13 02:03:36,762 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1247381.3333333333, ans=0.125 2023-10-13 02:03:47,750 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 02:03:48,916 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.13 vs. 
limit=15.0 2023-10-13 02:03:49,721 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1247428.0, ans=0.125 2023-10-13 02:03:52,687 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1247428.0, ans=0.1 2023-10-13 02:04:02,310 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1247474.6666666667, ans=10.0 2023-10-13 02:04:15,529 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1247521.3333333333, ans=0.07 2023-10-13 02:04:33,234 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.389e+02 1.765e+02 1.894e+02 2.122e+02 3.118e+02, threshold=3.787e+02, percent-clipped=0.0 2023-10-13 02:04:48,734 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1247614.6666666667, ans=0.125 2023-10-13 02:04:58,820 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1247661.3333333333, ans=0.0 2023-10-13 02:04:59,731 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=1247661.3333333333, ans=0.025 2023-10-13 02:05:01,021 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1247708.0, ans=0.0 2023-10-13 02:05:17,013 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1247754.6666666667, ans=0.07 2023-10-13 02:05:22,337 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=5.94 vs. limit=15.0 2023-10-13 02:06:02,120 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1247894.6666666667, ans=0.0 2023-10-13 02:06:15,902 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1247941.3333333333, ans=0.125 2023-10-13 02:06:22,091 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1247988.0, ans=0.125 2023-10-13 02:06:31,201 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.36 vs. 
limit=6.0 2023-10-13 02:06:33,051 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1248034.6666666667, ans=0.125 2023-10-13 02:06:39,868 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1248034.6666666667, ans=0.125 2023-10-13 02:06:39,882 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1248034.6666666667, ans=0.0 2023-10-13 02:06:40,909 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1248034.6666666667, ans=0.0 2023-10-13 02:06:41,421 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.331e+02 1.664e+02 1.829e+02 2.080e+02 2.499e+02, threshold=3.659e+02, percent-clipped=0.0 2023-10-13 02:06:44,576 INFO [train.py:1031] (3/4) Epoch 20, batch 8000, loss[loss=0.1926, simple_loss=0.2818, pruned_loss=0.05173, over 16392.00 frames. ], tot_loss[loss=0.1907, simple_loss=0.2821, pruned_loss=0.04962, over 32188567.83 frames. ], batch size: 50, lr: 1.73e-03, grad_scale: 32.0 2023-10-13 02:06:50,124 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1248081.3333333333, ans=0.125 2023-10-13 02:06:58,904 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1248128.0, ans=0.1 2023-10-13 02:07:03,508 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1248128.0, ans=0.125 2023-10-13 02:07:04,478 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 02:07:10,934 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1248174.6666666667, ans=0.0 2023-10-13 02:07:23,500 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.25 vs. limit=15.0 2023-10-13 02:07:54,634 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1248361.3333333333, ans=0.0 2023-10-13 02:08:00,329 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.40 vs. 
limit=22.5 2023-10-13 02:08:01,922 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1248408.0, ans=0.0 2023-10-13 02:08:02,005 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=1248408.0, ans=0.025 2023-10-13 02:08:18,825 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1248454.6666666667, ans=0.1 2023-10-13 02:08:32,993 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1248501.3333333333, ans=0.125 2023-10-13 02:08:33,469 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.470e+02 1.781e+02 2.006e+02 2.355e+02 3.461e+02, threshold=4.012e+02, percent-clipped=0.0 2023-10-13 02:08:33,678 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1248501.3333333333, ans=0.125 2023-10-13 02:08:48,151 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1248594.6666666667, ans=0.125 2023-10-13 02:08:50,494 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.02 vs. limit=12.0 2023-10-13 02:09:11,554 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1248688.0, ans=0.0 2023-10-13 02:09:13,808 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 02:09:35,636 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1248734.6666666667, ans=0.125 2023-10-13 02:10:24,718 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff2.min_abs, batch_count=1248874.6666666667, ans=0.1 2023-10-13 02:10:43,702 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1248921.3333333333, ans=0.125 2023-10-13 02:10:50,285 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1248968.0, ans=0.1 2023-10-13 02:10:50,704 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=4.02 vs. limit=12.0 2023-10-13 02:10:55,330 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.548e+02 1.706e+02 1.874e+02 2.107e+02 3.729e+02, threshold=3.747e+02, percent-clipped=0.0 2023-10-13 02:10:59,298 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.19 vs. 
limit=22.5 2023-10-13 02:11:04,293 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 02:11:25,005 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 02:11:25,127 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1249108.0, ans=0.0 2023-10-13 02:11:26,983 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1249108.0, ans=0.0 2023-10-13 02:11:43,489 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1249154.6666666667, ans=0.125 2023-10-13 02:11:43,626 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1249154.6666666667, ans=0.04949747468305833 2023-10-13 02:11:49,895 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1249201.3333333333, ans=0.0 2023-10-13 02:11:50,800 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1249201.3333333333, ans=0.0 2023-10-13 02:12:01,754 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1249248.0, ans=0.2 2023-10-13 02:12:38,046 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1249388.0, ans=10.0 2023-10-13 02:12:40,779 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1249388.0, ans=0.2 2023-10-13 02:12:50,372 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1249434.6666666667, ans=0.2 2023-10-13 02:13:01,052 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.388e+02 1.741e+02 1.945e+02 2.141e+02 3.420e+02, threshold=3.890e+02, percent-clipped=0.0 2023-10-13 02:13:01,534 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1249434.6666666667, ans=0.125 2023-10-13 02:13:08,520 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1249481.3333333333, ans=0.125 2023-10-13 02:13:57,349 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1249668.0, ans=0.0 2023-10-13 02:14:01,723 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1249714.6666666667, ans=0.2 2023-10-13 02:14:01,770 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1249714.6666666667, ans=0.1 2023-10-13 02:14:11,015 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1249714.6666666667, ans=0.125 2023-10-13 02:14:13,350 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 02:14:13,622 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=11.29 vs. 
limit=15.0 2023-10-13 02:14:50,126 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1249901.3333333333, ans=0.0 2023-10-13 02:14:58,776 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.475e+02 1.781e+02 1.902e+02 2.167e+02 3.193e+02, threshold=3.803e+02, percent-clipped=0.0 2023-10-13 02:16:09,004 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1250181.3333333333, ans=0.0 2023-10-13 02:16:12,004 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1250181.3333333333, ans=0.125 2023-10-13 02:16:14,255 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1250181.3333333333, ans=0.125 2023-10-13 02:16:41,169 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1250274.6666666667, ans=0.0 2023-10-13 02:16:46,595 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.83 vs. limit=15.0 2023-10-13 02:16:55,846 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1250321.3333333333, ans=0.125 2023-10-13 02:17:01,354 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1250368.0, ans=0.125 2023-10-13 02:17:09,983 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1250368.0, ans=0.015 2023-10-13 02:17:13,368 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.496e+02 1.729e+02 1.880e+02 2.122e+02 2.886e+02, threshold=3.760e+02, percent-clipped=0.0 2023-10-13 02:17:14,485 INFO [train.py:1031] (3/4) Epoch 20, batch 8500, loss[loss=0.2025, simple_loss=0.2931, pruned_loss=0.05602, over 16896.00 frames. ], tot_loss[loss=0.1904, simple_loss=0.2821, pruned_loss=0.04936, over 32319139.12 frames. ], batch size: 130, lr: 1.73e-03, grad_scale: 16.0 2023-10-13 02:17:23,320 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1250414.6666666667, ans=0.1 2023-10-13 02:17:35,614 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=14.53 vs. 
limit=22.5 2023-10-13 02:18:00,161 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1250554.6666666667, ans=0.125 2023-10-13 02:18:32,512 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1250694.6666666667, ans=0.125 2023-10-13 02:18:53,596 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1250741.3333333333, ans=0.125 2023-10-13 02:19:02,440 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1250788.0, ans=0.125 2023-10-13 02:19:18,361 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1250834.6666666667, ans=0.2 2023-10-13 02:19:23,952 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.519e+02 1.818e+02 2.012e+02 2.364e+02 3.735e+02, threshold=4.024e+02, percent-clipped=0.0 2023-10-13 02:19:42,871 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1250928.0, ans=0.0 2023-10-13 02:19:53,344 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1250974.6666666667, ans=0.0 2023-10-13 02:20:14,213 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.29 vs. limit=15.0 2023-10-13 02:20:49,391 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1251161.3333333333, ans=0.0 2023-10-13 02:21:05,753 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1251208.0, ans=0.0 2023-10-13 02:21:27,269 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.27 vs. limit=15.0 2023-10-13 02:21:29,358 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1251301.3333333333, ans=0.0 2023-10-13 02:21:37,930 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1251348.0, ans=0.035 2023-10-13 02:21:38,568 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.372e+02 1.737e+02 1.932e+02 2.242e+02 3.015e+02, threshold=3.864e+02, percent-clipped=0.0 2023-10-13 02:22:04,909 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.87 vs. 
limit=10.0 2023-10-13 02:22:25,175 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1251488.0, ans=0.125 2023-10-13 02:22:27,698 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1251488.0, ans=0.2 2023-10-13 02:22:45,411 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1251581.3333333333, ans=0.2 2023-10-13 02:22:45,527 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1251581.3333333333, ans=0.0 2023-10-13 02:23:10,314 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1251628.0, ans=0.2 2023-10-13 02:23:23,402 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1251674.6666666667, ans=0.0 2023-10-13 02:23:27,872 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1251674.6666666667, ans=0.125 2023-10-13 02:23:55,564 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.458e+02 1.716e+02 1.886e+02 2.189e+02 3.700e+02, threshold=3.771e+02, percent-clipped=0.0 2023-10-13 02:24:01,389 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1251814.6666666667, ans=0.125 2023-10-13 02:24:57,276 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1252001.3333333333, ans=0.2 2023-10-13 02:24:59,875 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1252048.0, ans=0.0 2023-10-13 02:25:13,221 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1252094.6666666667, ans=0.125 2023-10-13 02:25:14,000 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1252094.6666666667, ans=0.1 2023-10-13 02:25:37,768 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1252188.0, ans=0.1 2023-10-13 02:25:37,866 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1252188.0, ans=0.125 2023-10-13 02:25:45,367 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.18 vs. limit=15.0 2023-10-13 02:25:48,073 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.69 vs. 
limit=6.0 2023-10-13 02:25:59,077 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.795e+02 1.967e+02 2.153e+02 3.302e+02, threshold=3.934e+02, percent-clipped=0.0 2023-10-13 02:26:46,092 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1252468.0, ans=0.125 2023-10-13 02:26:49,027 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1252468.0, ans=0.0 2023-10-13 02:26:52,281 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.38 vs. limit=10.0 2023-10-13 02:27:06,859 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1252561.3333333333, ans=0.09899494936611666 2023-10-13 02:27:18,527 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1252608.0, ans=0.0 2023-10-13 02:27:26,182 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=1252608.0, ans=10.0 2023-10-13 02:27:34,206 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1252654.6666666667, ans=0.125 2023-10-13 02:27:54,955 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.495e+02 1.818e+02 1.957e+02 2.206e+02 3.756e+02, threshold=3.913e+02, percent-clipped=0.0 2023-10-13 02:27:54,993 INFO [train.py:1031] (3/4) Epoch 20, batch 9000, loss[loss=0.2002, simple_loss=0.2918, pruned_loss=0.0543, over 17035.00 frames. ], tot_loss[loss=0.1898, simple_loss=0.2815, pruned_loss=0.04909, over 32429541.58 frames. ], batch size: 77, lr: 1.72e-03, grad_scale: 16.0 2023-10-13 02:28:01,042 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.86 vs. limit=6.0 2023-10-13 02:28:09,297 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.14 vs. 
limit=15.0 2023-10-13 02:28:32,184 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1252888.0, ans=0.2 2023-10-13 02:29:05,067 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1253028.0, ans=0.1 2023-10-13 02:29:07,116 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1253028.0, ans=0.125 2023-10-13 02:29:10,658 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1253028.0, ans=0.0 2023-10-13 02:29:22,723 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1253074.6666666667, ans=0.125 2023-10-13 02:29:36,777 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1253121.3333333333, ans=0.125 2023-10-13 02:29:52,510 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.509e+02 1.755e+02 1.935e+02 2.151e+02 3.003e+02, threshold=3.870e+02, percent-clipped=0.0 2023-10-13 02:30:07,503 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1253261.3333333333, ans=0.0 2023-10-13 02:30:34,979 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1253354.6666666667, ans=0.125 2023-10-13 02:30:42,628 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1253401.3333333333, ans=0.0 2023-10-13 02:30:59,750 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1253448.0, ans=0.125 2023-10-13 02:31:07,789 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1253494.6666666667, ans=0.125 2023-10-13 02:31:38,961 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1253588.0, ans=0.2 2023-10-13 02:31:55,281 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.441e+02 1.809e+02 2.009e+02 2.263e+02 3.190e+02, threshold=4.017e+02, percent-clipped=0.0 2023-10-13 02:31:55,556 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1253681.3333333333, ans=0.125 2023-10-13 02:32:07,223 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1253728.0, ans=0.125 2023-10-13 02:32:15,795 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1253774.6666666667, ans=0.125 2023-10-13 02:32:16,734 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1253774.6666666667, ans=0.0 2023-10-13 02:32:20,861 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1253774.6666666667, ans=0.0 2023-10-13 02:32:31,287 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.93 vs. 
limit=22.5 2023-10-13 02:32:45,250 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1253868.0, ans=0.2 2023-10-13 02:32:55,713 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.00 vs. limit=15.0 2023-10-13 02:33:10,477 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1253961.3333333333, ans=0.125 2023-10-13 02:33:24,773 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1254008.0, ans=0.2 2023-10-13 02:33:32,343 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1254054.6666666667, ans=0.07 2023-10-13 02:33:52,947 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1254101.3333333333, ans=0.2 2023-10-13 02:33:56,171 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.454e+02 1.877e+02 2.108e+02 2.430e+02 3.193e+02, threshold=4.215e+02, percent-clipped=0.0 2023-10-13 02:34:02,954 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=1254148.0, ans=22.5 2023-10-13 02:34:05,955 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1254194.6666666667, ans=0.2 2023-10-13 02:34:05,980 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 02:34:17,532 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1254241.3333333333, ans=0.0 2023-10-13 02:34:19,015 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1254241.3333333333, ans=0.1 2023-10-13 02:34:58,747 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1254381.3333333333, ans=0.125 2023-10-13 02:35:11,054 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1254428.0, ans=0.125 2023-10-13 02:35:22,305 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1254428.0, ans=0.1 2023-10-13 02:35:22,426 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=1254428.0, ans=0.07 2023-10-13 02:35:29,923 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.94 vs. limit=10.0 2023-10-13 02:35:31,984 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1254474.6666666667, ans=0.0 2023-10-13 02:35:45,116 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.95 vs. limit=15.0 2023-10-13 02:35:52,322 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.82 vs. 
limit=22.5 2023-10-13 02:36:05,799 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1254568.0, ans=10.0 2023-10-13 02:36:10,308 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=5.10 vs. limit=12.0 2023-10-13 02:36:11,649 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.542e+02 1.796e+02 1.954e+02 2.171e+02 2.877e+02, threshold=3.907e+02, percent-clipped=0.0 2023-10-13 02:36:54,095 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1254754.6666666667, ans=0.0 2023-10-13 02:36:57,839 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1254754.6666666667, ans=0.0 2023-10-13 02:37:08,930 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1254754.6666666667, ans=0.1 2023-10-13 02:37:14,698 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1254801.3333333333, ans=0.125 2023-10-13 02:37:50,163 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1254941.3333333333, ans=0.125 2023-10-13 02:38:02,167 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1254941.3333333333, ans=0.125 2023-10-13 02:38:27,782 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1255034.6666666667, ans=0.0 2023-10-13 02:38:28,774 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1255034.6666666667, ans=0.125 2023-10-13 02:38:33,074 INFO [train.py:1031] (3/4) Epoch 20, batch 9500, loss[loss=0.1657, simple_loss=0.2533, pruned_loss=0.03902, over 16598.00 frames. ], tot_loss[loss=0.1903, simple_loss=0.282, pruned_loss=0.04932, over 32491554.16 frames. ], batch size: 66, lr: 1.72e-03, grad_scale: 16.0 2023-10-13 02:38:34,750 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.448e+02 1.781e+02 1.951e+02 2.224e+02 2.931e+02, threshold=3.902e+02, percent-clipped=0.0 2023-10-13 02:39:29,150 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.53 vs. 
limit=15.0 2023-10-13 02:39:51,998 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1255361.3333333333, ans=0.0 2023-10-13 02:40:18,818 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1255454.6666666667, ans=0.0 2023-10-13 02:40:37,228 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1255501.3333333333, ans=0.125 2023-10-13 02:40:43,385 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1255548.0, ans=0.1 2023-10-13 02:40:43,972 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.440e+02 1.736e+02 1.942e+02 2.299e+02 5.234e+02, threshold=3.884e+02, percent-clipped=2.0 2023-10-13 02:41:17,400 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1255641.3333333333, ans=0.1 2023-10-13 02:41:27,617 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff3.min_abs, batch_count=1255688.0, ans=0.2 2023-10-13 02:41:27,639 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1255688.0, ans=0.125 2023-10-13 02:41:49,117 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 02:41:52,005 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1255781.3333333333, ans=0.125 2023-10-13 02:42:23,584 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.96 vs. 
limit=6.0 2023-10-13 02:42:46,411 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.459e+02 1.735e+02 1.873e+02 2.068e+02 2.626e+02, threshold=3.746e+02, percent-clipped=0.0 2023-10-13 02:42:47,808 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1256014.6666666667, ans=0.0 2023-10-13 02:43:24,289 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1256154.6666666667, ans=0.125 2023-10-13 02:43:37,287 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1256201.3333333333, ans=0.04949747468305833 2023-10-13 02:43:49,426 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1256248.0, ans=0.0 2023-10-13 02:43:49,492 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1256248.0, ans=0.2 2023-10-13 02:44:03,658 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1256294.6666666667, ans=0.95 2023-10-13 02:44:17,395 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1256341.3333333333, ans=0.125 2023-10-13 02:44:51,912 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1256434.6666666667, ans=0.125 2023-10-13 02:44:59,381 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.464e+02 1.721e+02 1.903e+02 2.089e+02 2.973e+02, threshold=3.806e+02, percent-clipped=0.0 2023-10-13 02:45:17,878 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.43 vs. limit=15.0 2023-10-13 02:45:22,112 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.77 vs. limit=10.0 2023-10-13 02:45:28,446 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.44 vs. limit=6.0 2023-10-13 02:45:56,155 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1256668.0, ans=0.125 2023-10-13 02:46:04,548 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1256714.6666666667, ans=0.2 2023-10-13 02:46:16,361 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1256714.6666666667, ans=0.125 2023-10-13 02:46:29,550 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.94 vs. 
limit=22.5 2023-10-13 02:46:41,060 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1256808.0, ans=0.0 2023-10-13 02:47:03,115 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1256901.3333333333, ans=0.05 2023-10-13 02:47:14,094 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.449e+02 1.701e+02 1.818e+02 1.988e+02 2.797e+02, threshold=3.636e+02, percent-clipped=0.0 2023-10-13 02:47:50,649 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1257041.3333333333, ans=0.0 2023-10-13 02:48:00,855 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1257088.0, ans=0.125 2023-10-13 02:48:15,099 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1257134.6666666667, ans=0.0 2023-10-13 02:48:21,194 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1257181.3333333333, ans=0.125 2023-10-13 02:48:38,553 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1257228.0, ans=0.125 2023-10-13 02:48:40,717 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1257228.0, ans=0.125 2023-10-13 02:48:50,073 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1257274.6666666667, ans=0.09899494936611666 2023-10-13 02:49:08,488 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1257368.0, ans=0.1 2023-10-13 02:49:09,878 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.81 vs. limit=6.0 2023-10-13 02:49:10,631 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1257368.0, ans=0.2 2023-10-13 02:49:22,227 INFO [train.py:1031] (3/4) Epoch 20, batch 10000, loss[loss=0.1954, simple_loss=0.2846, pruned_loss=0.05312, over 16950.00 frames. ], tot_loss[loss=0.1896, simple_loss=0.2812, pruned_loss=0.04901, over 32570629.80 frames. 
], batch size: 123, lr: 1.72e-03, grad_scale: 32.0 2023-10-13 02:49:24,293 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.443e+02 1.758e+02 2.009e+02 2.264e+02 3.112e+02, threshold=4.017e+02, percent-clipped=0.0 2023-10-13 02:49:33,808 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1257414.6666666667, ans=0.125 2023-10-13 02:49:36,333 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1257461.3333333333, ans=0.125 2023-10-13 02:49:53,517 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1257508.0, ans=0.2 2023-10-13 02:50:03,928 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1257554.6666666667, ans=0.125 2023-10-13 02:50:04,917 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1257554.6666666667, ans=0.0 2023-10-13 02:50:10,440 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.51 vs. limit=15.0 2023-10-13 02:50:26,352 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1257601.3333333333, ans=0.0 2023-10-13 02:50:53,477 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1257694.6666666667, ans=0.2 2023-10-13 02:51:08,018 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1257741.3333333333, ans=0.125 2023-10-13 02:51:08,394 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.30 vs. limit=12.0 2023-10-13 02:51:24,415 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1257788.0, ans=0.2 2023-10-13 02:51:38,727 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.42 vs. 
limit=22.5 2023-10-13 02:51:41,247 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1257881.3333333333, ans=0.125 2023-10-13 02:51:44,444 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.426e+02 1.775e+02 1.903e+02 2.144e+02 4.038e+02, threshold=3.807e+02, percent-clipped=1.0 2023-10-13 02:51:46,767 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1257881.3333333333, ans=0.125 2023-10-13 02:51:47,657 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1257881.3333333333, ans=0.125 2023-10-13 02:52:22,475 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1257974.6666666667, ans=0.0 2023-10-13 02:52:39,014 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1258021.3333333333, ans=0.0 2023-10-13 02:52:50,110 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1258068.0, ans=0.125 2023-10-13 02:52:50,458 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.94 vs. limit=15.0 2023-10-13 02:52:57,356 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1258068.0, ans=0.0 2023-10-13 02:53:05,741 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1258114.6666666667, ans=0.1 2023-10-13 02:53:13,354 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1258114.6666666667, ans=0.0 2023-10-13 02:53:24,854 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 02:53:28,694 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1258161.3333333333, ans=0.04949747468305833 2023-10-13 02:53:39,587 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1258208.0, ans=0.125 2023-10-13 02:53:45,402 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 02:53:46,407 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=1258208.0, ans=10.0 2023-10-13 02:54:02,476 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1258301.3333333333, ans=0.0 2023-10-13 02:54:02,911 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.62 vs. 
limit=15.0 2023-10-13 02:54:06,103 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1258301.3333333333, ans=0.0 2023-10-13 02:54:08,871 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1258301.3333333333, ans=0.0 2023-10-13 02:54:19,747 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.356e+02 1.740e+02 1.872e+02 2.130e+02 2.834e+02, threshold=3.744e+02, percent-clipped=0.0 2023-10-13 02:54:34,337 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1258394.6666666667, ans=0.0 2023-10-13 02:54:44,592 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1258394.6666666667, ans=0.125 2023-10-13 02:54:51,724 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1258441.3333333333, ans=0.125 2023-10-13 02:55:07,326 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1258488.0, ans=0.2 2023-10-13 02:55:15,750 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1258534.6666666667, ans=0.0 2023-10-13 02:55:36,070 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 02:55:43,251 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1258628.0, ans=0.0 2023-10-13 02:56:02,208 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1258674.6666666667, ans=0.0 2023-10-13 02:56:03,540 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1258674.6666666667, ans=0.125 2023-10-13 02:56:10,730 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1258721.3333333333, ans=0.125 2023-10-13 02:56:25,510 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1258768.0, ans=0.125 2023-10-13 02:56:42,518 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.498e+02 1.807e+02 1.991e+02 2.225e+02 3.102e+02, threshold=3.983e+02, percent-clipped=0.0 2023-10-13 02:57:19,826 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1258908.0, ans=0.125 2023-10-13 02:58:35,911 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1259188.0, ans=0.2 2023-10-13 02:58:36,210 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.82 vs. 
limit=15.0 2023-10-13 02:58:40,959 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1259188.0, ans=0.125 2023-10-13 02:58:48,610 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1259234.6666666667, ans=0.2 2023-10-13 02:58:54,363 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.50 vs. limit=12.0 2023-10-13 02:59:07,093 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.452e+02 1.780e+02 1.945e+02 2.203e+02 3.487e+02, threshold=3.890e+02, percent-clipped=0.0 2023-10-13 02:59:26,495 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1259328.0, ans=0.0 2023-10-13 02:59:32,347 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1259374.6666666667, ans=0.0 2023-10-13 02:59:36,297 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1259374.6666666667, ans=0.0 2023-10-13 02:59:37,440 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1259374.6666666667, ans=0.0 2023-10-13 02:59:37,467 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1259374.6666666667, ans=0.2 2023-10-13 02:59:50,417 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1259421.3333333333, ans=0.125 2023-10-13 02:59:54,405 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1259421.3333333333, ans=0.125 2023-10-13 03:00:33,922 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.42 vs. limit=22.5 2023-10-13 03:00:47,888 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1259608.0, ans=0.0 2023-10-13 03:00:58,755 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1259608.0, ans=0.125 2023-10-13 03:01:29,615 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1259701.3333333333, ans=0.0 2023-10-13 03:01:34,784 INFO [train.py:1031] (3/4) Epoch 20, batch 10500, loss[loss=0.1937, simple_loss=0.2781, pruned_loss=0.0546, over 16180.00 frames. ], tot_loss[loss=0.19, simple_loss=0.2817, pruned_loss=0.04916, over 32635618.81 frames. ], batch size: 44, lr: 1.72e-03, grad_scale: 16.0 2023-10-13 03:01:37,851 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.379e+02 1.735e+02 1.885e+02 2.110e+02 3.035e+02, threshold=3.769e+02, percent-clipped=0.0 2023-10-13 03:01:42,027 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.97 vs. limit=15.0 2023-10-13 03:01:43,836 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.89 vs. 
limit=15.0 2023-10-13 03:02:18,897 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1259888.0, ans=0.125 2023-10-13 03:02:28,208 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1259934.6666666667, ans=0.0 2023-10-13 03:02:29,294 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1259934.6666666667, ans=0.0 2023-10-13 03:02:32,134 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1259934.6666666667, ans=0.125 2023-10-13 03:02:47,810 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff2.min_abs, batch_count=1259981.3333333333, ans=0.1 2023-10-13 03:02:48,972 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1259981.3333333333, ans=0.1 2023-10-13 03:03:31,347 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=3.83 vs. limit=10.0 2023-10-13 03:03:33,320 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1260121.3333333333, ans=0.125 2023-10-13 03:04:08,005 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.458e+02 1.717e+02 1.874e+02 2.102e+02 2.853e+02, threshold=3.749e+02, percent-clipped=0.0 2023-10-13 03:04:23,295 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1260261.3333333333, ans=0.0 2023-10-13 03:05:11,144 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1260401.3333333333, ans=0.2 2023-10-13 03:05:26,458 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1260448.0, ans=0.0 2023-10-13 03:05:53,974 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 03:06:35,117 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1260634.6666666667, ans=0.2 2023-10-13 03:06:53,839 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.416e+02 1.709e+02 1.859e+02 2.065e+02 3.114e+02, threshold=3.717e+02, percent-clipped=0.0 2023-10-13 03:07:01,930 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.89 vs. limit=15.0 2023-10-13 03:07:04,288 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.14 vs. limit=15.0 2023-10-13 03:07:17,191 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1260728.0, ans=0.125 2023-10-13 03:07:29,313 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.18 vs. 
limit=22.5 2023-10-13 03:07:49,653 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1260821.3333333333, ans=0.125 2023-10-13 03:07:52,918 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1260868.0, ans=0.125 2023-10-13 03:08:04,042 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1260868.0, ans=0.07 2023-10-13 03:08:30,350 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1260961.3333333333, ans=0.125 2023-10-13 03:08:32,332 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1260961.3333333333, ans=0.1 2023-10-13 03:08:39,775 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1260961.3333333333, ans=0.2 2023-10-13 03:08:42,611 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1261008.0, ans=0.07 2023-10-13 03:09:20,380 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1261101.3333333333, ans=0.0 2023-10-13 03:09:27,346 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.356e+02 1.764e+02 1.911e+02 2.103e+02 2.798e+02, threshold=3.822e+02, percent-clipped=0.0 2023-10-13 03:09:35,517 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1261148.0, ans=0.125 2023-10-13 03:10:30,979 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.22 vs. limit=12.0 2023-10-13 03:10:48,203 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1261428.0, ans=0.0 2023-10-13 03:10:57,320 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.09 vs. limit=15.0 2023-10-13 03:11:02,610 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1261474.6666666667, ans=0.0 2023-10-13 03:11:12,500 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.18 vs. limit=15.0 2023-10-13 03:11:20,028 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1261521.3333333333, ans=0.0 2023-10-13 03:11:31,925 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1261568.0, ans=0.125 2023-10-13 03:11:41,254 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1261568.0, ans=0.0 2023-10-13 03:11:54,904 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.278e+02 1.716e+02 1.869e+02 2.080e+02 2.725e+02, threshold=3.739e+02, percent-clipped=0.0 2023-10-13 03:12:21,152 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.27 vs. 
limit=15.0 2023-10-13 03:12:28,319 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1261708.0, ans=0.0 2023-10-13 03:12:50,898 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1261754.6666666667, ans=0.2 2023-10-13 03:13:10,605 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 03:13:11,063 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.39 vs. limit=22.5 2023-10-13 03:13:30,442 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1261894.6666666667, ans=0.1 2023-10-13 03:13:34,765 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.34 vs. limit=6.0 2023-10-13 03:13:35,323 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1261894.6666666667, ans=0.125 2023-10-13 03:13:44,840 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1261941.3333333333, ans=0.125 2023-10-13 03:13:50,664 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1261941.3333333333, ans=0.0 2023-10-13 03:14:04,264 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.60 vs. limit=15.0 2023-10-13 03:14:25,577 INFO [train.py:1031] (3/4) Epoch 20, batch 11000, loss[loss=0.2095, simple_loss=0.3024, pruned_loss=0.05825, over 16677.00 frames. ], tot_loss[loss=0.19, simple_loss=0.2817, pruned_loss=0.04916, over 32671575.50 frames. ], batch size: 202, lr: 1.72e-03, grad_scale: 32.0 2023-10-13 03:14:28,578 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.457e+02 1.773e+02 1.923e+02 2.227e+02 3.061e+02, threshold=3.846e+02, percent-clipped=0.0 2023-10-13 03:14:38,668 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1262128.0, ans=0.1 2023-10-13 03:14:47,001 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1262128.0, ans=10.0 2023-10-13 03:14:51,843 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.83 vs. limit=15.0 2023-10-13 03:15:08,274 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.32 vs. limit=22.5 2023-10-13 03:15:08,326 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.33 vs. 
limit=22.5 2023-10-13 03:15:09,775 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1262221.3333333333, ans=0.125 2023-10-13 03:15:27,689 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1262268.0, ans=0.125 2023-10-13 03:15:41,960 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.57 vs. limit=15.0 2023-10-13 03:15:51,431 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1262314.6666666667, ans=0.125 2023-10-13 03:16:04,615 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1262361.3333333333, ans=0.1 2023-10-13 03:16:54,837 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1262501.3333333333, ans=0.125 2023-10-13 03:17:04,497 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1262548.0, ans=0.0 2023-10-13 03:17:06,683 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.606e+02 1.789e+02 1.986e+02 2.277e+02 3.191e+02, threshold=3.972e+02, percent-clipped=0.0 2023-10-13 03:17:14,253 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1262548.0, ans=0.0 2023-10-13 03:17:35,523 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1262641.3333333333, ans=0.0 2023-10-13 03:17:38,934 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.85 vs. limit=12.0 2023-10-13 03:17:49,950 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.51 vs. 
limit=5.0 2023-10-13 03:18:02,737 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1262688.0, ans=0.1 2023-10-13 03:18:05,913 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1262734.6666666667, ans=0.0 2023-10-13 03:18:23,892 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1262781.3333333333, ans=0.1 2023-10-13 03:18:28,637 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1262781.3333333333, ans=0.1 2023-10-13 03:18:46,108 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1262828.0, ans=0.1 2023-10-13 03:18:46,196 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1262828.0, ans=0.125 2023-10-13 03:19:06,547 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1262874.6666666667, ans=0.015 2023-10-13 03:19:21,345 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1262921.3333333333, ans=0.1 2023-10-13 03:19:30,012 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1262921.3333333333, ans=0.125 2023-10-13 03:19:36,134 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1262968.0, ans=0.125 2023-10-13 03:19:49,060 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.68 vs. limit=5.0 2023-10-13 03:19:56,083 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.410e+02 1.789e+02 1.996e+02 2.267e+02 3.207e+02, threshold=3.992e+02, percent-clipped=0.0 2023-10-13 03:19:59,642 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1263014.6666666667, ans=0.05 2023-10-13 03:20:02,056 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.97 vs. limit=22.5 2023-10-13 03:20:10,375 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.61 vs. limit=15.0 2023-10-13 03:20:29,007 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.34 vs. limit=10.0 2023-10-13 03:20:37,605 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1263154.6666666667, ans=0.0 2023-10-13 03:20:40,733 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-13 03:21:33,644 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1263341.3333333333, ans=0.1 2023-10-13 03:21:39,556 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.42 vs. 
limit=22.5 2023-10-13 03:21:40,598 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1263341.3333333333, ans=0.0 2023-10-13 03:21:41,597 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1263341.3333333333, ans=0.125 2023-10-13 03:21:53,852 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.54 vs. limit=10.0 2023-10-13 03:22:04,297 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1263434.6666666667, ans=0.035 2023-10-13 03:22:08,475 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.22 vs. limit=15.0 2023-10-13 03:22:13,959 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.417e+02 1.706e+02 1.905e+02 2.091e+02 2.802e+02, threshold=3.811e+02, percent-clipped=0.0 2023-10-13 03:22:21,730 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1263528.0, ans=0.125 2023-10-13 03:22:23,656 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1263528.0, ans=0.0 2023-10-13 03:22:46,071 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1263621.3333333333, ans=0.04949747468305833 2023-10-13 03:22:50,250 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1263621.3333333333, ans=0.0 2023-10-13 03:23:10,248 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1263714.6666666667, ans=0.125 2023-10-13 03:23:20,593 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1263761.3333333333, ans=0.125 2023-10-13 03:23:42,042 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1263808.0, ans=0.0 2023-10-13 03:24:13,003 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.485e+02 1.770e+02 1.930e+02 2.181e+02 3.103e+02, threshold=3.859e+02, percent-clipped=0.0 2023-10-13 03:24:14,625 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1263948.0, ans=0.125 2023-10-13 03:24:30,697 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.62 vs. limit=15.0 2023-10-13 03:24:56,405 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1264088.0, ans=0.0 2023-10-13 03:26:16,763 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1264368.0, ans=0.0 2023-10-13 03:26:18,914 INFO [train.py:1031] (3/4) Epoch 20, batch 11500, loss[loss=0.2013, simple_loss=0.3024, pruned_loss=0.05011, over 16818.00 frames. ], tot_loss[loss=0.1898, simple_loss=0.2815, pruned_loss=0.04902, over 32704874.07 frames. 
], batch size: 87, lr: 1.72e-03, grad_scale: 16.0 2023-10-13 03:26:21,865 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1264414.6666666667, ans=0.1 2023-10-13 03:26:23,573 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.516e+02 1.873e+02 2.038e+02 2.286e+02 3.197e+02, threshold=4.076e+02, percent-clipped=0.0 2023-10-13 03:26:30,642 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1264461.3333333333, ans=0.0 2023-10-13 03:26:35,008 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1264461.3333333333, ans=0.125 2023-10-13 03:26:36,063 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 03:27:23,026 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=3.85 vs. limit=10.0 2023-10-13 03:27:27,389 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1264648.0, ans=0.2 2023-10-13 03:27:50,610 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1264741.3333333333, ans=0.125 2023-10-13 03:27:52,083 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1264741.3333333333, ans=0.1 2023-10-13 03:28:25,447 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1264881.3333333333, ans=0.125 2023-10-13 03:28:28,474 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.479e+02 1.854e+02 2.159e+02 2.490e+02 6.222e+02, threshold=4.318e+02, percent-clipped=1.0 2023-10-13 03:28:42,403 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.44 vs. limit=15.0 2023-10-13 03:29:22,583 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1265114.6666666667, ans=10.0 2023-10-13 03:29:28,834 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1265114.6666666667, ans=0.125 2023-10-13 03:29:40,312 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.42 vs. limit=15.0 2023-10-13 03:30:04,163 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.88 vs. limit=12.0 2023-10-13 03:30:06,704 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1265301.3333333333, ans=0.125 2023-10-13 03:30:15,313 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.06 vs. 
limit=12.0 2023-10-13 03:30:22,005 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.426e+02 1.804e+02 2.084e+02 2.368e+02 3.797e+02, threshold=4.168e+02, percent-clipped=0.0 2023-10-13 03:30:30,445 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1265394.6666666667, ans=0.125 2023-10-13 03:30:34,642 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_na.min_abs, batch_count=1265394.6666666667, ans=0.02 2023-10-13 03:31:00,679 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1265488.0, ans=0.035 2023-10-13 03:31:30,759 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1265581.3333333333, ans=0.0 2023-10-13 03:31:36,752 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1265628.0, ans=0.125 2023-10-13 03:31:47,372 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1265628.0, ans=0.125 2023-10-13 03:31:57,636 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.76 vs. limit=6.0 2023-10-13 03:32:03,118 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1265674.6666666667, ans=0.125 2023-10-13 03:32:13,419 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.71 vs. limit=6.0 2023-10-13 03:32:25,221 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1265768.0, ans=0.1 2023-10-13 03:32:35,399 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.519e+02 1.814e+02 1.968e+02 2.114e+02 3.008e+02, threshold=3.936e+02, percent-clipped=0.0 2023-10-13 03:32:43,973 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1265861.3333333333, ans=0.1 2023-10-13 03:32:52,659 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 03:33:38,992 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1266048.0, ans=0.125 2023-10-13 03:33:46,253 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1266094.6666666667, ans=0.125 2023-10-13 03:33:55,280 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1266141.3333333333, ans=0.2 2023-10-13 03:34:16,411 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1266188.0, ans=0.0 2023-10-13 03:34:38,929 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1266281.3333333333, ans=0.2 2023-10-13 03:34:42,557 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.443e+02 1.756e+02 1.921e+02 2.146e+02 4.517e+02, threshold=3.842e+02, percent-clipped=1.0 2023-10-13 03:34:43,138 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1266281.3333333333, ans=0.1 2023-10-13 03:34:46,249 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1266281.3333333333, ans=0.0 2023-10-13 03:34:57,909 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1266328.0, ans=0.1 2023-10-13 03:35:09,121 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=1266374.6666666667, ans=0.05 2023-10-13 03:35:24,442 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1266468.0, ans=0.1 2023-10-13 03:35:37,430 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1266514.6666666667, ans=0.125 2023-10-13 03:35:38,447 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1266514.6666666667, ans=0.125 2023-10-13 03:35:47,184 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=1266561.3333333333, ans=22.5 2023-10-13 03:35:52,954 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1266561.3333333333, ans=0.2 2023-10-13 03:36:15,823 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.82 vs. limit=22.5 2023-10-13 03:36:16,208 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1266654.6666666667, ans=0.125 2023-10-13 03:36:19,038 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1266654.6666666667, ans=0.0 2023-10-13 03:36:26,245 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.54 vs. limit=10.0 2023-10-13 03:36:33,932 INFO [train.py:1031] (3/4) Epoch 20, batch 12000, loss[loss=0.1723, simple_loss=0.2699, pruned_loss=0.03739, over 16925.00 frames. ], tot_loss[loss=0.1896, simple_loss=0.2815, pruned_loss=0.04884, over 32712687.47 frames. ], batch size: 93, lr: 1.71e-03, grad_scale: 32.0 2023-10-13 03:36:40,703 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.473e+02 1.820e+02 2.004e+02 2.272e+02 2.954e+02, threshold=4.008e+02, percent-clipped=0.0 2023-10-13 03:36:41,694 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.78 vs. 
limit=8.0 2023-10-13 03:36:54,106 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1266794.6666666667, ans=0.0 2023-10-13 03:36:56,074 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1266794.6666666667, ans=0.0 2023-10-13 03:37:04,257 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1266841.3333333333, ans=0.125 2023-10-13 03:37:15,692 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1266888.0, ans=0.125 2023-10-13 03:37:32,908 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1266981.3333333333, ans=0.125 2023-10-13 03:37:40,266 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1266981.3333333333, ans=0.0 2023-10-13 03:37:42,504 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.13 vs. limit=22.5 2023-10-13 03:37:47,222 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1267028.0, ans=0.125 2023-10-13 03:37:57,389 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1267074.6666666667, ans=0.125 2023-10-13 03:38:09,536 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1267121.3333333333, ans=0.1 2023-10-13 03:38:17,219 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff2.min_abs, batch_count=1267121.3333333333, ans=0.1 2023-10-13 03:38:26,680 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1267168.0, ans=0.125 2023-10-13 03:38:32,554 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.55 vs. limit=15.0 2023-10-13 03:38:34,346 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1267214.6666666667, ans=0.125 2023-10-13 03:38:36,116 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.466e+02 1.727e+02 2.014e+02 2.203e+02 3.512e+02, threshold=4.029e+02, percent-clipped=0.0 2023-10-13 03:38:44,938 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1267261.3333333333, ans=0.2 2023-10-13 03:39:02,406 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.16 vs. limit=22.5 2023-10-13 03:39:07,130 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1267354.6666666667, ans=0.125 2023-10-13 03:39:20,229 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1267401.3333333333, ans=0.125 2023-10-13 03:39:25,616 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.05 vs. 
limit=15.0 2023-10-13 03:39:33,392 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1267448.0, ans=0.125 2023-10-13 03:39:44,779 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1267494.6666666667, ans=0.2 2023-10-13 03:39:46,621 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1267494.6666666667, ans=0.125 2023-10-13 03:39:49,076 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1267541.3333333333, ans=0.125 2023-10-13 03:40:28,598 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.511e+02 1.768e+02 1.921e+02 2.102e+02 2.809e+02, threshold=3.842e+02, percent-clipped=0.0 2023-10-13 03:40:56,716 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1267774.6666666667, ans=0.125 2023-10-13 03:41:07,748 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1267821.3333333333, ans=0.0 2023-10-13 03:41:12,047 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1267868.0, ans=10.0 2023-10-13 03:41:24,337 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.29 vs. limit=22.5 2023-10-13 03:41:28,454 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1267914.6666666667, ans=0.2 2023-10-13 03:41:37,468 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1267961.3333333333, ans=0.1 2023-10-13 03:41:42,549 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.whiten.whitening_limit, batch_count=1267961.3333333333, ans=12.0 2023-10-13 03:41:52,931 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1268008.0, ans=0.125 2023-10-13 03:42:03,004 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1268054.6666666667, ans=0.2 2023-10-13 03:42:19,702 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1268101.3333333333, ans=0.04949747468305833 2023-10-13 03:42:27,752 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.452e+02 1.780e+02 1.925e+02 2.092e+02 3.220e+02, threshold=3.850e+02, percent-clipped=0.0 2023-10-13 03:42:40,710 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1268194.6666666667, ans=0.0 2023-10-13 03:42:41,643 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1268194.6666666667, ans=0.1 2023-10-13 03:42:52,276 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 03:43:17,107 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.19 vs. 
limit=15.0 2023-10-13 03:43:39,743 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=11.95 vs. limit=22.5 2023-10-13 03:43:55,737 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1268474.6666666667, ans=0.125 2023-10-13 03:44:00,162 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1268521.3333333333, ans=0.04949747468305833 2023-10-13 03:44:05,933 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.78 vs. limit=15.0 2023-10-13 03:44:10,588 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1268568.0, ans=0.125 2023-10-13 03:44:25,742 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.435e+02 1.754e+02 1.912e+02 2.107e+02 3.731e+02, threshold=3.825e+02, percent-clipped=0.0 2023-10-13 03:44:28,458 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.54 vs. limit=22.5 2023-10-13 03:44:40,283 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1268708.0, ans=0.125 2023-10-13 03:44:43,210 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.94 vs. limit=15.0 2023-10-13 03:44:43,583 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1268708.0, ans=0.125 2023-10-13 03:44:51,438 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1268754.6666666667, ans=0.125 2023-10-13 03:45:06,719 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.58 vs. limit=15.0 2023-10-13 03:45:14,611 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=14.78 vs. limit=22.5 2023-10-13 03:45:18,730 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1268848.0, ans=0.125 2023-10-13 03:45:21,635 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1268848.0, ans=0.125 2023-10-13 03:45:24,928 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.58 vs. limit=15.0 2023-10-13 03:45:30,452 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.57 vs. limit=15.0 2023-10-13 03:45:37,842 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.88 vs. 
limit=12.0 2023-10-13 03:46:03,193 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1269034.6666666667, ans=0.125 2023-10-13 03:46:13,893 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1269081.3333333333, ans=0.2 2023-10-13 03:46:14,566 INFO [train.py:1031] (3/4) Epoch 20, batch 12500, loss[loss=0.1817, simple_loss=0.2794, pruned_loss=0.04206, over 16864.00 frames. ], tot_loss[loss=0.1895, simple_loss=0.2812, pruned_loss=0.04888, over 32755373.25 frames. ], batch size: 87, lr: 1.71e-03, grad_scale: 16.0 2023-10-13 03:46:23,304 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.503e+02 1.777e+02 1.884e+02 2.089e+02 2.813e+02, threshold=3.768e+02, percent-clipped=0.0 2023-10-13 03:46:23,672 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1269081.3333333333, ans=0.1 2023-10-13 03:46:28,740 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1269128.0, ans=0.0 2023-10-13 03:46:32,717 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1269128.0, ans=0.0 2023-10-13 03:46:38,831 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.40 vs. limit=22.5 2023-10-13 03:46:48,764 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1269221.3333333333, ans=0.0 2023-10-13 03:46:53,448 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1269221.3333333333, ans=0.125 2023-10-13 03:47:00,187 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1269268.0, ans=10.0 2023-10-13 03:48:18,953 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.512e+02 1.776e+02 2.009e+02 2.261e+02 3.564e+02, threshold=4.018e+02, percent-clipped=0.0 2023-10-13 03:48:22,168 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=14.02 vs. limit=15.0 2023-10-13 03:48:26,591 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1269594.6666666667, ans=0.0 2023-10-13 03:48:27,720 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.42 vs. limit=15.0 2023-10-13 03:48:31,385 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1269594.6666666667, ans=0.2 2023-10-13 03:48:38,769 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.73 vs. limit=15.0 2023-10-13 03:48:59,979 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff2.min_abs, batch_count=1269734.6666666667, ans=0.1 2023-10-13 03:49:09,465 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.61 vs. 
limit=6.0 2023-10-13 03:49:10,720 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.43 vs. limit=15.0 2023-10-13 03:49:11,945 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1269781.3333333333, ans=0.125 2023-10-13 03:50:01,487 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.52 vs. limit=12.0 2023-10-13 03:50:19,416 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.400e+02 1.780e+02 1.934e+02 2.131e+02 2.836e+02, threshold=3.868e+02, percent-clipped=0.0 2023-10-13 03:50:50,817 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1270154.6666666667, ans=0.0 2023-10-13 03:50:57,049 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1270201.3333333333, ans=0.1 2023-10-13 03:51:04,905 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.69 vs. limit=15.0 2023-10-13 03:51:08,379 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1270201.3333333333, ans=0.0 2023-10-13 03:51:10,812 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1270248.0, ans=0.0 2023-10-13 03:51:11,072 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.35 vs. limit=15.0 2023-10-13 03:51:21,244 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1270294.6666666667, ans=0.04949747468305833 2023-10-13 03:51:27,891 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1270294.6666666667, ans=0.2 2023-10-13 03:51:35,804 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1270341.3333333333, ans=0.04949747468305833 2023-10-13 03:52:02,933 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1270434.6666666667, ans=0.0 2023-10-13 03:52:05,076 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1270434.6666666667, ans=0.2 2023-10-13 03:52:06,918 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1270434.6666666667, ans=0.125 2023-10-13 03:52:16,706 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.828e+02 2.012e+02 2.239e+02 3.274e+02, threshold=4.023e+02, percent-clipped=0.0 2023-10-13 03:52:21,846 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1270528.0, ans=0.015 2023-10-13 03:52:27,380 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1270528.0, ans=0.125 2023-10-13 03:53:19,983 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1270761.3333333333, ans=0.0 2023-10-13 03:54:03,071 INFO [scaling.py:199] (3/4) 
ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1270901.3333333333, ans=0.125 2023-10-13 03:54:07,934 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1270948.0, ans=0.0 2023-10-13 03:54:16,144 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.413e+02 1.752e+02 1.880e+02 2.181e+02 2.887e+02, threshold=3.760e+02, percent-clipped=0.0 2023-10-13 03:54:22,172 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1270994.6666666667, ans=0.125 2023-10-13 03:54:44,569 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1271088.0, ans=0.0 2023-10-13 03:55:03,554 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1271134.6666666667, ans=0.2 2023-10-13 03:55:07,395 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1271181.3333333333, ans=0.0 2023-10-13 03:55:45,117 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1271321.3333333333, ans=0.0 2023-10-13 03:56:01,399 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1271368.0, ans=0.125 2023-10-13 03:56:07,263 INFO [train.py:1031] (3/4) Epoch 20, batch 13000, loss[loss=0.1689, simple_loss=0.265, pruned_loss=0.03641, over 16989.00 frames. ], tot_loss[loss=0.1901, simple_loss=0.282, pruned_loss=0.04911, over 32761492.70 frames. ], batch size: 77, lr: 1.71e-03, grad_scale: 32.0 2023-10-13 03:56:14,795 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.475e+02 1.786e+02 1.965e+02 2.254e+02 3.104e+02, threshold=3.929e+02, percent-clipped=0.0 2023-10-13 03:56:17,097 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1271461.3333333333, ans=0.2 2023-10-13 03:56:20,207 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.34 vs. limit=15.0 2023-10-13 03:56:21,124 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1271461.3333333333, ans=0.1 2023-10-13 03:56:30,555 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1271461.3333333333, ans=0.125 2023-10-13 03:56:51,467 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1271554.6666666667, ans=0.04949747468305833 2023-10-13 03:57:01,468 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1271601.3333333333, ans=0.0 2023-10-13 03:57:01,642 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.61 vs. 
limit=15.0 2023-10-13 03:57:02,776 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1271601.3333333333, ans=0.09899494936611666 2023-10-13 03:57:12,489 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1271648.0, ans=0.2 2023-10-13 03:57:13,407 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1271648.0, ans=0.125 2023-10-13 03:57:16,709 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1271648.0, ans=0.0 2023-10-13 03:57:22,793 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1271648.0, ans=0.125 2023-10-13 03:57:23,903 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1271648.0, ans=0.2 2023-10-13 03:57:37,930 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.97 vs. limit=15.0 2023-10-13 03:57:45,561 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1271741.3333333333, ans=0.1 2023-10-13 03:57:49,103 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1271788.0, ans=0.125 2023-10-13 03:58:01,227 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1271834.6666666667, ans=0.2 2023-10-13 03:58:13,615 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1271881.3333333333, ans=0.0 2023-10-13 03:58:17,664 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1271881.3333333333, ans=0.125 2023-10-13 03:58:22,814 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.414e+02 1.762e+02 1.942e+02 2.238e+02 3.038e+02, threshold=3.884e+02, percent-clipped=0.0 2023-10-13 03:58:33,162 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1271928.0, ans=0.05 2023-10-13 03:58:34,153 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1271928.0, ans=0.5 2023-10-13 03:58:53,880 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1272021.3333333333, ans=0.1 2023-10-13 03:58:56,560 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.11 vs. 
limit=22.5 2023-10-13 03:59:12,300 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1272068.0, ans=0.2 2023-10-13 03:59:20,159 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1272114.6666666667, ans=0.0 2023-10-13 03:59:22,255 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1272114.6666666667, ans=0.0 2023-10-13 03:59:24,688 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1272161.3333333333, ans=0.125 2023-10-13 04:00:13,854 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1272348.0, ans=0.125 2023-10-13 04:00:23,769 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.423e+02 1.803e+02 1.963e+02 2.310e+02 3.342e+02, threshold=3.927e+02, percent-clipped=0.0 2023-10-13 04:00:51,696 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.25 vs. limit=15.0 2023-10-13 04:00:54,154 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1272488.0, ans=0.0 2023-10-13 04:01:06,856 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=27.19 vs. limit=22.5 2023-10-13 04:01:20,557 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1272581.3333333333, ans=0.125 2023-10-13 04:01:20,572 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1272581.3333333333, ans=0.2 2023-10-13 04:02:24,036 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.451e+02 1.833e+02 2.095e+02 2.469e+02 3.464e+02, threshold=4.189e+02, percent-clipped=0.0 2023-10-13 04:02:31,126 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=10.82 vs. limit=22.5 2023-10-13 04:02:33,143 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=8.25 vs. limit=12.0 2023-10-13 04:02:45,019 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1272908.0, ans=0.1 2023-10-13 04:02:50,262 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1272954.6666666667, ans=0.125 2023-10-13 04:02:50,547 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.95 vs. 
limit=22.5 2023-10-13 04:03:07,106 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1273001.3333333333, ans=0.125 2023-10-13 04:03:11,751 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1273048.0, ans=0.2 2023-10-13 04:03:26,475 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1273094.6666666667, ans=0.125 2023-10-13 04:03:37,010 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1273141.3333333333, ans=0.1 2023-10-13 04:03:37,411 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.36 vs. limit=15.0 2023-10-13 04:03:41,295 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1273141.3333333333, ans=0.1 2023-10-13 04:04:04,746 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.27 vs. limit=15.0 2023-10-13 04:04:07,903 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1273281.3333333333, ans=0.125 2023-10-13 04:04:21,588 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.475e+02 1.808e+02 1.946e+02 2.168e+02 2.800e+02, threshold=3.893e+02, percent-clipped=0.0 2023-10-13 04:04:38,054 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1273374.6666666667, ans=0.2 2023-10-13 04:04:38,214 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.18 vs. limit=15.0 2023-10-13 04:04:49,449 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.72 vs. limit=15.0 2023-10-13 04:04:52,515 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.62 vs. limit=12.0 2023-10-13 04:05:21,737 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.75 vs. limit=15.0 2023-10-13 04:05:22,508 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1273561.3333333333, ans=0.2 2023-10-13 04:05:57,510 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1273701.3333333333, ans=0.125 2023-10-13 04:06:06,178 INFO [train.py:1031] (3/4) Epoch 20, batch 13500, loss[loss=0.1962, simple_loss=0.2797, pruned_loss=0.05639, over 16608.00 frames. ], tot_loss[loss=0.1898, simple_loss=0.2815, pruned_loss=0.04904, over 32771034.17 frames. 
], batch size: 61, lr: 1.71e-03, grad_scale: 16.0 2023-10-13 04:06:15,990 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.427e+02 1.737e+02 1.875e+02 2.025e+02 2.814e+02, threshold=3.750e+02, percent-clipped=0.0 2023-10-13 04:06:39,602 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 04:06:48,078 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1273888.0, ans=0.125 2023-10-13 04:07:12,892 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1273981.3333333333, ans=0.0 2023-10-13 04:07:49,464 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1274121.3333333333, ans=0.125 2023-10-13 04:07:54,092 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1274168.0, ans=0.1 2023-10-13 04:08:00,216 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1274168.0, ans=0.0 2023-10-13 04:08:10,662 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1274214.6666666667, ans=0.125 2023-10-13 04:08:16,496 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.465e+02 1.871e+02 2.015e+02 2.247e+02 3.274e+02, threshold=4.030e+02, percent-clipped=0.0 2023-10-13 04:08:20,934 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1274261.3333333333, ans=0.125 2023-10-13 04:08:21,949 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1274261.3333333333, ans=0.95 2023-10-13 04:08:27,609 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.71 vs. limit=6.0 2023-10-13 04:08:42,263 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1274354.6666666667, ans=0.125 2023-10-13 04:08:46,104 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.66 vs. limit=15.0 2023-10-13 04:08:52,562 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 04:08:53,398 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1274401.3333333333, ans=0.0 2023-10-13 04:08:53,525 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1274401.3333333333, ans=10.0 2023-10-13 04:09:36,854 INFO [train.py:1031] (3/4) Epoch 21, batch 0, loss[loss=0.182, simple_loss=0.2716, pruned_loss=0.04621, over 16620.00 frames. ], tot_loss[loss=0.182, simple_loss=0.2716, pruned_loss=0.04621, over 16620.00 frames. ], batch size: 61, lr: 1.67e-03, grad_scale: 32.0 2023-10-13 04:09:36,856 INFO [train.py:1054] (3/4) Computing validation loss 2023-10-13 04:09:46,466 INFO [train.py:1063] (3/4) Epoch 21, validation: loss=0.2147, simple_loss=0.3014, pruned_loss=0.06396, over 1020973.00 frames. 
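A note on the recurring optim.py records: each "Clipping_scale=2.0, grad-norm quartiles ... threshold=... percent-clipped=..." line is a five-point summary (min, 25th, 50th, 75th percentile, max) of recently observed gradient norms, together with the clipping threshold derived from them. In every such record in this section the threshold is exactly Clipping_scale times the middle quartile (for example 2.0 x 1.875e+02 = 3.750e+02 just above), so clipping only engages on gradients more than twice the recent median norm, and percent-clipped reports how often that happened. A minimal sketch of that bookkeeping, assuming a simple sliding window of per-step gradient norms (the class name, the window length, and the rescale-in-place behaviour are illustrative guesses, not read from optim.py):

    import torch
    from collections import deque

    class GradNormClipper:
        # Hypothetical reconstruction of the logged behaviour:
        # threshold = clipping_scale * median of recent gradient norms.
        def __init__(self, clipping_scale: float = 2.0, window: int = 128):
            self.clipping_scale = clipping_scale
            self.norms = deque(maxlen=window)

        def step(self, params) -> float:
            params = list(params)
            grads = [p.grad.reshape(-1) for p in params if p.grad is not None]
            norm = torch.cat(grads).norm().item()
            self.norms.append(norm)
            hist = torch.tensor(list(self.norms))
            # Five-point summary matching the "grad-norm quartiles" log field.
            quartiles = torch.quantile(
                hist, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
            threshold = self.clipping_scale * quartiles[2].item()
            if norm > threshold:
                for p in params:
                    if p.grad is not None:
                        p.grad.mul_(threshold / norm)  # scale down, don't zero
            return threshold

Because the threshold tracks the running median, it drifts with the loss landscape (here roughly 3.7e+02 to 4.3e+02 across the section) rather than being a fixed hyperparameter, which is why percent-clipped stays near zero in steady-state training.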
2023-10-13 04:09:46,466 INFO [train.py:1064] (3/4) Maximum memory allocated so far is 16891MB 2023-10-13 04:09:59,533 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1274518.0, ans=0.125 2023-10-13 04:10:22,370 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1274611.3333333333, ans=0.2 2023-10-13 04:10:40,207 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1274658.0, ans=0.125 2023-10-13 04:10:41,062 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=1274658.0, ans=10.0 2023-10-13 04:10:46,017 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.88 vs. limit=15.0 2023-10-13 04:10:48,224 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.54 vs. limit=12.0 2023-10-13 04:10:52,999 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.345e+02 1.778e+02 1.911e+02 2.109e+02 4.365e+02, threshold=3.822e+02, percent-clipped=1.0 2023-10-13 04:10:53,384 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1274704.6666666667, ans=0.0 2023-10-13 04:11:26,319 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1274844.6666666667, ans=0.125 2023-10-13 04:12:25,741 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1275078.0, ans=0.0 2023-10-13 04:12:25,772 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1275078.0, ans=0.125 2023-10-13 04:12:32,487 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1275078.0, ans=0.125 2023-10-13 04:12:34,703 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1275124.6666666667, ans=0.125 2023-10-13 04:12:49,500 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.63 vs. 
limit=10.0 2023-10-13 04:12:51,948 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.461e+02 1.709e+02 1.856e+02 1.998e+02 2.635e+02, threshold=3.712e+02, percent-clipped=0.0 2023-10-13 04:13:08,691 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1275264.6666666667, ans=0.0 2023-10-13 04:13:09,598 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1275264.6666666667, ans=0.125 2023-10-13 04:13:36,819 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1275358.0, ans=0.125 2023-10-13 04:13:43,346 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1275358.0, ans=0.125 2023-10-13 04:14:02,481 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.92 vs. limit=22.5 2023-10-13 04:14:03,143 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1275451.3333333333, ans=0.125 2023-10-13 04:14:06,493 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 04:14:42,268 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1275591.3333333333, ans=0.0 2023-10-13 04:14:53,309 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.444e+02 1.757e+02 1.953e+02 2.230e+02 3.266e+02, threshold=3.906e+02, percent-clipped=0.0 2023-10-13 04:14:59,255 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1275684.6666666667, ans=0.1 2023-10-13 04:15:39,766 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.22 vs. limit=15.0 2023-10-13 04:15:55,420 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.33 vs. limit=22.5 2023-10-13 04:16:16,727 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1275964.6666666667, ans=0.125 2023-10-13 04:16:20,033 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1275964.6666666667, ans=0.125 2023-10-13 04:16:27,834 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.44 vs. 
limit=12.0 2023-10-13 04:16:41,092 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1276058.0, ans=0.125 2023-10-13 04:16:44,478 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1276058.0, ans=0.125 2023-10-13 04:16:52,038 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1276104.6666666667, ans=0.2 2023-10-13 04:16:53,671 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.565e+02 1.746e+02 1.936e+02 2.165e+02 2.957e+02, threshold=3.871e+02, percent-clipped=0.0 2023-10-13 04:17:32,411 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.29 vs. limit=15.0 2023-10-13 04:17:46,740 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1276338.0, ans=0.125 2023-10-13 04:17:53,448 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1276338.0, ans=0.125 2023-10-13 04:18:09,015 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1276431.3333333333, ans=0.125 2023-10-13 04:18:09,455 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.85 vs. limit=15.0 2023-10-13 04:18:25,697 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=1276478.0, ans=10.0 2023-10-13 04:18:39,700 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1276524.6666666667, ans=0.1 2023-10-13 04:18:53,315 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.469e+02 1.858e+02 2.031e+02 2.356e+02 2.993e+02, threshold=4.062e+02, percent-clipped=0.0 2023-10-13 04:18:58,473 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1276618.0, ans=0.125 2023-10-13 04:19:03,148 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1276618.0, ans=0.125 2023-10-13 04:19:04,299 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1276618.0, ans=0.125 2023-10-13 04:19:21,316 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1276664.6666666667, ans=0.2 2023-10-13 04:19:29,002 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1276711.3333333333, ans=0.04949747468305833 2023-10-13 04:19:45,509 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1276758.0, ans=0.07 2023-10-13 04:19:47,532 INFO [train.py:1031] (3/4) Epoch 21, batch 500, loss[loss=0.1709, simple_loss=0.2686, pruned_loss=0.03663, over 16842.00 frames. ], tot_loss[loss=0.1917, simple_loss=0.2825, pruned_loss=0.05044, over 7256250.64 frames. 
], batch size: 188, lr: 1.66e-03, grad_scale: 16.0 2023-10-13 04:19:51,298 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.14 vs. limit=15.0 2023-10-13 04:19:52,069 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1276804.6666666667, ans=0.125 2023-10-13 04:20:00,524 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.88 vs. limit=15.0 2023-10-13 04:20:09,275 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1276851.3333333333, ans=0.09899494936611666 2023-10-13 04:20:22,160 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1276944.6666666667, ans=0.1 2023-10-13 04:20:31,325 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1276944.6666666667, ans=0.125 2023-10-13 04:20:38,482 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 04:20:50,931 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1277038.0, ans=0.125 2023-10-13 04:20:53,752 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.390e+02 1.756e+02 1.988e+02 2.304e+02 3.336e+02, threshold=3.975e+02, percent-clipped=0.0 2023-10-13 04:20:58,344 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1277084.6666666667, ans=0.125 2023-10-13 04:21:07,785 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.98 vs. limit=22.5 2023-10-13 04:21:10,793 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1277131.3333333333, ans=0.0 2023-10-13 04:21:28,601 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1277178.0, ans=0.125 2023-10-13 04:21:36,642 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1277224.6666666667, ans=0.1 2023-10-13 04:21:52,859 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 04:22:02,817 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.18 vs. limit=15.0 2023-10-13 04:22:27,131 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=5.72 vs. 
limit=15.0 2023-10-13 04:22:54,084 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.498e+02 1.875e+02 2.120e+02 2.491e+02 4.231e+02, threshold=4.239e+02, percent-clipped=1.0 2023-10-13 04:22:55,383 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1277504.6666666667, ans=0.125 2023-10-13 04:23:02,011 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1277551.3333333333, ans=0.0 2023-10-13 04:23:06,515 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1277551.3333333333, ans=0.1 2023-10-13 04:23:28,411 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1277644.6666666667, ans=0.2 2023-10-13 04:23:36,659 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1277691.3333333333, ans=0.125 2023-10-13 04:23:43,141 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1277691.3333333333, ans=0.05 2023-10-13 04:23:44,543 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.49 vs. limit=12.0 2023-10-13 04:23:47,163 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1277738.0, ans=0.1 2023-10-13 04:23:59,612 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1277784.6666666667, ans=0.2 2023-10-13 04:24:46,061 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.99 vs. limit=15.0 2023-10-13 04:24:53,703 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 04:24:53,751 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1277971.3333333333, ans=0.125 2023-10-13 04:24:54,429 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.475e+02 1.835e+02 1.986e+02 2.230e+02 3.170e+02, threshold=3.972e+02, percent-clipped=0.0 2023-10-13 04:24:58,889 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1278018.0, ans=0.2 2023-10-13 04:25:25,991 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.93 vs. limit=15.0 2023-10-13 04:25:53,027 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1278204.6666666667, ans=0.1 2023-10-13 04:25:53,805 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1278251.3333333333, ans=0.125 2023-10-13 04:26:23,851 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.90 vs. 
limit=15.0 2023-10-13 04:26:31,582 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1278344.6666666667, ans=0.125 2023-10-13 04:26:39,324 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.79 vs. limit=15.0 2023-10-13 04:26:51,662 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1278438.0, ans=0.0 2023-10-13 04:26:53,572 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.516e+02 1.732e+02 1.908e+02 2.131e+02 2.938e+02, threshold=3.816e+02, percent-clipped=0.0 2023-10-13 04:26:56,887 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1278484.6666666667, ans=0.125 2023-10-13 04:27:23,008 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1278578.0, ans=0.125 2023-10-13 04:27:28,188 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1278578.0, ans=0.125 2023-10-13 04:27:31,365 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1278578.0, ans=0.1 2023-10-13 04:27:31,625 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=5.28 vs. limit=15.0 2023-10-13 04:27:32,392 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1278578.0, ans=0.2 2023-10-13 04:27:33,192 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1278624.6666666667, ans=0.1 2023-10-13 04:27:37,369 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1278624.6666666667, ans=0.0 2023-10-13 04:27:50,392 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=1278671.3333333333, ans=10.0 2023-10-13 04:27:51,231 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.27 vs. 
limit=15.0 2023-10-13 04:27:53,127 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1278671.3333333333, ans=0.125 2023-10-13 04:27:56,407 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1278671.3333333333, ans=0.0 2023-10-13 04:28:08,875 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1278764.6666666667, ans=10.0 2023-10-13 04:28:11,737 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1278764.6666666667, ans=0.025 2023-10-13 04:28:13,706 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1278764.6666666667, ans=0.0 2023-10-13 04:28:21,613 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1278811.3333333333, ans=0.125 2023-10-13 04:28:29,421 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1278811.3333333333, ans=0.125 2023-10-13 04:28:31,762 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1278811.3333333333, ans=0.5 2023-10-13 04:28:32,023 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.13 vs. limit=6.0 2023-10-13 04:28:58,091 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.453e+02 1.776e+02 1.919e+02 2.127e+02 2.738e+02, threshold=3.837e+02, percent-clipped=0.0 2023-10-13 04:29:01,334 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1278951.3333333333, ans=0.1 2023-10-13 04:29:11,365 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1278998.0, ans=0.125 2023-10-13 04:29:15,286 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1278998.0, ans=0.125 2023-10-13 04:29:30,669 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1279044.6666666667, ans=0.125 2023-10-13 04:29:37,877 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.35 vs. limit=15.0 2023-10-13 04:29:37,998 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.52 vs. limit=22.5 2023-10-13 04:29:40,205 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.69 vs. limit=15.0 2023-10-13 04:29:43,787 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1279091.3333333333, ans=0.125 2023-10-13 04:29:47,156 INFO [train.py:1031] (3/4) Epoch 21, batch 1000, loss[loss=0.1993, simple_loss=0.2904, pruned_loss=0.05413, over 16913.00 frames. ], tot_loss[loss=0.1917, simple_loss=0.2829, pruned_loss=0.05023, over 12896250.28 frames. 
], batch size: 138, lr: 1.66e-03, grad_scale: 16.0 2023-10-13 04:30:08,254 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1279184.6666666667, ans=0.125 2023-10-13 04:30:52,533 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.466e+02 1.791e+02 1.992e+02 2.270e+02 3.268e+02, threshold=3.983e+02, percent-clipped=0.0 2023-10-13 04:31:08,736 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.76 vs. limit=6.0 2023-10-13 04:31:13,538 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=1279464.6666666667, ans=0.025 2023-10-13 04:31:18,235 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1279464.6666666667, ans=0.0 2023-10-13 04:31:18,990 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1279464.6666666667, ans=0.05 2023-10-13 04:31:34,765 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.27 vs. limit=12.0 2023-10-13 04:31:41,745 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1279558.0, ans=0.125 2023-10-13 04:32:03,012 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.25 vs. limit=12.0 2023-10-13 04:32:09,705 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=1279651.3333333333, ans=0.05 2023-10-13 04:32:20,914 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.53 vs. 
limit=22.5 2023-10-13 04:32:51,611 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1279838.0, ans=0.0 2023-10-13 04:32:51,799 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1279838.0, ans=0.125 2023-10-13 04:32:54,327 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1279838.0, ans=0.95 2023-10-13 04:32:54,459 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1279838.0, ans=0.2 2023-10-13 04:33:01,116 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.441e+02 1.739e+02 1.878e+02 2.081e+02 3.347e+02, threshold=3.755e+02, percent-clipped=0.0 2023-10-13 04:33:03,210 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1279838.0, ans=0.125 2023-10-13 04:33:19,070 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1279884.6666666667, ans=0.1 2023-10-13 04:33:51,493 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1280024.6666666667, ans=0.125 2023-10-13 04:34:18,302 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1280118.0, ans=0.125 2023-10-13 04:34:21,706 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1280118.0, ans=0.0 2023-10-13 04:34:22,572 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1280118.0, ans=0.125 2023-10-13 04:34:33,934 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1280164.6666666667, ans=0.2 2023-10-13 04:35:07,639 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=5.01 vs. 
limit=12.0 2023-10-13 04:35:07,911 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.435e+02 1.671e+02 1.804e+02 1.970e+02 2.415e+02, threshold=3.609e+02, percent-clipped=0.0 2023-10-13 04:35:10,139 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1280351.3333333333, ans=0.0 2023-10-13 04:35:12,574 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1280351.3333333333, ans=0.1 2023-10-13 04:35:31,646 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1280444.6666666667, ans=10.0 2023-10-13 04:35:32,666 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1280444.6666666667, ans=0.0 2023-10-13 04:35:39,401 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1280444.6666666667, ans=0.125 2023-10-13 04:35:51,564 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1280491.3333333333, ans=0.125 2023-10-13 04:36:00,569 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1280538.0, ans=0.1 2023-10-13 04:36:18,119 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.20 vs. limit=15.0 2023-10-13 04:36:48,387 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1280724.6666666667, ans=0.125 2023-10-13 04:36:59,853 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=1280771.3333333333, ans=10.0 2023-10-13 04:37:04,295 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.349e+02 1.753e+02 1.910e+02 2.196e+02 3.250e+02, threshold=3.819e+02, percent-clipped=0.0 2023-10-13 04:37:14,380 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.29 vs. limit=22.5 2023-10-13 04:37:44,201 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1280958.0, ans=0.125 2023-10-13 04:37:47,577 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1280958.0, ans=0.0 2023-10-13 04:37:56,384 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1281004.6666666667, ans=0.0 2023-10-13 04:38:09,892 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=5.70 vs. 
limit=15.0 2023-10-13 04:38:16,668 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1281051.3333333333, ans=0.125 2023-10-13 04:38:36,789 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1281144.6666666667, ans=0.0 2023-10-13 04:38:39,817 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=6.44 vs. limit=15.0 2023-10-13 04:38:46,550 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.94 vs. limit=15.0 2023-10-13 04:38:57,453 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1281191.3333333333, ans=0.125 2023-10-13 04:38:59,389 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1281238.0, ans=0.0 2023-10-13 04:39:08,566 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.416e+02 1.727e+02 1.845e+02 2.055e+02 2.595e+02, threshold=3.690e+02, percent-clipped=0.0 2023-10-13 04:39:13,038 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.41 vs. limit=6.0 2023-10-13 04:39:47,671 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1281378.0, ans=0.125 2023-10-13 04:39:58,853 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1281424.6666666667, ans=0.125 2023-10-13 04:39:59,115 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.52 vs. limit=15.0 2023-10-13 04:40:03,462 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1281471.3333333333, ans=0.1 2023-10-13 04:40:04,134 INFO [train.py:1031] (3/4) Epoch 21, batch 1500, loss[loss=0.179, simple_loss=0.2763, pruned_loss=0.04084, over 16941.00 frames. ], tot_loss[loss=0.1897, simple_loss=0.281, pruned_loss=0.04922, over 17314119.94 frames. 
], batch size: 104, lr: 1.66e-03, grad_scale: 16.0 2023-10-13 04:40:06,687 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1281471.3333333333, ans=0.2 2023-10-13 04:40:16,126 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1281518.0, ans=0.125 2023-10-13 04:40:25,816 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1281564.6666666667, ans=0.0 2023-10-13 04:40:37,199 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1281564.6666666667, ans=0.125 2023-10-13 04:40:43,344 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=1281611.3333333333, ans=0.95 2023-10-13 04:40:43,523 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1281611.3333333333, ans=0.125 2023-10-13 04:40:45,418 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1281611.3333333333, ans=0.125 2023-10-13 04:40:58,153 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1281658.0, ans=0.125 2023-10-13 04:40:59,532 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1281658.0, ans=0.125 2023-10-13 04:41:16,475 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1281704.6666666667, ans=0.125 2023-10-13 04:41:18,299 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.31 vs. 
limit=15.0 2023-10-13 04:41:18,728 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.483e+02 1.727e+02 1.896e+02 2.100e+02 2.801e+02, threshold=3.792e+02, percent-clipped=0.0 2023-10-13 04:41:31,482 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1281798.0, ans=0.0 2023-10-13 04:41:56,899 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1281891.3333333333, ans=0.0 2023-10-13 04:41:58,924 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1281891.3333333333, ans=0.125 2023-10-13 04:42:00,760 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1281891.3333333333, ans=0.125 2023-10-13 04:42:01,899 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1281891.3333333333, ans=0.125 2023-10-13 04:42:02,901 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=1281891.3333333333, ans=0.025 2023-10-13 04:42:19,977 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1281938.0, ans=0.125 2023-10-13 04:42:49,889 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1282078.0, ans=0.2 2023-10-13 04:43:24,228 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.52 vs. limit=15.0 2023-10-13 04:43:32,097 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.467e+02 1.785e+02 1.977e+02 2.237e+02 3.526e+02, threshold=3.953e+02, percent-clipped=0.0 2023-10-13 04:44:49,754 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1282498.0, ans=0.125 2023-10-13 04:44:53,215 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.18 vs. limit=15.0 2023-10-13 04:44:58,143 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1282498.0, ans=0.125 2023-10-13 04:45:07,344 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1282544.6666666667, ans=0.125 2023-10-13 04:45:15,260 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.98 vs. limit=6.0 2023-10-13 04:45:17,864 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1282591.3333333333, ans=0.125 2023-10-13 04:45:33,247 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.59 vs. limit=15.0 2023-10-13 04:45:34,839 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.306e+02 1.761e+02 1.901e+02 2.067e+02 2.761e+02, threshold=3.802e+02, percent-clipped=0.0 2023-10-13 04:45:37,169 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.24 vs. 
limit=15.0 2023-10-13 04:46:03,841 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1282731.3333333333, ans=10.0 2023-10-13 04:46:11,985 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=5.87 vs. limit=15.0 2023-10-13 04:46:39,971 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1282871.3333333333, ans=0.125 2023-10-13 04:46:53,602 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1282918.0, ans=0.0 2023-10-13 04:46:54,722 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1282918.0, ans=0.0 2023-10-13 04:47:01,981 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1282964.6666666667, ans=0.0 2023-10-13 04:47:02,923 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1282964.6666666667, ans=0.125 2023-10-13 04:47:32,590 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1283058.0, ans=0.025 2023-10-13 04:47:46,715 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.73 vs. limit=15.0 2023-10-13 04:47:50,678 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.373e+02 1.738e+02 1.895e+02 2.074e+02 3.093e+02, threshold=3.789e+02, percent-clipped=0.0 2023-10-13 04:47:58,960 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1283151.3333333333, ans=0.125 2023-10-13 04:47:59,294 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=11.77 vs. limit=22.5 2023-10-13 04:48:01,101 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1283151.3333333333, ans=0.04949747468305833 2023-10-13 04:48:01,139 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1283151.3333333333, ans=0.07 2023-10-13 04:48:53,956 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1283338.0, ans=0.125 2023-10-13 04:48:54,061 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1283338.0, ans=0.125 2023-10-13 04:49:38,462 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.27 vs. 
limit=6.0 2023-10-13 04:49:46,724 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1283524.6666666667, ans=0.125 2023-10-13 04:50:10,399 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.421e+02 1.712e+02 1.899e+02 2.083e+02 2.784e+02, threshold=3.798e+02, percent-clipped=0.0 2023-10-13 04:50:14,635 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1283618.0, ans=0.125 2023-10-13 04:50:38,284 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.12 vs. limit=15.0 2023-10-13 04:51:03,912 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 04:51:05,300 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.38 vs. limit=15.0 2023-10-13 04:51:15,262 INFO [train.py:1031] (3/4) Epoch 21, batch 2000, loss[loss=0.1821, simple_loss=0.2798, pruned_loss=0.04223, over 16312.00 frames. ], tot_loss[loss=0.1899, simple_loss=0.2815, pruned_loss=0.04917, over 20756916.32 frames. ], batch size: 50, lr: 1.66e-03, grad_scale: 32.0 2023-10-13 04:51:27,874 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=1283851.3333333333, ans=0.07 2023-10-13 04:51:28,116 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.65 vs. limit=22.5 2023-10-13 04:52:09,958 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1283944.6666666667, ans=0.0 2023-10-13 04:52:20,103 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1283991.3333333333, ans=0.125 2023-10-13 04:52:27,058 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1283991.3333333333, ans=0.09899494936611666 2023-10-13 04:52:30,167 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1284038.0, ans=0.1 2023-10-13 04:52:37,989 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1284038.0, ans=0.0 2023-10-13 04:52:41,361 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.736e+02 1.861e+02 2.041e+02 3.044e+02, threshold=3.721e+02, percent-clipped=0.0 2023-10-13 04:52:49,100 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 04:53:07,425 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 04:53:44,945 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1284271.3333333333, ans=0.035 2023-10-13 04:53:48,093 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1284271.3333333333, ans=0.0 2023-10-13 04:53:58,724 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1284318.0, ans=0.125 
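
The scaling.py:199 records each report the current value (ans) of one named scheduled hyperparameter: a float scheduled piecewise-linearly in batch_count, which is why at the batch counts in this stretch (around 1.28e6) almost every probability, dropout rate and skip rate has settled at its constant final value (0.125, 0.1, 0.0, and so on). The scaling.py:979 Whitening records compare a module's measured whitening metric, a statistic of its activation covariance, against the limit above which the whitening penalty engages, and the scaling.py:1069 WithLoss records track an auxiliary loss on attention weights, which stays at 0.000e+00 throughout this stretch. A compact sketch of the piecewise-linear scheduling follows; the breakpoints are hypothetical, since the real schedules are set per module in the recipe:

import bisect

class PiecewiseLinear:
    # Value interpolated linearly between (batch_count, value) breakpoints,
    # clamped to the end values outside the breakpoint range.
    def __init__(self, *points):
        xs, ys = zip(*sorted(points))
        self.xs, self.ys = list(xs), list(ys)

    def __call__(self, t: float) -> float:
        if t <= self.xs[0]:
            return self.ys[0]
        if t >= self.xs[-1]:
            return self.ys[-1]
        i = bisect.bisect_right(self.xs, t)
        x0, x1 = self.xs[i - 1], self.xs[i]
        y0, y1 = self.ys[i - 1], self.ys[i]
        return y0 + (y1 - y0) * (t - x0) / (x1 - x0)

# e.g. a skip rate that decays from 0.5 to 0.0 over the first 20k batches is
# pinned at 0.0 by batch_count ~1.28e6, matching the ans=0.0 records above
skip_rate = PiecewiseLinear((0.0, 0.5), (20000.0, 0.0))
print(skip_rate(1284318.0))  # -> 0.0
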
2023-10-13 04:54:28,050 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1284364.6666666667, ans=0.125 2023-10-13 04:54:44,819 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1284411.3333333333, ans=0.0 2023-10-13 04:55:24,376 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.467e+02 1.764e+02 2.072e+02 2.386e+02 3.198e+02, threshold=4.144e+02, percent-clipped=0.0 2023-10-13 04:55:39,707 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 04:55:48,012 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1284598.0, ans=0.125 2023-10-13 04:56:20,149 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1284691.3333333333, ans=0.0 2023-10-13 04:56:20,420 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.57 vs. limit=10.0 2023-10-13 04:56:47,325 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1284784.6666666667, ans=0.0 2023-10-13 04:57:01,103 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.98 vs. limit=15.0 2023-10-13 04:57:08,182 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 04:57:36,066 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1284971.3333333333, ans=0.0 2023-10-13 04:57:36,410 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.57 vs. limit=10.0 2023-10-13 04:57:40,766 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.454e+02 1.833e+02 1.992e+02 2.230e+02 2.711e+02, threshold=3.983e+02, percent-clipped=0.0 2023-10-13 04:57:51,919 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1285018.0, ans=0.0 2023-10-13 04:58:15,483 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1285111.3333333333, ans=6.0 2023-10-13 04:58:44,050 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1285204.6666666667, ans=0.125 2023-10-13 04:58:54,501 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1285251.3333333333, ans=0.125 2023-10-13 04:58:54,818 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.15 vs. 
limit=22.5 2023-10-13 04:59:15,695 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1285298.0, ans=0.125 2023-10-13 04:59:18,698 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1285344.6666666667, ans=0.125 2023-10-13 04:59:26,356 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1285344.6666666667, ans=0.1 2023-10-13 04:59:28,668 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1285344.6666666667, ans=0.125 2023-10-13 04:59:39,876 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 04:59:41,433 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1285391.3333333333, ans=0.2 2023-10-13 04:59:52,252 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.535e+02 1.811e+02 1.973e+02 2.186e+02 2.951e+02, threshold=3.945e+02, percent-clipped=0.0 2023-10-13 05:00:10,575 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.49 vs. limit=15.0 2023-10-13 05:00:13,550 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1285531.3333333333, ans=0.035 2023-10-13 05:00:37,363 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1285578.0, ans=0.0 2023-10-13 05:01:02,228 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1285671.3333333333, ans=0.2 2023-10-13 05:01:24,503 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1285764.6666666667, ans=0.125 2023-10-13 05:01:39,565 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1285811.3333333333, ans=0.0 2023-10-13 05:01:41,268 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1285811.3333333333, ans=0.125 2023-10-13 05:01:53,892 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1285858.0, ans=0.0 2023-10-13 05:02:01,531 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.23 vs. 
limit=15.0 2023-10-13 05:02:11,707 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.492e+02 1.778e+02 1.905e+02 2.097e+02 3.671e+02, threshold=3.810e+02, percent-clipped=0.0 2023-10-13 05:02:22,330 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1285951.3333333333, ans=0.125 2023-10-13 05:02:24,188 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1285951.3333333333, ans=0.125 2023-10-13 05:02:32,429 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1285998.0, ans=0.0 2023-10-13 05:02:33,634 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1285998.0, ans=0.125 2023-10-13 05:02:54,314 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1286091.3333333333, ans=0.125 2023-10-13 05:03:05,948 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.25 vs. limit=15.0 2023-10-13 05:03:06,274 INFO [train.py:1031] (3/4) Epoch 21, batch 2500, loss[loss=0.1905, simple_loss=0.282, pruned_loss=0.04953, over 16908.00 frames. ], tot_loss[loss=0.1901, simple_loss=0.2818, pruned_loss=0.04922, over 23447805.16 frames. ], batch size: 110, lr: 1.66e-03, grad_scale: 32.0 2023-10-13 05:03:12,304 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1286138.0, ans=0.125 2023-10-13 05:03:21,582 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.79 vs. limit=6.0 2023-10-13 05:03:43,284 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1286231.3333333333, ans=0.2 2023-10-13 05:03:52,827 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 05:03:52,869 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1286278.0, ans=0.2 2023-10-13 05:03:56,875 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1286278.0, ans=0.1 2023-10-13 05:03:56,916 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1286278.0, ans=0.0 2023-10-13 05:04:11,192 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1286324.6666666667, ans=0.0 2023-10-13 05:04:17,909 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten.whitening_limit, batch_count=1286371.3333333333, ans=22.5 2023-10-13 05:04:24,377 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.482e+02 1.787e+02 1.900e+02 2.151e+02 3.018e+02, threshold=3.800e+02, percent-clipped=0.0 2023-10-13 05:04:34,512 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.46 vs. 
limit=22.5 2023-10-13 05:04:46,162 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1286464.6666666667, ans=0.2 2023-10-13 05:05:13,190 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.52 vs. limit=22.5 2023-10-13 05:05:16,996 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.14 vs. limit=22.5 2023-10-13 05:05:26,568 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1286604.6666666667, ans=0.025 2023-10-13 05:05:32,846 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1286651.3333333333, ans=0.125 2023-10-13 05:05:41,725 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1286651.3333333333, ans=0.0 2023-10-13 05:06:00,268 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1286744.6666666667, ans=0.125 2023-10-13 05:06:09,784 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1286791.3333333333, ans=0.0 2023-10-13 05:06:14,865 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.20 vs. limit=15.0 2023-10-13 05:06:19,211 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1286791.3333333333, ans=0.125 2023-10-13 05:06:31,906 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1286838.0, ans=0.1 2023-10-13 05:06:32,952 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.365e+02 1.765e+02 1.934e+02 2.156e+02 2.781e+02, threshold=3.868e+02, percent-clipped=0.0 2023-10-13 05:06:37,541 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1286884.6666666667, ans=0.1 2023-10-13 05:06:58,244 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.03 vs. limit=10.0 2023-10-13 05:07:00,078 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1286978.0, ans=0.125 2023-10-13 05:07:07,297 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1286978.0, ans=0.0 2023-10-13 05:07:10,340 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1287024.6666666667, ans=0.05 2023-10-13 05:07:24,428 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.47 vs. 
limit=15.0 2023-10-13 05:07:25,452 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1287071.3333333333, ans=0.0 2023-10-13 05:07:27,027 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1287071.3333333333, ans=0.09899494936611666 2023-10-13 05:07:42,189 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.54 vs. limit=22.5 2023-10-13 05:08:14,970 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_na.min_abs, batch_count=1287211.3333333333, ans=0.02 2023-10-13 05:08:15,152 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.33 vs. limit=10.0 2023-10-13 05:08:17,029 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=1287211.3333333333, ans=15.0 2023-10-13 05:08:29,169 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1287258.0, ans=0.0 2023-10-13 05:08:52,269 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.521e+02 1.760e+02 1.935e+02 2.149e+02 3.036e+02, threshold=3.870e+02, percent-clipped=0.0 2023-10-13 05:09:09,220 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.84 vs. limit=10.0 2023-10-13 05:09:37,531 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1287444.6666666667, ans=0.125 2023-10-13 05:09:39,340 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1287444.6666666667, ans=0.2 2023-10-13 05:09:42,529 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 05:09:52,530 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1287491.3333333333, ans=0.125 2023-10-13 05:10:01,366 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1287538.0, ans=0.125 2023-10-13 05:11:24,890 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.403e+02 1.807e+02 1.980e+02 2.185e+02 3.348e+02, threshold=3.961e+02, percent-clipped=0.0 2023-10-13 05:11:38,029 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1287818.0, ans=0.125 2023-10-13 05:11:46,225 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1287864.6666666667, ans=0.0 2023-10-13 05:11:50,447 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.87 vs. limit=12.0 2023-10-13 05:12:02,298 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1287911.3333333333, ans=0.125 2023-10-13 05:12:31,169 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.30 vs. 
limit=6.0 2023-10-13 05:12:35,221 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1288004.6666666667, ans=0.0 2023-10-13 05:12:38,455 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1288004.6666666667, ans=0.2 2023-10-13 05:12:50,601 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1288051.3333333333, ans=0.125 2023-10-13 05:12:50,984 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.10 vs. limit=15.0 2023-10-13 05:13:16,450 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.96 vs. limit=8.0 2023-10-13 05:13:29,497 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.61 vs. limit=22.5 2023-10-13 05:13:56,376 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1288284.6666666667, ans=0.0 2023-10-13 05:13:57,000 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.497e+02 1.738e+02 1.887e+02 2.157e+02 2.867e+02, threshold=3.774e+02, percent-clipped=0.0 2023-10-13 05:14:44,782 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=6.99 vs. limit=15.0 2023-10-13 05:14:49,895 INFO [train.py:1031] (3/4) Epoch 21, batch 3000, loss[loss=0.1937, simple_loss=0.2863, pruned_loss=0.05057, over 16945.00 frames. ], tot_loss[loss=0.1896, simple_loss=0.2809, pruned_loss=0.04913, over 25524616.90 frames. ], batch size: 165, lr: 1.66e-03, grad_scale: 16.0 2023-10-13 05:14:55,918 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1288471.3333333333, ans=0.5 2023-10-13 05:15:43,520 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1288658.0, ans=0.2 2023-10-13 05:15:56,978 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1288704.6666666667, ans=0.0 2023-10-13 05:15:58,279 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1288704.6666666667, ans=0.125 2023-10-13 05:16:06,113 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1288704.6666666667, ans=0.2 2023-10-13 05:16:09,591 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.364e+02 1.818e+02 2.001e+02 2.173e+02 3.215e+02, threshold=4.001e+02, percent-clipped=0.0 2023-10-13 05:16:20,701 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.00 vs. limit=15.0 2023-10-13 05:16:34,075 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.57 vs. 
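Each Whitening line compares a measured statistic of a module's activations against a scheduled limit (e.g. metric=18.61 vs. limit=22.5 above); while the metric stays below the limit, no corrective signal is applied. A plausible form of such a metric is the unevenness of the feature-covariance spectrum, which equals 1.0 for perfectly "white" (isotropic) features and grows with anisotropy. This is a hedged sketch, not the exact formula in scaling.py.

    import torch

    def whitening_metric(x: torch.Tensor) -> float:
        # x: (num_frames, num_channels) activations from one module
        x = x - x.mean(dim=0, keepdim=True)
        cov = (x.t() @ x) / x.shape[0]        # channel covariance estimate
        eigs = torch.linalg.eigvalsh(cov)     # real eigenvalues, ascending
        # 1.0 iff all eigenvalues are equal; grows as the spectrum spreads out
        return float((eigs ** 2).mean() / eigs.mean() ** 2)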
limit=22.5 2023-10-13 05:16:43,025 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1288844.6666666667, ans=0.125 2023-10-13 05:17:16,852 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.91 vs. limit=10.0 2023-10-13 05:18:09,033 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1289124.6666666667, ans=0.125 2023-10-13 05:18:24,392 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.502e+02 1.783e+02 1.987e+02 2.190e+02 2.624e+02, threshold=3.975e+02, percent-clipped=0.0 2023-10-13 05:18:25,099 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.90 vs. limit=15.0 2023-10-13 05:18:49,220 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1289264.6666666667, ans=0.1 2023-10-13 05:18:54,112 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.04 vs. limit=15.0 2023-10-13 05:18:55,022 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1289311.3333333333, ans=0.2 2023-10-13 05:19:07,341 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1289311.3333333333, ans=10.0 2023-10-13 05:19:11,609 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1289358.0, ans=0.125 2023-10-13 05:19:25,865 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1289404.6666666667, ans=0.0 2023-10-13 05:20:06,561 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1289544.6666666667, ans=0.0 2023-10-13 05:20:08,367 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1289544.6666666667, ans=0.125 2023-10-13 05:20:36,725 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.67 vs. limit=15.0 2023-10-13 05:21:00,369 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.416e+02 1.786e+02 1.918e+02 2.092e+02 3.242e+02, threshold=3.836e+02, percent-clipped=0.0 2023-10-13 05:21:24,965 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.81 vs. limit=15.0 2023-10-13 05:21:28,665 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.43 vs. limit=15.0 2023-10-13 05:21:32,400 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1289778.0, ans=0.07 2023-10-13 05:22:05,511 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.65 vs. 
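The balancer entries (min_positive, max_positive, min_abs=0.5, max_abs=10.0, prob=0.125 above) describe per-channel constraints on activation statistics: roughly, the fraction of positive values and the mean magnitude should stay within the logged bounds, with prob the chance of enforcing them on a given batch. The real module nudges gradients toward compliance; the sketch below only checks the constraints, and its name and defaults (mirroring ans values seen in this log) are illustrative.

    import torch

    def balancer_check(x, min_positive=0.05, max_positive=0.95,
                       min_abs=0.5, max_abs=10.0):
        # x: (num_frames, num_channels) activations
        frac_pos = (x > 0).float().mean(dim=0)   # fraction of positive values
        mean_abs = x.abs().mean(dim=0)           # mean magnitude per channel
        return {
            "frac_pos_low":  frac_pos < min_positive,
            "frac_pos_high": frac_pos > max_positive,
            "abs_low":       mean_abs < min_abs,
            "abs_high":      mean_abs > max_abs,
        }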
limit=12.0 2023-10-13 05:22:22,462 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1289918.0, ans=0.0 2023-10-13 05:23:32,630 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.490e+02 1.783e+02 1.952e+02 2.161e+02 3.717e+02, threshold=3.903e+02, percent-clipped=0.0 2023-10-13 05:23:50,169 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1290198.0, ans=0.2 2023-10-13 05:24:14,548 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1290291.3333333333, ans=0.125 2023-10-13 05:25:21,967 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1290524.6666666667, ans=0.0 2023-10-13 05:25:37,230 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1290571.3333333333, ans=0.0 2023-10-13 05:25:49,844 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.488e+02 1.773e+02 1.995e+02 2.272e+02 3.151e+02, threshold=3.990e+02, percent-clipped=0.0 2023-10-13 05:25:58,354 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1290618.0, ans=0.125 2023-10-13 05:26:08,478 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1290664.6666666667, ans=0.125 2023-10-13 05:26:19,591 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1290711.3333333333, ans=0.125 2023-10-13 05:26:26,458 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1290711.3333333333, ans=0.2 2023-10-13 05:26:39,755 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1290758.0, ans=0.0 2023-10-13 05:26:44,835 INFO [train.py:1031] (3/4) Epoch 21, batch 3500, loss[loss=0.2102, simple_loss=0.3007, pruned_loss=0.05988, over 16916.00 frames. ], tot_loss[loss=0.1897, simple_loss=0.2809, pruned_loss=0.04924, over 27134282.35 frames. ], batch size: 110, lr: 1.66e-03, grad_scale: 32.0 2023-10-13 05:26:50,979 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.09 vs. limit=15.0 2023-10-13 05:27:33,415 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1290944.6666666667, ans=0.0 2023-10-13 05:27:37,683 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.44 vs. 
limit=15.0 2023-10-13 05:27:40,390 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1290991.3333333333, ans=0.0 2023-10-13 05:27:59,367 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1291084.6666666667, ans=0.1 2023-10-13 05:28:00,756 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.526e+02 1.754e+02 1.916e+02 2.165e+02 2.867e+02, threshold=3.831e+02, percent-clipped=0.0 2023-10-13 05:28:58,394 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1291224.6666666667, ans=0.0 2023-10-13 05:29:14,199 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.54 vs. limit=10.0 2023-10-13 05:29:17,722 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1291271.3333333333, ans=0.125 2023-10-13 05:29:27,041 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.25 vs. limit=15.0 2023-10-13 05:29:59,995 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1291411.3333333333, ans=0.2 2023-10-13 05:30:05,570 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1291411.3333333333, ans=0.125 2023-10-13 05:30:06,708 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1291411.3333333333, ans=0.2 2023-10-13 05:30:07,838 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1291411.3333333333, ans=0.1 2023-10-13 05:30:12,856 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1291458.0, ans=0.09899494936611666 2023-10-13 05:30:38,844 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1291504.6666666667, ans=0.2 2023-10-13 05:30:43,240 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.428e+02 1.740e+02 1.880e+02 2.019e+02 3.193e+02, threshold=3.761e+02, percent-clipped=0.0 2023-10-13 05:31:02,101 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1291598.0, ans=0.125 2023-10-13 05:31:32,251 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1291691.3333333333, ans=0.0 2023-10-13 05:32:03,770 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1291784.6666666667, ans=0.0 2023-10-13 05:32:39,592 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1291878.0, ans=0.2 2023-10-13 05:32:52,060 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1291878.0, ans=0.09899494936611666 2023-10-13 05:33:17,300 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1291971.3333333333, ans=0.1 2023-10-13 05:33:22,721 INFO [optim.py:471] (3/4) 
Clipping_scale=2.0, grad-norm quartiles 1.461e+02 1.710e+02 1.951e+02 2.172e+02 3.159e+02, threshold=3.902e+02, percent-clipped=0.0 2023-10-13 05:33:29,514 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1292018.0, ans=0.0 2023-10-13 05:33:51,517 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=3.93 vs. limit=15.0 2023-10-13 05:34:07,258 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1292158.0, ans=0.125 2023-10-13 05:34:16,304 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1292204.6666666667, ans=0.0 2023-10-13 05:34:46,735 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1292298.0, ans=0.1 2023-10-13 05:34:53,766 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1292298.0, ans=0.125 2023-10-13 05:34:58,180 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1292344.6666666667, ans=0.1 2023-10-13 05:35:10,588 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1292391.3333333333, ans=0.2 2023-10-13 05:35:11,956 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.71 vs. limit=15.0 2023-10-13 05:35:38,555 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.389e+02 1.773e+02 1.929e+02 2.163e+02 3.187e+02, threshold=3.857e+02, percent-clipped=0.0 2023-10-13 05:35:49,890 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1292484.6666666667, ans=0.0 2023-10-13 05:36:17,224 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.45 vs. limit=15.0 2023-10-13 05:36:30,151 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1292671.3333333333, ans=0.1 2023-10-13 05:36:33,533 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.65 vs. limit=15.0 2023-10-13 05:36:51,248 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1292718.0, ans=0.125 2023-10-13 05:36:54,392 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1292718.0, ans=0.2 2023-10-13 05:37:23,802 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=3.87 vs. 
limit=10.0 2023-10-13 05:37:47,095 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.418e+02 1.755e+02 1.939e+02 2.226e+02 3.420e+02, threshold=3.879e+02, percent-clipped=0.0 2023-10-13 05:37:48,417 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1292951.3333333333, ans=0.2 2023-10-13 05:38:18,892 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1293044.6666666667, ans=0.1 2023-10-13 05:38:38,094 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.69 vs. limit=10.0 2023-10-13 05:38:43,472 INFO [train.py:1031] (3/4) Epoch 21, batch 4000, loss[loss=0.2199, simple_loss=0.3071, pruned_loss=0.06641, over 15710.00 frames. ], tot_loss[loss=0.1896, simple_loss=0.2806, pruned_loss=0.04928, over 28394729.69 frames. ], batch size: 350, lr: 1.65e-03, grad_scale: 32.0 2023-10-13 05:39:02,060 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1293184.6666666667, ans=0.0 2023-10-13 05:39:03,784 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1293184.6666666667, ans=0.125 2023-10-13 05:39:06,420 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1293184.6666666667, ans=0.0 2023-10-13 05:39:12,833 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1293231.3333333333, ans=0.125 2023-10-13 05:39:24,730 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1293278.0, ans=0.125 2023-10-13 05:39:47,568 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1293324.6666666667, ans=0.125 2023-10-13 05:39:58,901 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.67 vs. 
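In the train.py progress lines, the first loss[...] block is the current batch (here over 15710 frames) while tot_loss[...] is reported over a much larger frame count (28.4M at batch 4000) that grows slowly between log points, consistent with a decayed, frame-weighted running average of recent batches. A minimal sketch of that interpretation; the decay constant and class are assumptions, not the actual train.py bookkeeping.

    class RunningLoss:
        # Decayed, frame-weighted average; 'decay' is an assumed constant.
        def __init__(self, decay=0.999):
            self.decay, self.loss_sum, self.frame_sum = decay, 0.0, 0.0

        def update(self, batch_loss, batch_frames):
            self.loss_sum = self.decay * self.loss_sum + batch_loss * batch_frames
            self.frame_sum = self.decay * self.frame_sum + batch_frames
            return self.loss_sum / self.frame_sum   # the reported tot_loss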
limit=15.0 2023-10-13 05:40:00,633 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1293371.3333333333, ans=0.2 2023-10-13 05:40:07,983 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1293418.0, ans=0.125 2023-10-13 05:40:10,246 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1293418.0, ans=0.0 2023-10-13 05:40:10,962 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.587e+02 1.804e+02 1.922e+02 2.196e+02 3.017e+02, threshold=3.844e+02, percent-clipped=0.0 2023-10-13 05:40:18,065 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1293418.0, ans=0.0 2023-10-13 05:40:30,267 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1293464.6666666667, ans=0.125 2023-10-13 05:40:31,163 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1293464.6666666667, ans=0.125 2023-10-13 05:40:32,705 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.50 vs. limit=12.0 2023-10-13 05:41:50,016 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1293744.6666666667, ans=0.125 2023-10-13 05:42:05,972 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1293791.3333333333, ans=0.04949747468305833 2023-10-13 05:42:09,296 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1293791.3333333333, ans=0.125 2023-10-13 05:42:10,195 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1293791.3333333333, ans=0.1 2023-10-13 05:42:16,323 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1293838.0, ans=0.0 2023-10-13 05:42:34,826 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.whiten.whitening_limit, batch_count=1293884.6666666667, ans=12.0 2023-10-13 05:42:36,769 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.528e+02 1.826e+02 2.030e+02 2.272e+02 3.180e+02, threshold=4.060e+02, percent-clipped=0.0 2023-10-13 05:43:46,741 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1294071.3333333333, ans=0.015 2023-10-13 05:43:51,047 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.80 vs. 
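The *_skip_rate and layerdrop_rate entries (e.g. bypass.skip_rate ans=0.0495 and encoder_embed.convnext.layerdrop_rate ans=0.015 above) suggest layerdrop-style regularization: a submodule on the residual path is stochastically bypassed during training with the logged probability. A hedged sketch of the mechanism, not icefall's implementation:

    import torch

    def maybe_skip(module, x, skip_rate, training=True):
        # With probability skip_rate, bypass the submodule entirely this step;
        # otherwise apply it on the residual path as usual.
        if training and float(torch.rand(())) < skip_rate:
            return x
        return x + module(x)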
limit=15.0 2023-10-13 05:43:54,053 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1294118.0, ans=0.125 2023-10-13 05:44:00,545 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1294118.0, ans=0.125 2023-10-13 05:44:29,737 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1294211.3333333333, ans=0.1 2023-10-13 05:44:45,627 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1294258.0, ans=0.1 2023-10-13 05:44:45,671 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1294258.0, ans=0.125 2023-10-13 05:44:58,728 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1294304.6666666667, ans=0.125 2023-10-13 05:45:03,445 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.92 vs. limit=15.0 2023-10-13 05:45:08,912 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.427e+02 1.744e+02 1.978e+02 2.232e+02 3.307e+02, threshold=3.957e+02, percent-clipped=0.0 2023-10-13 05:45:19,349 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1294398.0, ans=0.09899494936611666 2023-10-13 05:45:20,602 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1294398.0, ans=0.125 2023-10-13 05:45:24,647 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1294398.0, ans=0.125 2023-10-13 05:45:28,098 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.34 vs. limit=22.5 2023-10-13 05:45:34,866 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=3.95 vs. limit=15.0 2023-10-13 05:45:42,745 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1294491.3333333333, ans=0.0 2023-10-13 05:45:44,644 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1294491.3333333333, ans=0.125 2023-10-13 05:46:02,531 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1294538.0, ans=0.2 2023-10-13 05:46:04,834 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.39 vs. limit=15.0 2023-10-13 05:46:19,681 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1294584.6666666667, ans=0.125 2023-10-13 05:46:24,236 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.52 vs. 
limit=15.0 2023-10-13 05:46:26,075 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.12 vs. limit=6.0 2023-10-13 05:46:28,581 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 05:46:46,527 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1294678.0, ans=0.0 2023-10-13 05:46:58,045 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1294724.6666666667, ans=0.2 2023-10-13 05:47:20,586 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.499e+02 1.881e+02 2.068e+02 2.321e+02 3.179e+02, threshold=4.137e+02, percent-clipped=0.0 2023-10-13 05:47:32,241 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1294864.6666666667, ans=0.0 2023-10-13 05:47:35,204 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1294864.6666666667, ans=0.125 2023-10-13 05:47:42,814 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1294864.6666666667, ans=0.125 2023-10-13 05:47:51,325 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=3.83 vs. limit=10.0 2023-10-13 05:48:06,759 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=1294958.0, ans=0.05 2023-10-13 05:48:10,960 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=4.73 vs. 
limit=15.0 2023-10-13 05:48:30,144 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1295051.3333333333, ans=0.09899494936611666 2023-10-13 05:48:38,793 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1295098.0, ans=0.1 2023-10-13 05:48:39,697 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1295098.0, ans=0.125 2023-10-13 05:48:48,249 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1295098.0, ans=0.0 2023-10-13 05:49:08,294 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1295144.6666666667, ans=0.125 2023-10-13 05:49:20,521 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1295191.3333333333, ans=0.125 2023-10-13 05:49:36,983 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=1295238.0, ans=10.0 2023-10-13 05:49:41,671 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 05:49:46,502 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.601e+02 1.867e+02 2.052e+02 2.160e+02 3.021e+02, threshold=4.103e+02, percent-clipped=0.0 2023-10-13 05:49:47,809 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=5.05 vs. limit=12.0 2023-10-13 05:50:44,202 INFO [train.py:1031] (3/4) Epoch 21, batch 4500, loss[loss=0.2249, simple_loss=0.2947, pruned_loss=0.07757, over 15716.00 frames. ], tot_loss[loss=0.1895, simple_loss=0.2809, pruned_loss=0.04906, over 29380017.10 frames. ], batch size: 350, lr: 1.65e-03, grad_scale: 32.0 2023-10-13 05:50:51,165 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1295471.3333333333, ans=0.125 2023-10-13 05:50:59,190 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1295518.0, ans=0.125 2023-10-13 05:51:24,545 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 05:51:41,276 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1295611.3333333333, ans=0.04949747468305833 2023-10-13 05:51:41,396 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1295611.3333333333, ans=0.125 2023-10-13 05:51:50,475 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.40 vs. 
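The WithLoss lines (loss-sum=0.000e+00 above) suggest an auxiliary penalty attached to the attention weights that is currently contributing nothing. One generic way to attach such a penalty without changing the forward value is an identity autograd function that injects the penalty's gradient on the backward pass; the class and the sum-of-squares penalty below are illustrative assumptions, not the actual scaling.py code.

    import torch

    class AttachAuxLoss(torch.autograd.Function):
        # Identity in the forward pass; backward adds the gradient of
        # weight * sum(x**2) on top of the upstream gradient.
        @staticmethod
        def forward(ctx, x, weight):
            ctx.weight = weight
            ctx.save_for_backward(x)
            return x

        @staticmethod
        def backward(ctx, grad_out):
            (x,) = ctx.saved_tensors
            return grad_out + ctx.weight * 2.0 * x, None

    # usage: attn_weights = AttachAuxLoss.apply(attn_weights, 1e-4)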
limit=15.0 2023-10-13 05:51:51,118 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1295658.0, ans=0.2 2023-10-13 05:51:58,590 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1295704.6666666667, ans=0.125 2023-10-13 05:52:18,405 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.520e+02 1.737e+02 1.943e+02 2.144e+02 2.579e+02, threshold=3.886e+02, percent-clipped=0.0 2023-10-13 05:52:22,613 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1295751.3333333333, ans=0.1 2023-10-13 05:52:41,073 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.72 vs. limit=6.0 2023-10-13 05:52:53,349 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1295891.3333333333, ans=0.125 2023-10-13 05:52:54,392 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1295891.3333333333, ans=0.95 2023-10-13 05:53:01,102 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1295891.3333333333, ans=0.0 2023-10-13 05:53:06,465 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1295938.0, ans=0.125 2023-10-13 05:53:08,271 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1295938.0, ans=0.0 2023-10-13 05:53:17,941 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1295984.6666666667, ans=0.1 2023-10-13 05:53:46,373 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1296078.0, ans=0.125 2023-10-13 05:53:52,454 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1296078.0, ans=0.1 2023-10-13 05:54:07,949 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1296124.6666666667, ans=0.125 2023-10-13 05:54:24,184 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.78 vs. limit=12.0 2023-10-13 05:54:27,697 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.467e+02 1.795e+02 1.940e+02 2.155e+02 2.996e+02, threshold=3.881e+02, percent-clipped=0.0 2023-10-13 05:54:28,079 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1296218.0, ans=0.125 2023-10-13 05:54:28,082 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 05:54:31,456 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.35 vs. 
limit=15.0 2023-10-13 05:54:45,473 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1296264.6666666667, ans=0.0 2023-10-13 05:54:46,849 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1296264.6666666667, ans=0.1 2023-10-13 05:55:39,088 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1296451.3333333333, ans=0.125 2023-10-13 05:55:44,921 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1296451.3333333333, ans=0.125 2023-10-13 05:55:50,555 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1296498.0, ans=0.2 2023-10-13 05:55:50,565 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1296498.0, ans=0.125 2023-10-13 05:56:07,704 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1296544.6666666667, ans=10.0 2023-10-13 05:56:09,481 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.77 vs. limit=22.5 2023-10-13 05:56:18,379 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1296544.6666666667, ans=0.0 2023-10-13 05:56:29,409 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1296591.3333333333, ans=0.2 2023-10-13 05:56:50,534 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.467e+02 1.808e+02 1.932e+02 2.125e+02 2.753e+02, threshold=3.864e+02, percent-clipped=0.0 2023-10-13 05:57:02,786 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1296731.3333333333, ans=0.125 2023-10-13 05:57:16,550 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1296778.0, ans=0.125 2023-10-13 05:57:27,261 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.81 vs. limit=22.5 2023-10-13 05:57:46,578 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=4.24 vs. 
limit=12.0 2023-10-13 05:57:49,771 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1296918.0, ans=0.125 2023-10-13 05:58:35,488 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1297011.3333333333, ans=0.125 2023-10-13 05:59:08,067 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.516e+02 1.739e+02 1.898e+02 2.103e+02 2.651e+02, threshold=3.796e+02, percent-clipped=0.0 2023-10-13 05:59:17,108 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1297151.3333333333, ans=0.0 2023-10-13 05:59:31,866 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1297198.0, ans=0.0 2023-10-13 06:00:11,824 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1297338.0, ans=0.0 2023-10-13 06:00:19,093 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.05 vs. limit=12.0 2023-10-13 06:00:27,374 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1297384.6666666667, ans=0.0 2023-10-13 06:00:46,296 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1297431.3333333333, ans=0.0 2023-10-13 06:00:53,543 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1297478.0, ans=0.05 2023-10-13 06:01:05,236 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten.whitening_limit, batch_count=1297524.6666666667, ans=15.0 2023-10-13 06:01:19,918 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.22 vs. limit=15.0 2023-10-13 06:01:38,899 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1297618.0, ans=0.1 2023-10-13 06:01:42,299 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.452e+02 1.754e+02 1.946e+02 2.192e+02 2.891e+02, threshold=3.892e+02, percent-clipped=0.0 2023-10-13 06:01:58,656 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1297664.6666666667, ans=0.0 2023-10-13 06:02:03,893 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1297664.6666666667, ans=0.1 2023-10-13 06:02:36,575 INFO [train.py:1031] (3/4) Epoch 21, batch 5000, loss[loss=0.1837, simple_loss=0.278, pruned_loss=0.04466, over 16835.00 frames. ], tot_loss[loss=0.1894, simple_loss=0.2806, pruned_loss=0.0491, over 30156606.18 frames. 
], batch size: 116, lr: 1.65e-03, grad_scale: 16.0 2023-10-13 06:03:29,575 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1297944.6666666667, ans=0.1 2023-10-13 06:03:34,534 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1297944.6666666667, ans=0.2 2023-10-13 06:04:10,351 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.56 vs. limit=22.5 2023-10-13 06:04:19,609 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1298084.6666666667, ans=0.125 2023-10-13 06:04:23,548 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1298084.6666666667, ans=0.125 2023-10-13 06:04:25,199 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.475e+02 1.823e+02 2.046e+02 2.302e+02 3.322e+02, threshold=4.093e+02, percent-clipped=0.0 2023-10-13 06:04:40,292 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.09 vs. limit=22.5 2023-10-13 06:04:57,335 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1298178.0, ans=0.2 2023-10-13 06:05:20,831 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1298224.6666666667, ans=0.1 2023-10-13 06:05:34,659 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1298271.3333333333, ans=0.1 2023-10-13 06:06:34,920 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.24 vs. limit=15.0 2023-10-13 06:06:49,923 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.42 vs. limit=15.0 2023-10-13 06:06:58,808 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1298504.6666666667, ans=0.1 2023-10-13 06:06:58,893 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1298504.6666666667, ans=0.09899494936611666 2023-10-13 06:07:00,198 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.27 vs. limit=6.0 2023-10-13 06:07:11,958 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1298551.3333333333, ans=0.1 2023-10-13 06:07:12,597 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.370e+02 1.795e+02 2.030e+02 2.250e+02 3.958e+02, threshold=4.061e+02, percent-clipped=0.0 2023-10-13 06:07:22,952 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=9.97 vs. limit=22.5 2023-10-13 06:07:31,460 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.60 vs. 
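Across these train.py loss lines the total satisfies loss = 0.5 * simple_loss + pruned_loss; the 0.5 weighting is inferred from the logged numbers themselves, e.g. for batch 5000 above:

    # batch 5000 above:   0.5 * 0.278  + 0.04466 = 0.18366  ~ loss=0.1837
    # its running total:  0.5 * 0.2806 + 0.0491  = 0.1894   = tot_loss=0.1894
    assert abs(0.5 * 0.278 + 0.04466 - 0.1837) < 5e-4
    assert abs(0.5 * 0.2806 + 0.0491 - 0.1894) < 5e-4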
limit=6.0 2023-10-13 06:07:37,759 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=1298644.6666666667, ans=0.5 2023-10-13 06:08:29,427 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1298784.6666666667, ans=0.0 2023-10-13 06:08:35,601 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.67 vs. limit=12.0 2023-10-13 06:09:18,110 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1298924.6666666667, ans=0.0 2023-10-13 06:09:42,145 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1299018.0, ans=0.2 2023-10-13 06:09:43,486 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1299018.0, ans=0.1 2023-10-13 06:09:43,533 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1299018.0, ans=0.125 2023-10-13 06:09:46,306 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.412e+02 1.777e+02 1.996e+02 2.257e+02 2.950e+02, threshold=3.992e+02, percent-clipped=0.0 2023-10-13 06:10:03,760 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1299064.6666666667, ans=0.2 2023-10-13 06:10:44,522 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1299204.6666666667, ans=0.125 2023-10-13 06:11:04,880 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1299251.3333333333, ans=0.2 2023-10-13 06:11:11,765 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1299298.0, ans=0.1 2023-10-13 06:11:27,731 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1299344.6666666667, ans=0.0 2023-10-13 06:12:15,686 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1299484.6666666667, ans=0.125 2023-10-13 06:12:24,546 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.435e+02 1.663e+02 1.793e+02 1.987e+02 2.969e+02, threshold=3.586e+02, percent-clipped=0.0 2023-10-13 06:12:45,675 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1299578.0, ans=0.1 2023-10-13 06:12:46,844 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.31 vs. 
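The bypass.scale_min and bypass_mid.scale_min entries (ans=0.2 above) point to a learned bypass: each block mixes its input and output with a learned per-channel scale clamped from below, so once past warm-up a block can never be skipped entirely. A hedged sketch of that combination; the function and clamp range are illustrative.

    import torch

    def bypass_combine(x, y, scale, scale_min=0.2, scale_max=1.0):
        # x: block input, y: block output, scale: learned per-channel weight.
        # Clamping from below (scale_min, logged above as ans=0.2) keeps every
        # block at least partially active.
        s = scale.clamp(min=scale_min, max=scale_max)
        return x + s * (y - x)   # s -> 0: pure bypass; s -> 1: full output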
limit=15.0 2023-10-13 06:13:12,151 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1299624.6666666667, ans=0.125 2023-10-13 06:13:14,952 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1299624.6666666667, ans=0.1 2023-10-13 06:13:15,026 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1299624.6666666667, ans=0.125 2023-10-13 06:13:59,466 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.76 vs. limit=15.0 2023-10-13 06:14:00,215 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1299764.6666666667, ans=0.125 2023-10-13 06:14:24,766 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1299858.0, ans=0.125 2023-10-13 06:14:36,937 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1299858.0, ans=0.1 2023-10-13 06:14:40,716 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1299904.6666666667, ans=0.1 2023-10-13 06:15:07,457 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.480e+02 1.775e+02 1.923e+02 2.083e+02 2.868e+02, threshold=3.847e+02, percent-clipped=0.0 2023-10-13 06:15:18,181 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1299998.0, ans=0.125 2023-10-13 06:15:28,756 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=1300044.6666666667, ans=15.0 2023-10-13 06:15:55,190 INFO [train.py:1031] (3/4) Epoch 21, batch 5500, loss[loss=0.1728, simple_loss=0.2655, pruned_loss=0.04004, over 16411.00 frames. ], tot_loss[loss=0.1893, simple_loss=0.2806, pruned_loss=0.04906, over 30738662.52 frames. ], batch size: 50, lr: 1.65e-03, grad_scale: 8.0 2023-10-13 06:16:06,261 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=3.87 vs. limit=10.0 2023-10-13 06:16:06,300 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.73 vs. limit=22.5 2023-10-13 06:16:51,899 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1300324.6666666667, ans=0.125 2023-10-13 06:17:03,895 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.11 vs. 
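The grad_scale values in these train.py lines (8.0 at batch 5500 here, 16.0 and 32.0 at neighbouring log points) move exactly the way PyTorch's dynamic loss scaling does under fp16 training: the scale is halved after an overflow and creeps back up otherwise. A standard AMP step using the stock torch API is sketched below; the function name and argument wiring are assumptions about how it is driven, not icefall's actual loop.

    import torch

    def train_step_amp(model, optimizer, scaler, batch):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = model(batch)
        scaler.scale(loss).backward()
        scaler.step(optimizer)     # step is skipped if grads contain inf/NaN
        scaler.update()            # halves the scale on overflow, grows it otherwise
        return scaler.get_scale()  # analogue of the logged grad_scale

    # scaler = torch.cuda.amp.GradScaler(), created once before the loop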
limit=15.0 2023-10-13 06:17:26,123 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.412e+02 1.746e+02 1.896e+02 2.183e+02 3.374e+02, threshold=3.792e+02, percent-clipped=0.0 2023-10-13 06:17:42,218 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1300464.6666666667, ans=0.125 2023-10-13 06:17:55,938 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1300511.3333333333, ans=0.0 2023-10-13 06:17:56,919 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1300511.3333333333, ans=0.0 2023-10-13 06:18:00,048 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 06:18:11,290 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1300558.0, ans=0.125 2023-10-13 06:18:21,325 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=1300604.6666666667, ans=22.5 2023-10-13 06:19:25,247 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1300791.3333333333, ans=0.0 2023-10-13 06:19:27,654 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1300791.3333333333, ans=0.0 2023-10-13 06:19:58,533 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.96 vs. limit=22.5 2023-10-13 06:20:03,348 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.452e+02 1.760e+02 1.913e+02 2.199e+02 3.182e+02, threshold=3.827e+02, percent-clipped=0.0 2023-10-13 06:20:12,405 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1300931.3333333333, ans=0.125 2023-10-13 06:20:13,619 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.86 vs. limit=22.5 2023-10-13 06:20:17,325 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1300931.3333333333, ans=0.0 2023-10-13 06:20:22,786 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1300931.3333333333, ans=0.07 2023-10-13 06:20:33,108 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1300978.0, ans=0.125 2023-10-13 06:20:45,484 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1301024.6666666667, ans=0.125 2023-10-13 06:20:54,513 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1301024.6666666667, ans=0.0 2023-10-13 06:21:17,152 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=1301118.0, ans=10.0 2023-10-13 06:21:25,109 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.20 vs. 
limit=22.5 2023-10-13 06:21:58,664 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.84 vs. limit=15.0 2023-10-13 06:22:17,842 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1301258.0, ans=0.5 2023-10-13 06:22:30,038 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1301304.6666666667, ans=0.0 2023-10-13 06:22:40,916 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.545e+02 1.817e+02 1.978e+02 2.194e+02 2.878e+02, threshold=3.955e+02, percent-clipped=0.0 2023-10-13 06:22:56,239 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1301398.0, ans=0.0 2023-10-13 06:22:56,510 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.53 vs. limit=10.0 2023-10-13 06:23:01,780 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1301444.6666666667, ans=0.0 2023-10-13 06:23:09,767 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1301444.6666666667, ans=0.0 2023-10-13 06:23:35,166 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1301538.0, ans=0.125 2023-10-13 06:23:37,053 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1301538.0, ans=0.125 2023-10-13 06:23:51,997 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.22 vs. limit=15.0 2023-10-13 06:24:01,879 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1301631.3333333333, ans=0.125 2023-10-13 06:24:44,116 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1301724.6666666667, ans=0.1 2023-10-13 06:25:14,943 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1301818.0, ans=0.125 2023-10-13 06:25:21,314 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.464e+02 1.713e+02 1.903e+02 2.182e+02 3.667e+02, threshold=3.807e+02, percent-clipped=0.0 2023-10-13 06:26:26,359 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1302004.6666666667, ans=0.0 2023-10-13 06:26:27,414 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1302004.6666666667, ans=0.125 2023-10-13 06:27:05,556 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.17 vs. 
limit=15.0 2023-10-13 06:27:23,201 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1302191.3333333333, ans=0.125 2023-10-13 06:27:33,140 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1302191.3333333333, ans=0.1 2023-10-13 06:28:02,086 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1302284.6666666667, ans=0.1 2023-10-13 06:28:02,787 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.469e+02 1.810e+02 2.076e+02 2.283e+02 3.214e+02, threshold=4.152e+02, percent-clipped=0.0 2023-10-13 06:28:16,071 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1302331.3333333333, ans=0.0 2023-10-13 06:28:29,005 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1302378.0, ans=0.2 2023-10-13 06:28:59,774 INFO [train.py:1031] (3/4) Epoch 21, batch 6000, loss[loss=0.1903, simple_loss=0.284, pruned_loss=0.04836, over 16882.00 frames. ], tot_loss[loss=0.1897, simple_loss=0.2807, pruned_loss=0.04934, over 31151172.62 frames. ], batch size: 165, lr: 1.65e-03, grad_scale: 32.0 2023-10-13 06:29:01,102 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1302471.3333333333, ans=0.2 2023-10-13 06:29:37,511 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1302564.6666666667, ans=0.125 2023-10-13 06:30:46,779 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.529e+02 1.819e+02 1.986e+02 2.196e+02 3.283e+02, threshold=3.972e+02, percent-clipped=0.0 2023-10-13 06:31:06,366 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1302798.0, ans=0.1 2023-10-13 06:31:48,642 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 06:32:04,878 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1302984.6666666667, ans=0.0 2023-10-13 06:32:41,302 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.79 vs. limit=15.0 2023-10-13 06:33:06,748 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1303124.6666666667, ans=0.1 2023-10-13 06:33:11,269 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 06:33:12,649 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.27 vs. 
limit=15.0 2023-10-13 06:33:51,813 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.432e+02 1.815e+02 1.942e+02 2.124e+02 2.826e+02, threshold=3.884e+02, percent-clipped=0.0 2023-10-13 06:33:52,748 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1303218.0, ans=0.125 2023-10-13 06:34:02,177 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1303264.6666666667, ans=0.1 2023-10-13 06:34:20,283 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1303311.3333333333, ans=0.125 2023-10-13 06:34:20,656 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.52 vs. limit=22.5 2023-10-13 06:34:45,883 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.95 vs. limit=6.0 2023-10-13 06:35:11,186 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1303404.6666666667, ans=0.0 2023-10-13 06:35:18,202 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.14 vs. limit=15.0 2023-10-13 06:35:25,595 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1303451.3333333333, ans=0.125 2023-10-13 06:35:34,858 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.70 vs. limit=15.0 2023-10-13 06:35:43,715 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1303498.0, ans=0.125 2023-10-13 06:35:54,394 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.55 vs. 
limit=15.0 2023-10-13 06:35:59,686 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1303544.6666666667, ans=0.125 2023-10-13 06:36:09,708 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 06:36:41,295 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1303638.0, ans=0.0 2023-10-13 06:36:41,300 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 06:36:43,544 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1303684.6666666667, ans=0.125 2023-10-13 06:36:50,395 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1303684.6666666667, ans=0.07 2023-10-13 06:36:51,261 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1303684.6666666667, ans=0.125 2023-10-13 06:36:58,353 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 06:37:00,963 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.536e+02 1.815e+02 2.072e+02 2.385e+02 3.282e+02, threshold=4.143e+02, percent-clipped=0.0 2023-10-13 06:37:57,691 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1303824.6666666667, ans=0.0 2023-10-13 06:38:06,681 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1303824.6666666667, ans=0.125 2023-10-13 06:38:12,910 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 06:38:15,743 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.15 vs. limit=15.0 2023-10-13 06:38:24,892 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1303871.3333333333, ans=0.125 2023-10-13 06:38:40,678 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1303918.0, ans=0.0 2023-10-13 06:39:49,083 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1304104.6666666667, ans=0.1 2023-10-13 06:39:50,324 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.54 vs. limit=15.0 2023-10-13 06:40:10,005 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.03 vs. 
limit=15.0 2023-10-13 06:40:16,988 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1304151.3333333333, ans=0.07 2023-10-13 06:40:23,953 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.614e+02 1.844e+02 1.968e+02 2.249e+02 3.506e+02, threshold=3.936e+02, percent-clipped=0.0 2023-10-13 06:40:28,493 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1304198.0, ans=0.125 2023-10-13 06:40:38,280 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=1304198.0, ans=15.0 2023-10-13 06:40:38,341 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.34 vs. limit=15.0 2023-10-13 06:40:47,763 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.42 vs. limit=15.0 2023-10-13 06:41:22,467 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1304291.3333333333, ans=0.125 2023-10-13 06:42:17,303 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1304431.3333333333, ans=0.125 2023-10-13 06:43:32,110 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.515e+02 1.734e+02 1.914e+02 2.144e+02 2.780e+02, threshold=3.829e+02, percent-clipped=0.0 2023-10-13 06:44:03,469 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1304664.6666666667, ans=0.1 2023-10-13 06:44:37,339 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1304758.0, ans=0.125 2023-10-13 06:44:48,420 INFO [train.py:1031] (3/4) Epoch 21, batch 6500, loss[loss=0.1968, simple_loss=0.2959, pruned_loss=0.04881, over 16925.00 frames. ], tot_loss[loss=0.1903, simple_loss=0.2813, pruned_loss=0.04959, over 31491957.60 frames. ], batch size: 165, lr: 1.65e-03, grad_scale: 16.0 2023-10-13 06:44:52,429 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=1304804.6666666667, ans=0.05 2023-10-13 06:46:51,170 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1304991.3333333333, ans=0.125 2023-10-13 06:46:54,297 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1305038.0, ans=0.125 2023-10-13 06:47:29,650 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.476e+02 1.811e+02 2.008e+02 2.196e+02 3.105e+02, threshold=4.016e+02, percent-clipped=0.0 2023-10-13 06:48:07,916 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.12 vs. limit=15.0 2023-10-13 06:48:16,256 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=5.18 vs. 
limit=12.0 2023-10-13 06:48:42,004 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1305271.3333333333, ans=0.125 2023-10-13 06:49:40,443 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1305364.6666666667, ans=0.04949747468305833 2023-10-13 06:49:40,580 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.82 vs. limit=6.0 2023-10-13 06:50:10,183 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1305458.0, ans=0.125 2023-10-13 06:50:20,878 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.86 vs. limit=22.5 2023-10-13 06:51:00,060 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.50 vs. limit=15.0 2023-10-13 06:51:00,356 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1305551.3333333333, ans=0.2 2023-10-13 06:51:08,148 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1305551.3333333333, ans=0.07 2023-10-13 06:51:11,002 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.580e+02 1.776e+02 1.935e+02 2.243e+02 2.884e+02, threshold=3.869e+02, percent-clipped=0.0 2023-10-13 06:51:19,164 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 06:51:30,364 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 06:52:16,546 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1305691.3333333333, ans=0.2 2023-10-13 06:52:34,991 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1305784.6666666667, ans=0.0 2023-10-13 06:52:37,163 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.91 vs. 
limit=12.0 2023-10-13 06:52:41,699 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1305784.6666666667, ans=0.125 2023-10-13 06:52:41,721 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1305784.6666666667, ans=0.125 2023-10-13 06:53:42,955 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1305924.6666666667, ans=0.125 2023-10-13 06:54:14,246 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1305971.3333333333, ans=0.125 2023-10-13 06:54:21,706 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=1306018.0, ans=15.0 2023-10-13 06:54:33,115 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.444e+02 1.771e+02 1.941e+02 2.216e+02 2.930e+02, threshold=3.882e+02, percent-clipped=0.0 2023-10-13 06:54:38,780 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1306064.6666666667, ans=0.0 2023-10-13 06:54:55,029 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1306064.6666666667, ans=0.125 2023-10-13 06:55:00,803 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=1306111.3333333333, ans=10.0 2023-10-13 06:55:36,851 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1306204.6666666667, ans=0.2 2023-10-13 06:55:38,604 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1306204.6666666667, ans=0.125 2023-10-13 06:56:00,134 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.75 vs. limit=12.0 2023-10-13 06:56:04,424 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1306251.3333333333, ans=0.0 2023-10-13 06:56:07,175 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.52 vs. limit=10.0 2023-10-13 06:56:33,657 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=4.95 vs. 
limit=15.0 2023-10-13 06:57:23,185 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1306438.0, ans=0.0 2023-10-13 06:57:29,060 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1306438.0, ans=0.125 2023-10-13 06:57:35,776 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1306438.0, ans=0.05 2023-10-13 06:57:57,786 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.363e+02 1.703e+02 1.862e+02 2.052e+02 3.179e+02, threshold=3.723e+02, percent-clipped=0.0 2023-10-13 06:59:05,661 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1306578.0, ans=0.125 2023-10-13 07:00:14,331 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1306718.0, ans=0.0 2023-10-13 07:01:39,628 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1306858.0, ans=0.0 2023-10-13 07:02:03,376 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1306904.6666666667, ans=0.0 2023-10-13 07:02:03,481 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=1306904.6666666667, ans=0.5 2023-10-13 07:02:31,326 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.414e+02 1.802e+02 1.962e+02 2.271e+02 2.953e+02, threshold=3.924e+02, percent-clipped=0.0 2023-10-13 07:02:48,022 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1306998.0, ans=0.125 2023-10-13 07:02:48,044 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1306998.0, ans=0.1 2023-10-13 07:03:03,194 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1307044.6666666667, ans=0.1 2023-10-13 07:03:04,242 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1307044.6666666667, ans=0.035 2023-10-13 07:03:08,307 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1307044.6666666667, ans=0.015 2023-10-13 07:03:13,807 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1307044.6666666667, ans=0.1 2023-10-13 07:03:40,110 INFO [train.py:1031] (3/4) Epoch 21, batch 7000, loss[loss=0.194, simple_loss=0.2884, pruned_loss=0.04982, over 16501.00 frames. ], tot_loss[loss=0.1904, simple_loss=0.2819, pruned_loss=0.04945, over 31817868.62 frames. ], batch size: 50, lr: 1.65e-03, grad_scale: 32.0 2023-10-13 07:04:55,273 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.74 vs. limit=15.0 2023-10-13 07:05:06,940 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.96 vs. 
limit=15.0 2023-10-13 07:06:03,330 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1307418.0, ans=0.1 2023-10-13 07:06:22,806 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.513e+02 1.764e+02 1.976e+02 2.167e+02 3.433e+02, threshold=3.953e+02, percent-clipped=0.0 2023-10-13 07:06:31,846 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1307464.6666666667, ans=0.125 2023-10-13 07:06:34,071 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=1307464.6666666667, ans=10.0 2023-10-13 07:07:03,384 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1307511.3333333333, ans=0.1 2023-10-13 07:07:35,389 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1307558.0, ans=0.0 2023-10-13 07:07:35,503 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1307558.0, ans=0.0 2023-10-13 07:07:51,623 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1307604.6666666667, ans=0.1 2023-10-13 07:07:53,810 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1307604.6666666667, ans=0.125 2023-10-13 07:08:25,302 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1307698.0, ans=0.2 2023-10-13 07:08:48,931 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1307744.6666666667, ans=0.0 2023-10-13 07:08:48,950 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1307744.6666666667, ans=0.125 2023-10-13 07:09:00,306 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1307791.3333333333, ans=0.0 2023-10-13 07:09:29,590 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.37 vs. limit=22.5 2023-10-13 07:09:44,351 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.572e+02 1.860e+02 2.024e+02 2.276e+02 3.097e+02, threshold=4.048e+02, percent-clipped=0.0 2023-10-13 07:09:49,611 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.66 vs. 
limit=6.0 2023-10-13 07:10:38,336 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1308071.3333333333, ans=0.125 2023-10-13 07:10:40,065 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1308118.0, ans=0.125 2023-10-13 07:10:57,327 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1308164.6666666667, ans=0.125 2023-10-13 07:11:00,275 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 07:11:25,490 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1308258.0, ans=0.0 2023-10-13 07:11:36,718 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1308304.6666666667, ans=0.07 2023-10-13 07:11:42,795 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1308304.6666666667, ans=0.0 2023-10-13 07:11:43,841 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1308304.6666666667, ans=0.035 2023-10-13 07:11:54,133 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff3.min_abs, batch_count=1308351.3333333333, ans=0.2 2023-10-13 07:12:01,422 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.427e+02 1.735e+02 1.954e+02 2.252e+02 3.130e+02, threshold=3.908e+02, percent-clipped=0.0 2023-10-13 07:12:02,677 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1308398.0, ans=0.0 2023-10-13 07:12:33,376 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1308491.3333333333, ans=0.125 2023-10-13 07:12:42,018 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1308538.0, ans=0.125 2023-10-13 07:12:56,295 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1308584.6666666667, ans=0.0 2023-10-13 07:12:56,399 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1308584.6666666667, ans=0.0 2023-10-13 07:14:06,869 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.434e+02 1.757e+02 1.928e+02 2.175e+02 2.976e+02, threshold=3.856e+02, percent-clipped=0.0 2023-10-13 07:14:08,165 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1308864.6666666667, ans=0.2 2023-10-13 07:14:58,795 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1309051.3333333333, ans=0.0 2023-10-13 07:15:00,450 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1309051.3333333333, ans=0.125 2023-10-13 07:15:03,941 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1309098.0, ans=0.1 2023-10-13 07:15:05,024 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1309098.0, ans=0.07 2023-10-13 07:15:06,938 INFO 
[scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1309098.0, ans=0.125 2023-10-13 07:15:13,302 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.45 vs. limit=10.0 2023-10-13 07:15:14,206 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.57 vs. limit=15.0 2023-10-13 07:15:30,957 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1309191.3333333333, ans=0.125 2023-10-13 07:15:33,113 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1309191.3333333333, ans=0.0 2023-10-13 07:15:41,573 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1309238.0, ans=0.0 2023-10-13 07:15:49,047 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1309284.6666666667, ans=0.125 2023-10-13 07:15:55,535 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1309284.6666666667, ans=0.2 2023-10-13 07:15:57,094 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.510e+02 1.802e+02 1.981e+02 2.180e+02 3.151e+02, threshold=3.962e+02, percent-clipped=0.0 2023-10-13 07:16:06,023 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.80 vs. limit=15.0 2023-10-13 07:16:23,501 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1309424.6666666667, ans=0.0 2023-10-13 07:16:26,216 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1309424.6666666667, ans=0.0 2023-10-13 07:16:32,216 INFO [train.py:1031] (3/4) Epoch 21, batch 7500, loss[loss=0.2485, simple_loss=0.3113, pruned_loss=0.0929, over 15642.00 frames. ], tot_loss[loss=0.1907, simple_loss=0.282, pruned_loss=0.04968, over 32019942.62 frames. 
], batch size: 350, lr: 1.64e-03, grad_scale: 16.0 2023-10-13 07:16:45,450 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1309518.0, ans=0.07 2023-10-13 07:16:53,329 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1309564.6666666667, ans=10.0 2023-10-13 07:16:55,739 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1309564.6666666667, ans=0.125 2023-10-13 07:17:05,800 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1309611.3333333333, ans=0.2 2023-10-13 07:17:08,598 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1309611.3333333333, ans=0.1 2023-10-13 07:17:17,863 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1309658.0, ans=0.125 2023-10-13 07:17:45,915 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1309751.3333333333, ans=0.5 2023-10-13 07:17:46,222 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.11 vs. limit=22.5 2023-10-13 07:17:50,398 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.530e+02 1.766e+02 1.921e+02 2.088e+02 2.962e+02, threshold=3.842e+02, percent-clipped=0.0 2023-10-13 07:18:00,640 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1309844.6666666667, ans=0.1 2023-10-13 07:18:10,884 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1309891.3333333333, ans=0.125 2023-10-13 07:18:28,158 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1309938.0, ans=0.125 2023-10-13 07:19:05,943 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1310078.0, ans=0.1 2023-10-13 07:19:18,776 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1310124.6666666667, ans=0.0 2023-10-13 07:19:24,156 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1310124.6666666667, ans=0.125 2023-10-13 07:19:33,161 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1310171.3333333333, ans=0.0 2023-10-13 07:19:39,001 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1310171.3333333333, ans=0.125 2023-10-13 07:19:57,889 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.400e+02 1.727e+02 1.909e+02 2.192e+02 2.863e+02, threshold=3.817e+02, percent-clipped=0.0 2023-10-13 07:20:04,600 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1310264.6666666667, ans=0.0 2023-10-13 07:20:06,651 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1310264.6666666667, ans=0.0 2023-10-13 
07:20:39,544 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 07:20:49,036 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.65 vs. limit=22.5 2023-10-13 07:21:15,064 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.86 vs. limit=15.0 2023-10-13 07:21:39,917 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1310638.0, ans=0.2 2023-10-13 07:21:55,263 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.518e+02 1.732e+02 1.923e+02 2.032e+02 2.652e+02, threshold=3.847e+02, percent-clipped=0.0 2023-10-13 07:21:55,684 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1310731.3333333333, ans=0.125 2023-10-13 07:22:02,838 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1310731.3333333333, ans=0.0 2023-10-13 07:22:03,993 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1310778.0, ans=0.1 2023-10-13 07:22:34,643 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1310871.3333333333, ans=0.05 2023-10-13 07:22:49,843 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.28 vs. limit=15.0 2023-10-13 07:22:55,549 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.09 vs. limit=22.5 2023-10-13 07:23:02,363 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.22 vs. limit=15.0 2023-10-13 07:23:21,349 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1311058.0, ans=0.125 2023-10-13 07:23:53,677 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.489e+02 1.773e+02 1.982e+02 2.214e+02 2.993e+02, threshold=3.964e+02, percent-clipped=0.0 2023-10-13 07:23:54,933 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1311198.0, ans=0.2 2023-10-13 07:23:56,946 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1311198.0, ans=0.1 2023-10-13 07:24:23,548 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1311338.0, ans=0.125 2023-10-13 07:24:31,237 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1311338.0, ans=0.0 2023-10-13 07:24:41,320 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.13 vs. 
limit=15.0 2023-10-13 07:25:26,930 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1311571.3333333333, ans=0.125 2023-10-13 07:25:40,084 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1311618.0, ans=0.125 2023-10-13 07:25:48,227 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=1311664.6666666667, ans=10.0 2023-10-13 07:25:51,098 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.443e+02 1.675e+02 1.765e+02 1.925e+02 2.697e+02, threshold=3.531e+02, percent-clipped=0.0 2023-10-13 07:26:18,361 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1311758.0, ans=0.125 2023-10-13 07:26:22,917 INFO [train.py:1031] (3/4) Epoch 21, batch 8000, loss[loss=0.1805, simple_loss=0.2732, pruned_loss=0.04395, over 16831.00 frames. ], tot_loss[loss=0.1898, simple_loss=0.2814, pruned_loss=0.04912, over 32201494.38 frames. ], batch size: 175, lr: 1.64e-03, grad_scale: 16.0 2023-10-13 07:26:48,784 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1311898.0, ans=0.125 2023-10-13 07:27:10,232 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1311991.3333333333, ans=0.0 2023-10-13 07:27:16,617 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1312038.0, ans=0.125 2023-10-13 07:27:17,415 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1312038.0, ans=0.125 2023-10-13 07:27:26,266 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1312038.0, ans=0.05 2023-10-13 07:27:41,157 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.472e+02 1.690e+02 1.891e+02 2.152e+02 3.196e+02, threshold=3.782e+02, percent-clipped=0.0 2023-10-13 07:28:03,722 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1312224.6666666667, ans=0.0 2023-10-13 07:28:09,149 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1312224.6666666667, ans=0.125 2023-10-13 07:28:13,453 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1312271.3333333333, ans=0.125 2023-10-13 07:28:38,720 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1312364.6666666667, ans=0.5 2023-10-13 07:28:44,052 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1312411.3333333333, ans=0.0 2023-10-13 07:29:22,844 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1312504.6666666667, ans=0.0 2023-10-13 07:29:29,908 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1312551.3333333333, ans=0.125 2023-10-13 07:29:34,936 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, 
batch_count=1312551.3333333333, ans=0.125 2023-10-13 07:29:46,867 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.432e+02 1.781e+02 1.990e+02 2.138e+02 2.951e+02, threshold=3.979e+02, percent-clipped=0.0 2023-10-13 07:29:47,483 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1312598.0, ans=0.1 2023-10-13 07:29:48,450 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 07:29:51,626 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=5.69 vs. limit=15.0 2023-10-13 07:30:13,566 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1312691.3333333333, ans=0.125 2023-10-13 07:30:16,215 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1312691.3333333333, ans=0.125 2023-10-13 07:30:37,451 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1312784.6666666667, ans=0.125 2023-10-13 07:30:40,738 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1312784.6666666667, ans=0.0 2023-10-13 07:31:04,473 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.44 vs. limit=12.0 2023-10-13 07:31:06,346 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1312924.6666666667, ans=0.125 2023-10-13 07:31:14,834 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1312924.6666666667, ans=0.2 2023-10-13 07:31:17,686 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.76 vs. limit=15.0 2023-10-13 07:31:24,863 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.78 vs. limit=15.0 2023-10-13 07:31:35,532 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1313018.0, ans=0.125 2023-10-13 07:31:40,638 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.495e+02 1.749e+02 1.975e+02 2.410e+02 3.621e+02, threshold=3.950e+02, percent-clipped=0.0 2023-10-13 07:31:44,150 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1313064.6666666667, ans=10.0 2023-10-13 07:31:50,947 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1313111.3333333333, ans=0.125 2023-10-13 07:31:54,453 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.87 vs. 
limit=15.0 2023-10-13 07:31:57,690 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1313111.3333333333, ans=0.2 2023-10-13 07:32:10,224 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.27 vs. limit=6.0 2023-10-13 07:32:14,061 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1313204.6666666667, ans=0.0 2023-10-13 07:32:24,152 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1313204.6666666667, ans=0.0 2023-10-13 07:32:47,126 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1313298.0, ans=0.125 2023-10-13 07:33:26,373 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1313484.6666666667, ans=0.0 2023-10-13 07:33:36,188 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.550e+02 1.898e+02 2.062e+02 2.388e+02 2.998e+02, threshold=4.124e+02, percent-clipped=0.0 2023-10-13 07:33:50,238 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.71 vs. limit=15.0 2023-10-13 07:33:51,193 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1313578.0, ans=0.07 2023-10-13 07:33:58,824 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.02 vs. limit=15.0 2023-10-13 07:34:03,323 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.65 vs. limit=10.0 2023-10-13 07:34:17,053 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1313671.3333333333, ans=0.125 2023-10-13 07:34:37,731 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1313764.6666666667, ans=0.2 2023-10-13 07:34:42,000 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1313764.6666666667, ans=0.125 2023-10-13 07:34:50,656 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1313811.3333333333, ans=0.125 2023-10-13 07:34:53,261 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1313811.3333333333, ans=0.2 2023-10-13 07:35:01,626 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1313858.0, ans=0.0 2023-10-13 07:35:08,392 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1313904.6666666667, ans=0.0 2023-10-13 07:35:11,822 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1313904.6666666667, ans=0.125 2023-10-13 07:35:14,709 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.74 vs. 
limit=15.0 2023-10-13 07:35:16,649 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.48 vs. limit=15.0 2023-10-13 07:35:31,656 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.476e+02 1.823e+02 1.977e+02 2.190e+02 3.159e+02, threshold=3.954e+02, percent-clipped=0.0 2023-10-13 07:35:39,771 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1314044.6666666667, ans=0.1 2023-10-13 07:36:08,435 INFO [train.py:1031] (3/4) Epoch 21, batch 8500, loss[loss=0.1736, simple_loss=0.2657, pruned_loss=0.04072, over 16386.00 frames. ], tot_loss[loss=0.1897, simple_loss=0.2816, pruned_loss=0.04888, over 32363628.47 frames. ], batch size: 50, lr: 1.64e-03, grad_scale: 32.0 2023-10-13 07:36:13,719 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1314138.0, ans=0.1 2023-10-13 07:36:15,460 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1314138.0, ans=0.125 2023-10-13 07:36:19,006 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1314184.6666666667, ans=0.2 2023-10-13 07:36:37,638 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=1314231.3333333333, ans=0.95 2023-10-13 07:36:56,398 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1314324.6666666667, ans=0.0 2023-10-13 07:37:17,769 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1314418.0, ans=0.125 2023-10-13 07:37:24,793 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1314418.0, ans=0.1 2023-10-13 07:37:31,162 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.542e+02 1.788e+02 1.945e+02 2.120e+02 2.906e+02, threshold=3.890e+02, percent-clipped=0.0 2023-10-13 07:37:44,872 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 07:38:18,499 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1314604.6666666667, ans=0.1 2023-10-13 07:38:21,908 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1314604.6666666667, ans=0.125 2023-10-13 07:38:26,063 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1314651.3333333333, ans=0.125 2023-10-13 07:38:33,989 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1314698.0, ans=0.125 2023-10-13 07:38:38,086 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1314698.0, ans=0.1 2023-10-13 07:38:39,088 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1314698.0, ans=0.125 2023-10-13 07:38:55,671 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, 
batch_count=1314744.6666666667, ans=0.0 2023-10-13 07:39:01,067 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 07:39:03,913 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1314791.3333333333, ans=0.125 2023-10-13 07:39:03,915 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1314791.3333333333, ans=0.125 2023-10-13 07:39:05,165 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.24 vs. limit=15.0 2023-10-13 07:39:36,786 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1314931.3333333333, ans=0.0 2023-10-13 07:39:37,417 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.468e+02 1.784e+02 1.977e+02 2.317e+02 3.173e+02, threshold=3.955e+02, percent-clipped=0.0 2023-10-13 07:40:12,281 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1315071.3333333333, ans=0.125 2023-10-13 07:40:12,317 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1315071.3333333333, ans=0.0 2023-10-13 07:40:53,892 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.31 vs. limit=15.0 2023-10-13 07:40:54,596 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1315211.3333333333, ans=0.125 2023-10-13 07:41:09,888 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=4.84 vs. 
limit=15.0 2023-10-13 07:41:22,534 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1315351.3333333333, ans=0.0 2023-10-13 07:41:26,521 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1315351.3333333333, ans=0.1 2023-10-13 07:41:39,318 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff3.min_abs, batch_count=1315398.0, ans=0.2 2023-10-13 07:41:39,959 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.437e+02 1.660e+02 1.792e+02 1.997e+02 2.662e+02, threshold=3.584e+02, percent-clipped=0.0 2023-10-13 07:42:26,203 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1315584.6666666667, ans=0.2 2023-10-13 07:42:39,926 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1315631.3333333333, ans=0.125 2023-10-13 07:42:50,729 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1315678.0, ans=0.2 2023-10-13 07:42:53,385 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1315678.0, ans=0.125 2023-10-13 07:43:06,765 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1315771.3333333333, ans=0.125 2023-10-13 07:43:09,597 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1315771.3333333333, ans=0.0 2023-10-13 07:43:10,359 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1315771.3333333333, ans=0.125 2023-10-13 07:43:14,971 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1315771.3333333333, ans=0.2 2023-10-13 07:43:29,119 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.389e+02 1.691e+02 1.902e+02 2.166e+02 3.080e+02, threshold=3.803e+02, percent-clipped=0.0 2023-10-13 07:43:32,710 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 07:44:00,459 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.64 vs. limit=10.0 2023-10-13 07:44:12,400 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.47 vs. 
limit=22.5 2023-10-13 07:44:22,454 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1316098.0, ans=0.2 2023-10-13 07:44:27,448 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1316098.0, ans=0.2 2023-10-13 07:44:30,223 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_ff3.min_abs, batch_count=1316098.0, ans=0.2 2023-10-13 07:44:32,042 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1316144.6666666667, ans=0.07 2023-10-13 07:44:42,272 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1316144.6666666667, ans=0.125 2023-10-13 07:44:47,715 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 07:44:47,757 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1316191.3333333333, ans=0.0 2023-10-13 07:44:49,871 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1316191.3333333333, ans=0.125 2023-10-13 07:45:03,526 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.67 vs. limit=22.5 2023-10-13 07:45:06,725 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1316284.6666666667, ans=0.125 2023-10-13 07:45:19,499 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.447e+02 1.777e+02 1.942e+02 2.146e+02 3.297e+02, threshold=3.884e+02, percent-clipped=0.0 2023-10-13 07:45:32,742 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.57 vs. limit=22.5 2023-10-13 07:45:49,495 INFO [train.py:1031] (3/4) Epoch 21, batch 9000, loss[loss=0.1823, simple_loss=0.2735, pruned_loss=0.04556, over 16357.00 frames. ], tot_loss[loss=0.189, simple_loss=0.2808, pruned_loss=0.04857, over 32459013.36 frames. ], batch size: 50, lr: 1.64e-03, grad_scale: 32.0 2023-10-13 07:46:01,699 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=10.33 vs. limit=22.5 2023-10-13 07:46:12,047 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1316564.6666666667, ans=0.2 2023-10-13 07:46:13,444 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1316564.6666666667, ans=0.5 2023-10-13 07:46:15,779 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.53 vs. limit=15.0 2023-10-13 07:46:15,873 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.13 vs. 
limit=10.0 2023-10-13 07:47:04,845 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1316798.0, ans=0.125 2023-10-13 07:47:07,159 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.488e+02 1.761e+02 1.924e+02 2.106e+02 2.641e+02, threshold=3.847e+02, percent-clipped=0.0 2023-10-13 07:47:07,784 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.86 vs. limit=15.0 2023-10-13 07:47:23,859 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1316844.6666666667, ans=0.04949747468305833 2023-10-13 07:47:27,431 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.58 vs. limit=15.0 2023-10-13 07:48:07,367 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.28 vs. limit=22.5 2023-10-13 07:48:14,440 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1317078.0, ans=0.05 2023-10-13 07:48:24,339 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.24 vs. limit=6.0 2023-10-13 07:48:36,100 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1317171.3333333333, ans=0.04949747468305833 2023-10-13 07:48:50,440 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1317264.6666666667, ans=0.125 2023-10-13 07:48:53,067 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.511e+02 1.815e+02 1.967e+02 2.149e+02 2.760e+02, threshold=3.934e+02, percent-clipped=0.0 2023-10-13 07:48:54,276 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1317264.6666666667, ans=0.125 2023-10-13 07:49:10,316 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1317311.3333333333, ans=0.0 2023-10-13 07:49:26,771 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1317404.6666666667, ans=0.1 2023-10-13 07:49:26,818 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1317404.6666666667, ans=0.1 2023-10-13 07:49:29,681 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1317404.6666666667, ans=0.0 2023-10-13 07:49:30,835 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.44 vs. 
limit=15.0 2023-10-13 07:49:33,823 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1317451.3333333333, ans=0.0 2023-10-13 07:49:46,142 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1317498.0, ans=0.2 2023-10-13 07:49:47,129 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1317498.0, ans=0.1 2023-10-13 07:49:47,589 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.25 vs. limit=5.0 2023-10-13 07:49:53,796 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1317498.0, ans=0.125 2023-10-13 07:49:58,601 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.06 vs. limit=15.0 2023-10-13 07:50:17,141 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.94 vs. limit=15.0 2023-10-13 07:50:37,406 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.36 vs. limit=15.0 2023-10-13 07:50:37,765 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.523e+02 1.808e+02 1.948e+02 2.163e+02 3.903e+02, threshold=3.897e+02, percent-clipped=0.0 2023-10-13 07:50:52,458 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1317778.0, ans=0.125 2023-10-13 07:50:57,290 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1317824.6666666667, ans=0.0 2023-10-13 07:51:02,450 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1317824.6666666667, ans=0.125 2023-10-13 07:51:04,190 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 07:51:19,082 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1317918.0, ans=0.125 2023-10-13 07:51:53,804 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1318058.0, ans=0.2 2023-10-13 07:51:54,746 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.63 vs. 
limit=22.5 2023-10-13 07:51:58,995 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1318058.0, ans=0.0 2023-10-13 07:52:35,984 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.557e+02 1.827e+02 1.968e+02 2.182e+02 3.020e+02, threshold=3.936e+02, percent-clipped=0.0 2023-10-13 07:52:41,861 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1318198.0, ans=0.125 2023-10-13 07:52:50,437 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1318244.6666666667, ans=0.125 2023-10-13 07:52:52,414 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1318244.6666666667, ans=0.0 2023-10-13 07:52:54,576 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1318291.3333333333, ans=0.0 2023-10-13 07:53:02,293 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1318291.3333333333, ans=0.0 2023-10-13 07:53:08,694 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1318338.0, ans=0.1 2023-10-13 07:53:45,930 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1318478.0, ans=0.0 2023-10-13 07:53:51,100 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1318524.6666666667, ans=0.0 2023-10-13 07:54:00,603 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1318524.6666666667, ans=0.0 2023-10-13 07:54:07,677 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.11 vs. limit=22.5 2023-10-13 07:54:18,943 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1318618.0, ans=0.125 2023-10-13 07:54:23,916 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1318618.0, ans=0.015 2023-10-13 07:54:31,530 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.426e+02 1.841e+02 1.999e+02 2.301e+02 3.065e+02, threshold=3.999e+02, percent-clipped=0.0 2023-10-13 07:54:56,535 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1318758.0, ans=0.0 2023-10-13 07:54:58,595 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.94 vs. limit=10.0 2023-10-13 07:55:05,984 INFO [train.py:1031] (3/4) Epoch 21, batch 9500, loss[loss=0.1977, simple_loss=0.2922, pruned_loss=0.05159, over 16859.00 frames. ], tot_loss[loss=0.1897, simple_loss=0.2816, pruned_loss=0.04892, over 32519415.04 frames. ], batch size: 130, lr: 1.64e-03, grad_scale: 32.0 2023-10-13 07:55:11,925 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.17 vs. 
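The train.py loss lines decompose consistently as loss = 0.5 * simple_loss + pruned_loss; for the batch 9500 entry above, 0.5 * 0.2922 + 0.05159 = 0.1977, and the running tot_loss fields obey the same relation. This is the shape of a pruned-transducer objective in which the cheap "simple" (linearly-combined) lattice term is down-weighted against the exact term computed on the pruned lattice. A sketch of the combination (the 0.5 scale is read off the logged numbers; the function name is ours):

    def combined_transducer_loss(simple_loss, pruned_loss, simple_loss_scale=0.5):
        # Weighted sum reported in the "loss[...]" / "tot_loss[...]" fields.
        return simple_loss_scale * simple_loss + pruned_loss

    # Spot-checks against the entries logged above:
    assert abs(combined_transducer_loss(0.2922, 0.05159) - 0.1977) < 1e-3
    assert abs(combined_transducer_loss(0.2816, 0.04892) - 0.1897) < 1e-3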
limit=22.5 2023-10-13 07:55:21,882 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1318851.3333333333, ans=0.125 2023-10-13 07:55:31,242 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1318898.0, ans=0.5 2023-10-13 07:55:44,247 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1318944.6666666667, ans=0.0 2023-10-13 07:55:49,358 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1318991.3333333333, ans=0.0 2023-10-13 07:56:04,225 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1319038.0, ans=0.1 2023-10-13 07:56:16,551 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1319084.6666666667, ans=0.125 2023-10-13 07:56:26,230 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.449e+02 1.750e+02 1.942e+02 2.191e+02 2.759e+02, threshold=3.884e+02, percent-clipped=0.0 2023-10-13 07:56:32,096 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1319131.3333333333, ans=0.025 2023-10-13 07:56:35,403 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1319178.0, ans=0.5 2023-10-13 07:56:36,334 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1319178.0, ans=0.125 2023-10-13 07:56:40,200 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1319178.0, ans=0.125 2023-10-13 07:56:40,204 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1319178.0, ans=0.0 2023-10-13 07:56:44,402 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1319178.0, ans=0.125 2023-10-13 07:56:46,853 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1319224.6666666667, ans=0.0 2023-10-13 07:56:46,936 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1319224.6666666667, ans=0.0 2023-10-13 07:57:13,537 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1319318.0, ans=0.2 2023-10-13 07:57:28,926 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1319364.6666666667, ans=0.125 2023-10-13 07:57:36,682 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1319411.3333333333, ans=0.0 2023-10-13 07:57:40,432 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1319411.3333333333, ans=0.125 2023-10-13 07:57:44,443 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1319458.0, ans=0.125 2023-10-13 07:57:45,553 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, 
batch_count=1319458.0, ans=0.0 2023-10-13 07:58:21,681 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.415e+02 1.766e+02 1.916e+02 2.137e+02 2.778e+02, threshold=3.832e+02, percent-clipped=0.0 2023-10-13 07:58:37,342 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 07:58:41,866 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1319691.3333333333, ans=0.0 2023-10-13 07:58:49,269 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1319691.3333333333, ans=0.125 2023-10-13 07:58:50,246 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1319691.3333333333, ans=0.0 2023-10-13 07:58:59,354 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 07:59:07,482 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1319784.6666666667, ans=0.125 2023-10-13 07:59:24,993 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 07:59:29,809 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1319878.0, ans=0.125 2023-10-13 07:59:44,810 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1319924.6666666667, ans=0.035 2023-10-13 08:00:10,931 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1320064.6666666667, ans=0.0 2023-10-13 08:00:11,656 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.434e+02 1.767e+02 1.911e+02 2.117e+02 3.188e+02, threshold=3.821e+02, percent-clipped=0.0 2023-10-13 08:00:21,310 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1320064.6666666667, ans=0.125 2023-10-13 08:00:27,799 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1320111.3333333333, ans=0.1 2023-10-13 08:00:33,500 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1320111.3333333333, ans=0.125 2023-10-13 08:00:54,674 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1320204.6666666667, ans=0.125 2023-10-13 08:01:04,558 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1320251.3333333333, ans=0.0 2023-10-13 08:01:09,594 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.08 vs. limit=15.0 2023-10-13 08:01:14,174 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1320298.0, ans=0.125 2023-10-13 08:01:24,286 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.93 vs. 
limit=6.0 2023-10-13 08:01:30,367 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1320391.3333333333, ans=0.0 2023-10-13 08:02:01,807 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.86 vs. limit=10.0 2023-10-13 08:02:04,156 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1320531.3333333333, ans=0.2 2023-10-13 08:02:05,822 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.514e+02 1.740e+02 1.884e+02 2.038e+02 2.676e+02, threshold=3.768e+02, percent-clipped=0.0 2023-10-13 08:02:06,249 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1320531.3333333333, ans=0.0 2023-10-13 08:02:17,085 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1320578.0, ans=0.125 2023-10-13 08:02:45,677 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1320671.3333333333, ans=0.0 2023-10-13 08:02:47,309 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1320671.3333333333, ans=0.2 2023-10-13 08:02:59,058 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1320718.0, ans=0.125 2023-10-13 08:03:00,868 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1320764.6666666667, ans=0.2 2023-10-13 08:03:07,656 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1320764.6666666667, ans=0.0 2023-10-13 08:03:08,762 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=3.216e-02 2023-10-13 08:03:11,435 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.77 vs. limit=15.0 2023-10-13 08:03:25,051 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1320858.0, ans=0.0 2023-10-13 08:03:42,587 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1320951.3333333333, ans=0.125 2023-10-13 08:03:55,293 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.448e+02 1.729e+02 1.864e+02 2.107e+02 2.605e+02, threshold=3.728e+02, percent-clipped=0.0 2023-10-13 08:04:02,144 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 08:04:24,694 INFO [train.py:1031] (3/4) Epoch 21, batch 10000, loss[loss=0.1726, simple_loss=0.2695, pruned_loss=0.03784, over 16887.00 frames. ], tot_loss[loss=0.1891, simple_loss=0.2808, pruned_loss=0.04872, over 32552653.23 frames. 
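tot_loss[...] is reported "over" a frame count that keeps growing (32.459M frames at batch 9000, 32.552M at batch 10000), i.e. it behaves as a frame-weighted running average of recent per-batch losses rather than a single-batch value. A hedged sketch of one such tracker — the exponential decay factor is an assumption; train.py's exact bookkeeping may differ:

    class RunningLoss:
        # Frame-weighted running average with exponential forgetting, so the
        # "over N frames" count grows while old batches slowly fade out.
        def __init__(self, decay=0.999):
            self.decay = decay
            self.loss_sum = 0.0   # decayed sum of loss * frames
            self.frames = 0.0     # decayed frame count

        def update(self, batch_loss, batch_frames):
            self.loss_sum = self.decay * self.loss_sum + batch_loss * batch_frames
            self.frames = self.decay * self.frames + batch_frames
            return self.loss_sum / self.frames  # the logged tot_loss

    tracker = RunningLoss()
    print(tracker.update(0.1726, 16887.0))  # first batch: equals its own loss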
], batch size: 104, lr: 1.64e-03, grad_scale: 32.0 2023-10-13 08:04:35,313 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1321184.6666666667, ans=0.125 2023-10-13 08:04:57,928 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1321278.0, ans=0.125 2023-10-13 08:05:23,821 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.23 vs. limit=22.5 2023-10-13 08:05:30,195 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1321418.0, ans=0.125 2023-10-13 08:05:33,905 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1321418.0, ans=0.1 2023-10-13 08:05:35,177 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.63 vs. limit=15.0 2023-10-13 08:05:37,358 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1321418.0, ans=0.2 2023-10-13 08:05:40,312 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1321464.6666666667, ans=0.125 2023-10-13 08:05:44,411 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.582e+02 1.900e+02 2.109e+02 2.308e+02 3.153e+02, threshold=4.218e+02, percent-clipped=0.0 2023-10-13 08:05:49,484 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.95 vs. limit=15.0 2023-10-13 08:06:05,965 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 08:06:12,710 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.77 vs. limit=10.0 2023-10-13 08:06:25,395 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.13 vs. limit=12.0 2023-10-13 08:06:25,424 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.30 vs. limit=22.5 2023-10-13 08:06:59,563 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1321744.6666666667, ans=0.125 2023-10-13 08:07:27,300 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=5.72 vs. limit=15.0 2023-10-13 08:07:34,972 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1321931.3333333333, ans=0.125 2023-10-13 08:07:37,435 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.458e+02 1.771e+02 1.900e+02 2.072e+02 2.870e+02, threshold=3.801e+02, percent-clipped=0.0 2023-10-13 08:07:43,978 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1321931.3333333333, ans=0.0 2023-10-13 08:07:53,983 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.14 vs. 
limit=22.5 2023-10-13 08:08:05,409 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1322024.6666666667, ans=0.125 2023-10-13 08:08:18,125 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1322071.3333333333, ans=0.125 2023-10-13 08:08:35,787 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1322164.6666666667, ans=0.125 2023-10-13 08:08:59,628 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1322258.0, ans=0.125 2023-10-13 08:09:00,946 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.36 vs. limit=15.0 2023-10-13 08:09:02,952 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1322258.0, ans=0.2 2023-10-13 08:09:14,568 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 08:09:32,370 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1322398.0, ans=0.0 2023-10-13 08:09:35,481 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.509e+02 1.818e+02 1.973e+02 2.168e+02 2.968e+02, threshold=3.946e+02, percent-clipped=0.0 2023-10-13 08:09:36,080 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1322398.0, ans=0.0 2023-10-13 08:09:45,630 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=1322444.6666666667, ans=10.0 2023-10-13 08:09:46,279 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1322444.6666666667, ans=0.0 2023-10-13 08:09:48,362 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1322444.6666666667, ans=0.1 2023-10-13 08:09:56,391 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1322491.3333333333, ans=0.125 2023-10-13 08:09:59,243 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=1322491.3333333333, ans=10.0 2023-10-13 08:10:15,748 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1322538.0, ans=0.2 2023-10-13 08:10:18,392 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1322538.0, ans=0.125 2023-10-13 08:10:34,820 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1322631.3333333333, ans=0.0 2023-10-13 08:10:51,756 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1322724.6666666667, ans=0.1 2023-10-13 08:11:09,590 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.05 vs. 
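The optim.py "Clipping_scale" lines log five quantiles of recent per-batch gradient norms followed by the active clipping threshold; throughout this stretch the threshold is exactly Clipping_scale times the logged median (e.g. 2.0 * 1.973e+02 = 3.946e+02 just above), with percent-clipped the share of batches whose norm exceeded it. A sketch of that rule, assuming the five numbers are min/25%/50%/75%/max over a sliding window of recent batches:

    import numpy as np

    def clip_stats(recent_norms, clipping_scale=2.0):
        # Five-number summary of recent gradient norms, plus the threshold
        # derived from the median; gradients above it get scaled down.
        qs = np.percentile(recent_norms, [0, 25, 50, 75, 100])
        threshold = clipping_scale * qs[2]
        pct_clipped = 100.0 * np.mean(recent_norms > threshold)
        return qs, threshold, pct_clipped

    norms = np.random.lognormal(mean=np.log(195.0), sigma=0.1, size=2000)
    qs, thr, pct = clip_stats(norms)
    print(" ".join(f"{q:.3e}" for q in qs),
          f"threshold={thr:.3e}, percent-clipped={pct:.1f}")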
limit=15.0 2023-10-13 08:11:19,006 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=4.10 vs. limit=12.0 2023-10-13 08:11:30,818 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.458e+02 1.735e+02 1.843e+02 2.021e+02 2.917e+02, threshold=3.686e+02, percent-clipped=0.0 2023-10-13 08:11:36,723 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1322864.6666666667, ans=0.0 2023-10-13 08:11:55,960 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1322958.0, ans=0.125 2023-10-13 08:12:05,756 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1323004.6666666667, ans=0.125 2023-10-13 08:12:12,945 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1323004.6666666667, ans=0.125 2023-10-13 08:12:22,176 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1323051.3333333333, ans=0.125 2023-10-13 08:12:23,618 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.59 vs. limit=22.5 2023-10-13 08:12:28,189 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1323098.0, ans=0.125 2023-10-13 08:12:42,091 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1323144.6666666667, ans=0.125 2023-10-13 08:13:03,250 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1323238.0, ans=0.0 2023-10-13 08:13:04,355 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1323238.0, ans=0.125 2023-10-13 08:13:22,501 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.80 vs. limit=15.0 2023-10-13 08:13:28,787 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1323331.3333333333, ans=0.125 2023-10-13 08:13:30,330 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.429e+02 1.714e+02 1.860e+02 2.070e+02 2.864e+02, threshold=3.719e+02, percent-clipped=0.0 2023-10-13 08:13:32,299 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1323331.3333333333, ans=0.1 2023-10-13 08:13:43,547 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1323378.0, ans=10.0 2023-10-13 08:13:45,605 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.34 vs. 
limit=10.0 2023-10-13 08:13:50,111 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1323424.6666666667, ans=0.0 2023-10-13 08:13:56,131 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1323424.6666666667, ans=0.0 2023-10-13 08:13:58,682 INFO [train.py:1031] (3/4) Epoch 21, batch 10500, loss[loss=0.1894, simple_loss=0.2823, pruned_loss=0.04829, over 16894.00 frames. ], tot_loss[loss=0.1897, simple_loss=0.2814, pruned_loss=0.04901, over 32589790.86 frames. ], batch size: 116, lr: 1.64e-03, grad_scale: 16.0 2023-10-13 08:14:12,007 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1323518.0, ans=0.125 2023-10-13 08:14:13,011 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1323518.0, ans=0.0 2023-10-13 08:14:26,960 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=1323564.6666666667, ans=0.025 2023-10-13 08:14:32,182 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1323611.3333333333, ans=0.125 2023-10-13 08:14:33,069 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1323611.3333333333, ans=0.0 2023-10-13 08:14:33,222 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1323611.3333333333, ans=0.025 2023-10-13 08:14:47,146 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1323658.0, ans=0.125 2023-10-13 08:14:57,307 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.51 vs. limit=10.0 2023-10-13 08:15:01,217 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.54 vs. limit=15.0 2023-10-13 08:15:01,730 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1323751.3333333333, ans=0.0 2023-10-13 08:15:21,969 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.524e+02 1.744e+02 1.870e+02 2.068e+02 3.196e+02, threshold=3.740e+02, percent-clipped=0.0 2023-10-13 08:15:22,870 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1323798.0, ans=0.2 2023-10-13 08:15:50,650 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1323891.3333333333, ans=0.125 2023-10-13 08:15:56,926 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.95 vs. 
limit=15.0 2023-10-13 08:15:58,427 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1323938.0, ans=0.04949747468305833 2023-10-13 08:16:06,235 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1323984.6666666667, ans=0.0 2023-10-13 08:16:07,253 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1323984.6666666667, ans=0.0 2023-10-13 08:16:15,774 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1323984.6666666667, ans=0.0 2023-10-13 08:16:38,000 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1324078.0, ans=0.125 2023-10-13 08:16:40,754 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.82 vs. limit=15.0 2023-10-13 08:16:54,501 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1324171.3333333333, ans=0.0 2023-10-13 08:16:57,027 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1324171.3333333333, ans=0.0 2023-10-13 08:17:06,424 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1324218.0, ans=0.125 2023-10-13 08:17:08,373 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1324218.0, ans=0.1 2023-10-13 08:17:18,793 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.464e+02 1.811e+02 1.996e+02 2.293e+02 3.645e+02, threshold=3.993e+02, percent-clipped=0.0 2023-10-13 08:17:29,457 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1324311.3333333333, ans=0.125 2023-10-13 08:17:29,779 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.61 vs. limit=15.0 2023-10-13 08:17:35,011 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.77 vs. limit=22.5 2023-10-13 08:17:55,886 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1324404.6666666667, ans=0.5 2023-10-13 08:18:32,297 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=14.97 vs. limit=22.5 2023-10-13 08:18:51,809 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1324638.0, ans=0.125 2023-10-13 08:18:52,884 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.99 vs. 
limit=22.5 2023-10-13 08:19:20,063 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.503e+02 1.774e+02 1.966e+02 2.176e+02 2.762e+02, threshold=3.932e+02, percent-clipped=0.0 2023-10-13 08:19:40,794 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1324824.6666666667, ans=0.04949747468305833 2023-10-13 08:19:54,211 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 08:20:12,158 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1324964.6666666667, ans=0.125 2023-10-13 08:20:12,194 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1324964.6666666667, ans=0.1 2023-10-13 08:20:14,348 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.01 vs. limit=15.0 2023-10-13 08:20:35,378 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1325058.0, ans=0.125 2023-10-13 08:20:59,803 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1325151.3333333333, ans=0.125 2023-10-13 08:21:06,884 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1325198.0, ans=0.1 2023-10-13 08:21:08,143 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=11.98 vs. limit=22.5 2023-10-13 08:21:13,409 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.487e+02 1.752e+02 1.939e+02 2.140e+02 2.799e+02, threshold=3.877e+02, percent-clipped=0.0 2023-10-13 08:21:34,843 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1325291.3333333333, ans=0.125 2023-10-13 08:21:46,262 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.60 vs. limit=15.0 2023-10-13 08:22:03,616 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1325431.3333333333, ans=0.2 2023-10-13 08:22:13,836 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1325478.0, ans=0.125 2023-10-13 08:22:50,477 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.94 vs. limit=15.0 2023-10-13 08:22:52,956 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1325618.0, ans=0.125 2023-10-13 08:22:54,106 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.96 vs. 
limit=22.5 2023-10-13 08:23:00,865 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.420e+02 1.738e+02 1.871e+02 2.120e+02 2.926e+02, threshold=3.742e+02, percent-clipped=0.0 2023-10-13 08:23:13,570 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1325711.3333333333, ans=0.0 2023-10-13 08:23:17,857 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1325758.0, ans=0.0 2023-10-13 08:23:24,769 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1325758.0, ans=0.0 2023-10-13 08:23:28,997 INFO [train.py:1031] (3/4) Epoch 21, batch 11000, loss[loss=0.1815, simple_loss=0.2797, pruned_loss=0.04162, over 16874.00 frames. ], tot_loss[loss=0.1898, simple_loss=0.2814, pruned_loss=0.04905, over 32643108.69 frames. ], batch size: 72, lr: 1.63e-03, grad_scale: 16.0 2023-10-13 08:23:30,652 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.61 vs. limit=15.0 2023-10-13 08:23:32,103 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1325804.6666666667, ans=0.125 2023-10-13 08:23:46,936 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 08:23:54,464 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.07 vs. limit=12.0 2023-10-13 08:23:56,763 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1325898.0, ans=0.1 2023-10-13 08:23:57,754 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1325898.0, ans=0.125 2023-10-13 08:23:59,084 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1325898.0, ans=0.035 2023-10-13 08:24:09,564 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 08:24:11,187 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1325991.3333333333, ans=0.125 2023-10-13 08:24:12,125 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1325991.3333333333, ans=0.0 2023-10-13 08:24:31,996 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1326038.0, ans=0.1 2023-10-13 08:24:51,168 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.74 vs. 
limit=15.0 2023-10-13 08:24:53,332 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1326131.3333333333, ans=0.125 2023-10-13 08:24:53,448 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1326131.3333333333, ans=0.05 2023-10-13 08:24:54,271 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 08:24:54,876 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.573e+02 1.861e+02 1.969e+02 2.170e+02 2.905e+02, threshold=3.937e+02, percent-clipped=0.0 2023-10-13 08:24:59,099 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1326178.0, ans=0.125 2023-10-13 08:25:02,788 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.77 vs. limit=15.0 2023-10-13 08:25:13,498 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1326224.6666666667, ans=0.0 2023-10-13 08:26:13,365 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.62 vs. limit=15.0 2023-10-13 08:26:56,431 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.435e+02 1.704e+02 1.879e+02 2.084e+02 3.045e+02, threshold=3.759e+02, percent-clipped=0.0 2023-10-13 08:26:58,288 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1326598.0, ans=0.125 2023-10-13 08:27:03,549 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1326644.6666666667, ans=0.125 2023-10-13 08:27:06,642 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1326644.6666666667, ans=0.0 2023-10-13 08:27:14,092 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1326691.3333333333, ans=0.0 2023-10-13 08:27:17,413 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1326691.3333333333, ans=0.125 2023-10-13 08:27:40,108 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1326784.6666666667, ans=0.125 2023-10-13 08:27:52,230 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1326831.3333333333, ans=0.125 2023-10-13 08:27:53,700 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.29 vs. 
limit=15.0 2023-10-13 08:28:11,652 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1326924.6666666667, ans=0.0 2023-10-13 08:28:13,610 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=1326924.6666666667, ans=0.5 2023-10-13 08:28:40,272 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1327018.0, ans=0.1 2023-10-13 08:28:51,080 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.532e+02 1.766e+02 1.894e+02 2.103e+02 2.601e+02, threshold=3.788e+02, percent-clipped=0.0 2023-10-13 08:28:59,744 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1327111.3333333333, ans=0.1 2023-10-13 08:29:04,394 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.46 vs. limit=15.0 2023-10-13 08:29:17,481 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1327158.0, ans=0.125 2023-10-13 08:29:32,282 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1327251.3333333333, ans=0.125 2023-10-13 08:30:06,456 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.45 vs. limit=15.0 2023-10-13 08:30:24,498 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1327438.0, ans=0.125 2023-10-13 08:30:43,210 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1327531.3333333333, ans=0.0 2023-10-13 08:30:48,800 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.437e+02 1.759e+02 1.910e+02 2.121e+02 3.702e+02, threshold=3.821e+02, percent-clipped=0.0 2023-10-13 08:30:53,681 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1327578.0, ans=0.07 2023-10-13 08:30:54,522 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1327578.0, ans=0.125 2023-10-13 08:31:12,406 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.96 vs. limit=15.0 2023-10-13 08:32:11,014 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1327858.0, ans=0.2 2023-10-13 08:32:19,788 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1327904.6666666667, ans=0.125 2023-10-13 08:32:29,119 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.43 vs. 
limit=6.0 2023-10-13 08:32:42,021 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1327998.0, ans=0.09899494936611666 2023-10-13 08:32:42,709 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.500e+02 1.889e+02 2.027e+02 2.303e+02 3.111e+02, threshold=4.054e+02, percent-clipped=0.0 2023-10-13 08:32:47,598 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1328044.6666666667, ans=0.125 2023-10-13 08:33:04,361 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.67 vs. limit=15.0 2023-10-13 08:33:09,326 INFO [train.py:1031] (3/4) Epoch 21, batch 11500, loss[loss=0.1973, simple_loss=0.2905, pruned_loss=0.05204, over 16574.00 frames. ], tot_loss[loss=0.1897, simple_loss=0.2813, pruned_loss=0.04904, over 32671291.32 frames. ], batch size: 61, lr: 1.63e-03, grad_scale: 16.0 2023-10-13 08:33:35,313 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1328231.3333333333, ans=0.125 2023-10-13 08:33:43,591 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1328278.0, ans=0.0 2023-10-13 08:33:44,976 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1328278.0, ans=0.1 2023-10-13 08:33:48,182 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1328278.0, ans=0.2 2023-10-13 08:33:59,418 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1328324.6666666667, ans=0.0 2023-10-13 08:34:05,001 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1328324.6666666667, ans=0.2 2023-10-13 08:34:07,998 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1328324.6666666667, ans=0.015 2023-10-13 08:34:13,268 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=1328371.3333333333, ans=10.0 2023-10-13 08:34:16,071 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1328371.3333333333, ans=0.09899494936611666 2023-10-13 08:34:19,178 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.33 vs. limit=15.0 2023-10-13 08:34:40,393 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.405e+02 1.799e+02 1.897e+02 2.104e+02 2.824e+02, threshold=3.795e+02, percent-clipped=0.0 2023-10-13 08:35:00,669 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.48 vs. limit=15.0 2023-10-13 08:35:05,329 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1328558.0, ans=0.5 2023-10-13 08:35:12,313 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.91 vs. 
limit=22.5 2023-10-13 08:35:26,461 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=1328651.3333333333, ans=10.0 2023-10-13 08:35:33,351 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1328698.0, ans=0.0 2023-10-13 08:35:36,945 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1328698.0, ans=0.2 2023-10-13 08:35:41,413 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1328698.0, ans=0.1 2023-10-13 08:35:46,504 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1328744.6666666667, ans=0.0 2023-10-13 08:35:50,115 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1328744.6666666667, ans=0.125 2023-10-13 08:35:51,024 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1328744.6666666667, ans=0.125 2023-10-13 08:36:18,888 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1328884.6666666667, ans=0.125 2023-10-13 08:36:31,081 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1328931.3333333333, ans=0.125 2023-10-13 08:36:34,604 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.463e+02 1.707e+02 1.849e+02 2.120e+02 2.681e+02, threshold=3.698e+02, percent-clipped=0.0 2023-10-13 08:37:04,581 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.52 vs. limit=22.5 2023-10-13 08:37:08,024 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1329071.3333333333, ans=0.04949747468305833 2023-10-13 08:38:01,982 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=1329304.6666666667, ans=15.0 2023-10-13 08:38:11,090 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1329304.6666666667, ans=0.125 2023-10-13 08:38:12,396 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1329351.3333333333, ans=0.2 2023-10-13 08:38:35,661 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.461e+02 1.768e+02 1.933e+02 2.170e+02 3.062e+02, threshold=3.865e+02, percent-clipped=0.0 2023-10-13 08:38:46,359 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.37 vs. 
limit=12.0 2023-10-13 08:38:51,081 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1329491.3333333333, ans=0.1 2023-10-13 08:38:53,075 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1329491.3333333333, ans=0.125 2023-10-13 08:38:54,691 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1329491.3333333333, ans=0.04949747468305833 2023-10-13 08:39:00,780 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1329491.3333333333, ans=0.125 2023-10-13 08:39:03,971 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.69 vs. limit=15.0 2023-10-13 08:39:25,350 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=1329631.3333333333, ans=15.0 2023-10-13 08:39:47,388 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-13 08:39:47,539 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.90 vs. limit=12.0 2023-10-13 08:39:48,301 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1329724.6666666667, ans=0.125 2023-10-13 08:40:10,248 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1329771.3333333333, ans=0.2 2023-10-13 08:40:14,019 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1329818.0, ans=0.125 2023-10-13 08:40:21,540 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1329864.6666666667, ans=0.125 2023-10-13 08:40:21,603 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1329864.6666666667, ans=0.2 2023-10-13 08:40:29,224 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.282e+02 1.718e+02 1.898e+02 2.079e+02 2.900e+02, threshold=3.797e+02, percent-clipped=0.0 2023-10-13 08:40:31,174 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1329864.6666666667, ans=0.1 2023-10-13 08:40:50,332 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.44 vs. 
limit=15.0 2023-10-13 08:40:55,603 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1329958.0, ans=0.1 2023-10-13 08:41:03,197 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1330004.6666666667, ans=0.0 2023-10-13 08:41:10,484 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1330051.3333333333, ans=0.125 2023-10-13 08:41:18,157 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1330051.3333333333, ans=0.125 2023-10-13 08:41:21,296 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.28 vs. limit=15.0 2023-10-13 08:41:32,697 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1330098.0, ans=0.125 2023-10-13 08:41:33,669 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1330144.6666666667, ans=0.0 2023-10-13 08:41:56,179 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.72 vs. limit=15.0 2023-10-13 08:41:57,506 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.45 vs. limit=6.0 2023-10-13 08:42:25,663 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.46 vs. limit=15.0 2023-10-13 08:42:28,264 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.451e+02 1.776e+02 1.933e+02 2.227e+02 3.065e+02, threshold=3.867e+02, percent-clipped=0.0 2023-10-13 08:42:28,493 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1330331.3333333333, ans=0.0 2023-10-13 08:42:54,944 INFO [train.py:1031] (3/4) Epoch 21, batch 12000, loss[loss=0.1861, simple_loss=0.2719, pruned_loss=0.05012, over 15577.00 frames. ], tot_loss[loss=0.1894, simple_loss=0.2813, pruned_loss=0.04877, over 32698794.00 frames. ], batch size: 35, lr: 1.63e-03, grad_scale: 32.0 2023-10-13 08:43:16,045 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1330518.0, ans=0.2 2023-10-13 08:43:27,914 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.09 vs. 
limit=15.0 2023-10-13 08:43:37,224 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1330611.3333333333, ans=0.125 2023-10-13 08:43:47,923 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1330658.0, ans=0.2 2023-10-13 08:43:50,394 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1330658.0, ans=0.125 2023-10-13 08:43:55,341 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1330704.6666666667, ans=0.125 2023-10-13 08:43:55,703 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.88 vs. limit=15.0 2023-10-13 08:43:57,333 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1330704.6666666667, ans=0.0 2023-10-13 08:44:05,419 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1330751.3333333333, ans=0.2 2023-10-13 08:44:06,471 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1330751.3333333333, ans=0.125 2023-10-13 08:44:17,325 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1330798.0, ans=0.2 2023-10-13 08:44:24,385 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.423e+02 1.675e+02 1.778e+02 1.966e+02 2.595e+02, threshold=3.555e+02, percent-clipped=0.0 2023-10-13 08:44:32,486 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.77 vs. limit=15.0 2023-10-13 08:44:48,405 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1330938.0, ans=0.125 2023-10-13 08:44:59,448 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1330984.6666666667, ans=0.125 2023-10-13 08:45:16,929 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1331031.3333333333, ans=0.2 2023-10-13 08:45:40,497 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1331124.6666666667, ans=0.0 2023-10-13 08:45:48,243 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1331171.3333333333, ans=0.125 2023-10-13 08:45:49,515 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.57 vs. 
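The periodic train.py:1031 lines break the pruned-RNN-T objective into `simple_loss` (from the cheap joiner used to locate the pruning bounds) and `pruned_loss` (from the full joiner evaluated only inside those bounds); `tot_loss` is the same combination averaged over every frame since the stats were last reset, which is why its frame count climbs past 32 million. A sketch of that bookkeeping, assuming a fixed 0.5/0.5 combination (hypothetical; icefall actually anneals the weights during warm-up):

```python
# Hypothetical fixed weights; the real combination changes with warm-up.
SIMPLE_LOSS_SCALE = 0.5

def combine(simple_loss: float, pruned_loss: float) -> float:
    return SIMPLE_LOSS_SCALE * simple_loss + (1 - SIMPLE_LOSS_SCALE) * pruned_loss

class RunningPerFrameLoss:
    """Frame-weighted running average, like the tot_loss=[...] fields."""
    def __init__(self):
        self.loss_sum = 0.0
        self.frames = 0

    def update(self, per_frame_loss: float, num_frames: int):
        self.loss_sum += per_frame_loss * num_frames
        self.frames += num_frames

    @property
    def value(self) -> float:
        return self.loss_sum / max(self.frames, 1)

tot = RunningPerFrameLoss()
# simple/pruned values taken from the batch-12000 line above.
tot.update(combine(0.2719, 0.05012), 15577)
print(f"tot_loss={tot.value:.4f} over {tot.frames} frames")
```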
limit=15.0 2023-10-13 08:46:10,845 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.463e+02 1.771e+02 1.958e+02 2.123e+02 2.822e+02, threshold=3.917e+02, percent-clipped=0.0 2023-10-13 08:46:36,531 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1331404.6666666667, ans=0.0 2023-10-13 08:46:42,089 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1331404.6666666667, ans=0.125 2023-10-13 08:46:43,120 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1331404.6666666667, ans=0.0 2023-10-13 08:46:50,822 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1331451.3333333333, ans=0.035 2023-10-13 08:46:55,752 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1331498.0, ans=0.125 2023-10-13 08:47:03,054 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=1331498.0, ans=0.025 2023-10-13 08:47:18,453 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1331544.6666666667, ans=0.125 2023-10-13 08:47:34,975 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1331638.0, ans=0.1 2023-10-13 08:47:35,158 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.86 vs. limit=6.0 2023-10-13 08:47:40,333 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1331638.0, ans=0.2 2023-10-13 08:48:00,466 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.935e+02 2.218e+02 2.487e+02 3.290e+02, threshold=4.437e+02, percent-clipped=0.0 2023-10-13 08:48:32,213 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.13 vs. 
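In the optim.py:471 lines the optimizer logs five quantiles (min, 25%, median, 75%, max) of recently observed gradient norms. Note that the reported `threshold` consistently equals `Clipping_scale` times the median (for example 2.0 x 1.958e+02 = 3.917e+02 in the entry above), and `percent-clipped` is the share of recent batches whose norm exceeded it. A sketch of such median-relative clipping; the real bookkeeping inside the optimizer is more elaborate:

```python
import torch
import torch.nn as nn

def clip_by_median(model: nn.Module, recent_norms: list,
                   clipping_scale: float = 2.0):
    """Clip gradients at clipping_scale * median of recent grad norms."""
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    total_norm = torch.norm(torch.stack([g.norm() for g in grads])).item()
    recent_norms.append(total_norm)
    del recent_norms[:-128]                  # keep a sliding window
    median = sorted(recent_norms)[len(recent_norms) // 2]
    threshold = clipping_scale * median
    if total_norm > threshold:               # would count into percent-clipped
        for g in grads:
            g.mul_(threshold / total_norm)
    return total_norm, threshold

model = nn.Linear(8, 8)
model(torch.randn(4, 8)).sum().backward()
history: list = []
print(clip_by_median(model, history))
```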
limit=15.0 2023-10-13 08:49:03,316 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1332011.3333333333, ans=0.125 2023-10-13 08:49:15,986 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1332058.0, ans=0.035 2023-10-13 08:49:17,013 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1332058.0, ans=0.0 2023-10-13 08:49:22,285 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1332104.6666666667, ans=0.2 2023-10-13 08:49:33,749 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1332151.3333333333, ans=0.125 2023-10-13 08:49:55,449 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.508e+02 1.775e+02 1.948e+02 2.123e+02 2.809e+02, threshold=3.897e+02, percent-clipped=0.0 2023-10-13 08:49:58,500 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1332244.6666666667, ans=0.125 2023-10-13 08:50:01,120 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1332244.6666666667, ans=0.125 2023-10-13 08:50:19,487 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer_ff2.min_abs, batch_count=1332338.0, ans=0.1 2023-10-13 08:50:32,096 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.31 vs. limit=15.0 2023-10-13 08:50:37,563 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1332384.6666666667, ans=0.0 2023-10-13 08:50:38,689 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.57 vs. limit=10.0 2023-10-13 08:50:43,216 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1332431.3333333333, ans=0.125 2023-10-13 08:50:46,410 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.33 vs. limit=12.0 2023-10-13 08:51:02,066 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.10 vs. 
limit=12.0 2023-10-13 08:51:32,933 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1332618.0, ans=0.125 2023-10-13 08:51:35,976 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1332618.0, ans=0.0 2023-10-13 08:51:40,538 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1332664.6666666667, ans=0.125 2023-10-13 08:51:48,019 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.433e+02 1.801e+02 1.924e+02 2.122e+02 2.835e+02, threshold=3.848e+02, percent-clipped=0.0 2023-10-13 08:51:52,315 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=1332711.3333333333, ans=0.05 2023-10-13 08:52:12,166 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1332758.0, ans=0.0 2023-10-13 08:52:15,116 INFO [train.py:1031] (3/4) Epoch 21, batch 12500, loss[loss=0.2086, simple_loss=0.2996, pruned_loss=0.0588, over 16959.00 frames. ], tot_loss[loss=0.1889, simple_loss=0.2807, pruned_loss=0.04852, over 32737925.34 frames. ], batch size: 123, lr: 1.63e-03, grad_scale: 32.0 2023-10-13 08:52:15,759 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.73 vs. limit=6.0 2023-10-13 08:52:29,523 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1332851.3333333333, ans=0.1 2023-10-13 08:52:47,044 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1332944.6666666667, ans=0.125 2023-10-13 08:53:38,108 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.380e+02 1.728e+02 1.904e+02 2.146e+02 2.661e+02, threshold=3.808e+02, percent-clipped=0.0 2023-10-13 08:53:49,107 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1333178.0, ans=0.125 2023-10-13 08:53:55,282 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-13 08:54:14,363 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1333318.0, ans=0.125 2023-10-13 08:54:16,300 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1333318.0, ans=0.2 2023-10-13 08:54:27,967 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.67 vs. limit=22.5 2023-10-13 08:54:30,115 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1333364.6666666667, ans=0.2 2023-10-13 08:54:33,955 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1333364.6666666667, ans=0.125 2023-10-13 08:54:36,947 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=11.04 vs. 
limit=15.0 2023-10-13 08:55:16,946 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1333551.3333333333, ans=0.125 2023-10-13 08:55:30,330 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1333598.0, ans=0.125 2023-10-13 08:55:33,934 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.466e+02 1.744e+02 1.975e+02 2.227e+02 3.194e+02, threshold=3.951e+02, percent-clipped=0.0 2023-10-13 08:55:47,054 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=1333691.3333333333, ans=0.025 2023-10-13 08:56:11,216 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1333784.6666666667, ans=0.125 2023-10-13 08:56:14,999 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=13.39 vs. limit=15.0 2023-10-13 08:56:40,024 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1333924.6666666667, ans=0.125 2023-10-13 08:56:40,096 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1333924.6666666667, ans=0.125 2023-10-13 08:56:40,989 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1333924.6666666667, ans=0.125 2023-10-13 08:56:54,004 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1333971.3333333333, ans=0.0 2023-10-13 08:57:22,647 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.492e+02 1.792e+02 1.940e+02 2.189e+02 3.673e+02, threshold=3.881e+02, percent-clipped=0.0 2023-10-13 08:57:29,652 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1334111.3333333333, ans=0.125 2023-10-13 08:57:46,740 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1334204.6666666667, ans=0.0 2023-10-13 08:57:47,644 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1334204.6666666667, ans=0.125 2023-10-13 08:57:48,552 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1334204.6666666667, ans=0.125 2023-10-13 08:57:59,430 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1334251.3333333333, ans=0.125 2023-10-13 08:58:04,105 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1334251.3333333333, ans=0.125 2023-10-13 08:58:10,992 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1334298.0, ans=0.125 2023-10-13 08:58:17,106 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1334344.6666666667, ans=0.125 2023-10-13 08:58:17,205 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1334344.6666666667, ans=0.0 2023-10-13 08:59:15,454 
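The many balancer entries record the constraint sets of icefall's activation balancers: with probability `prob` (typically 0.125 here) a balancer nudges per-channel statistics into a target range, for example the fraction of positive activations into [min_positive, max_positive] and the mean absolute value into [min_abs, max_abs]. A sketch of the statistics those bounds apply to (the real module applies the nudge through a custom autograd function on the backward pass, omitted here):

```python
import torch

def balancer_violations(x: torch.Tensor,
                        min_positive=0.05, max_positive=0.95,
                        min_abs=0.2, max_abs=10.0):
    """Count channels violating balancer bounds. x: (num_frames, channels)."""
    pos_frac = (x > 0).float().mean(dim=0)   # fraction of positive values
    mean_abs = x.abs().mean(dim=0)           # mean absolute activation
    return {
        "too_negative": (pos_frac < min_positive).sum().item(),
        "too_positive": (pos_frac > max_positive).sum().item(),
        "too_small":    (mean_abs < min_abs).sum().item(),
        "too_large":    (mean_abs > max_abs).sum().item(),
    }

acts = torch.randn(2000, 256)
print(balancer_violations(acts))   # a healthy layer reports mostly zeros
```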
INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.386e+02 1.772e+02 1.938e+02 2.134e+02 3.184e+02, threshold=3.875e+02, percent-clipped=0.0 2023-10-13 08:59:29,738 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1334624.6666666667, ans=0.0 2023-10-13 08:59:32,205 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1334624.6666666667, ans=0.125 2023-10-13 08:59:36,459 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1334624.6666666667, ans=0.0 2023-10-13 08:59:41,803 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1334671.3333333333, ans=0.125 2023-10-13 08:59:56,141 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1334718.0, ans=0.09899494936611666 2023-10-13 09:00:03,110 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.33 vs. limit=15.0 2023-10-13 09:00:03,850 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1334764.6666666667, ans=0.125 2023-10-13 09:00:16,684 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1334811.3333333333, ans=0.125 2023-10-13 09:00:20,433 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1334811.3333333333, ans=0.125 2023-10-13 09:00:25,529 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.09 vs. limit=22.5 2023-10-13 09:00:28,458 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1334858.0, ans=0.2 2023-10-13 09:00:30,742 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1334858.0, ans=0.0 2023-10-13 09:00:30,777 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1334858.0, ans=0.0 2023-10-13 09:00:33,231 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1334858.0, ans=0.125 2023-10-13 09:00:35,149 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1334904.6666666667, ans=0.2 2023-10-13 09:00:40,684 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1334904.6666666667, ans=0.0 2023-10-13 09:00:52,876 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1334951.3333333333, ans=0.0 2023-10-13 09:00:53,172 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.78 vs. 
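The bypass.skip_rate, conv_skip_rate, attention_skip_rate and ff*_skip_rate entries belong to Zipformer's stochastic-depth-style regularization: whole sub-modules (attention, convolution, feed-forward) are skipped with a scheduled probability during training, and each layer's learned bypass scale is clamped from below by scale_min. Most skip rates read 0.0 at this point because their schedules have decayed this deep into training. A sketch of a skippable residual sub-module under those assumptions; the module layout is illustrative:

```python
import torch
import torch.nn as nn

class SkippableResidual(nn.Module):
    """Residual sub-module skipped with prob `skip_rate` during training,
    with a learned per-channel bypass scale clamped to [scale_min, 1]."""
    def __init__(self, module: nn.Module, channels: int,
                 skip_rate: float = 0.05, scale_min: float = 0.2):
        super().__init__()
        self.module = module
        self.skip_rate = skip_rate
        self.scale_min = scale_min
        self.scale = nn.Parameter(torch.full((channels,), 0.5))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training and torch.rand(()) < self.skip_rate:
            return x                              # whole module skipped
        s = self.scale.clamp(self.scale_min, 1.0)
        return x + s * self.module(x)             # scaled residual branch

layer = SkippableResidual(nn.Linear(256, 256), channels=256)
print(layer(torch.randn(10, 256)).shape)
```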
limit=15.0 2023-10-13 09:00:57,837 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1334998.0, ans=0.0 2023-10-13 09:01:05,611 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.420e+02 1.725e+02 1.911e+02 2.174e+02 2.756e+02, threshold=3.821e+02, percent-clipped=0.0 2023-10-13 09:01:07,375 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1335044.6666666667, ans=0.0 2023-10-13 09:01:29,505 INFO [train.py:1031] (3/4) Epoch 21, batch 13000, loss[loss=0.1847, simple_loss=0.2752, pruned_loss=0.04706, over 16889.00 frames. ], tot_loss[loss=0.1894, simple_loss=0.2814, pruned_loss=0.04868, over 32756077.40 frames. ], batch size: 130, lr: 1.63e-03, grad_scale: 16.0 2023-10-13 09:01:31,558 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.55 vs. limit=15.0 2023-10-13 09:01:37,402 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1335138.0, ans=0.0 2023-10-13 09:01:38,487 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1335138.0, ans=0.125 2023-10-13 09:01:57,882 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1335231.3333333333, ans=0.07 2023-10-13 09:02:13,149 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1335278.0, ans=0.125 2023-10-13 09:02:22,416 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1335324.6666666667, ans=0.125 2023-10-13 09:02:42,092 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1335371.3333333333, ans=0.125 2023-10-13 09:02:58,415 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=5.99 vs. limit=12.0 2023-10-13 09:03:10,922 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.430e+02 1.775e+02 1.946e+02 2.256e+02 3.466e+02, threshold=3.893e+02, percent-clipped=0.0 2023-10-13 09:03:29,603 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1335558.0, ans=0.1 2023-10-13 09:03:52,790 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1335651.3333333333, ans=0.125 2023-10-13 09:03:54,411 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1335698.0, ans=0.0 2023-10-13 09:04:19,342 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.15 vs. 
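The grad_scale field in the train.py:1031 lines (32.0 in earlier entries, 16.0 here) is the dynamic loss scale of fp16 mixed-precision training: it halves whenever inf/nan gradients are detected and grows back after a run of clean steps, which explains the 32.0 to 16.0 drop. A minimal sketch of the standard PyTorch pattern this corresponds to (the actual loop also aborts if the scale collapses):

```python
import torch
import torch.nn as nn

model = nn.Linear(80, 500).cuda() if torch.cuda.is_available() else nn.Linear(80, 500)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler(enabled=torch.cuda.is_available())

x = torch.randn(8, 80, device=next(model.parameters()).device)
with torch.cuda.amp.autocast(enabled=torch.cuda.is_available()):
    loss = model(x).pow(2).mean()

scaler.scale(loss).backward()   # backward runs on the scaled loss
scaler.step(opt)                # unscales grads, skips the step on inf/nan
scaler.update()                 # adjusts the scale (the logged grad_scale)
print("grad_scale:", scaler.get_scale())
```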
limit=15.0 2023-10-13 09:04:29,263 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 09:04:41,852 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 09:05:01,310 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1335931.3333333333, ans=0.2 2023-10-13 09:05:04,078 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.559e+02 1.782e+02 1.898e+02 2.182e+02 3.086e+02, threshold=3.796e+02, percent-clipped=0.0 2023-10-13 09:05:10,805 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.19 vs. limit=15.0 2023-10-13 09:05:14,813 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.71 vs. limit=10.0 2023-10-13 09:05:20,998 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1336024.6666666667, ans=0.125 2023-10-13 09:05:23,179 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 09:05:24,340 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1336024.6666666667, ans=0.0 2023-10-13 09:05:33,445 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.64 vs. limit=15.0 2023-10-13 09:05:38,795 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1336071.3333333333, ans=0.125 2023-10-13 09:05:59,204 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1336164.6666666667, ans=0.0 2023-10-13 09:06:01,213 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1336164.6666666667, ans=0.0 2023-10-13 09:06:22,159 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.17 vs. limit=15.0 2023-10-13 09:06:46,320 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1336398.0, ans=0.125 2023-10-13 09:06:57,578 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.436e+02 1.722e+02 1.970e+02 2.159e+02 2.816e+02, threshold=3.940e+02, percent-clipped=0.0 2023-10-13 09:07:11,644 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1336491.3333333333, ans=0.1 2023-10-13 09:07:16,438 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1336491.3333333333, ans=0.125 2023-10-13 09:07:33,498 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1336584.6666666667, ans=0.04949747468305833 2023-10-13 09:07:50,393 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.79 vs. 
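The WithLoss entries (scaling.py:1069) show auxiliary penalties attached directly to intermediate tensors, here various self_attn_weights; loss-sum=0.000e+00 means the attached penalty currently contributes nothing. A sketch of the general mechanism, an identity-valued op that injects an extra gradient term on the backward pass; this is illustrative, not icefall's exact autograd function:

```python
import torch

class WithAuxLoss(torch.autograd.Function):
    """Identity in the forward pass; the backward pass adds the gradient
    of aux_loss w.r.t. x on top of the incoming gradient, so the penalty
    shapes training without changing the forward computation."""
    @staticmethod
    def forward(ctx, x, aux_loss_fn):
        ctx.aux_loss_fn = aux_loss_fn
        ctx.save_for_backward(x)
        return x.clone()

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        with torch.enable_grad():
            xg = x.detach().requires_grad_(True)
            aux = ctx.aux_loss_fn(xg)
            (aux_grad,) = torch.autograd.grad(aux, xg)
        return grad_out + aux_grad, None

attn = torch.softmax(torch.randn(4, 16), dim=-1).requires_grad_(True)
# Hypothetical penalty: discourage attention weights saturating near 1.
out = WithAuxLoss.apply(attn, lambda w: (w - 0.9).clamp(min=0).sum())
out.sum().backward()
print(attn.grad.abs().sum())    # gradient now includes the penalty term
```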
limit=15.0 2023-10-13 09:07:57,437 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.22 vs. limit=15.0 2023-10-13 09:08:03,412 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 09:08:32,009 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1336818.0, ans=0.125 2023-10-13 09:08:44,633 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.25 vs. limit=15.0 2023-10-13 09:08:45,813 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1336864.6666666667, ans=0.125 2023-10-13 09:08:45,853 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1336864.6666666667, ans=0.125 2023-10-13 09:08:46,505 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.502e+02 1.782e+02 1.955e+02 2.143e+02 3.027e+02, threshold=3.910e+02, percent-clipped=0.0 2023-10-13 09:08:56,094 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=16.30 vs. limit=22.5 2023-10-13 09:09:00,424 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1336958.0, ans=0.125 2023-10-13 09:09:17,153 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1337004.6666666667, ans=0.125 2023-10-13 09:09:41,982 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1337098.0, ans=0.125 2023-10-13 09:10:24,720 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1337284.6666666667, ans=0.0 2023-10-13 09:10:25,606 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1337284.6666666667, ans=0.125 2023-10-13 09:10:38,384 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1337331.3333333333, ans=0.2 2023-10-13 09:10:41,057 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.507e+02 1.786e+02 1.935e+02 2.197e+02 6.537e+02, threshold=3.870e+02, percent-clipped=1.0 2023-10-13 09:10:41,682 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.31 vs. limit=15.0 2023-10-13 09:11:02,398 INFO [train.py:1031] (3/4) Epoch 21, batch 13500, loss[loss=0.1899, simple_loss=0.2878, pruned_loss=0.04598, over 16858.00 frames. ], tot_loss[loss=0.1889, simple_loss=0.2808, pruned_loss=0.04854, over 32766555.81 frames. ], batch size: 146, lr: 1.63e-03, grad_scale: 16.0 2023-10-13 09:11:04,927 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.35 vs. 
limit=15.0 2023-10-13 09:11:26,566 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1337564.6666666667, ans=0.125 2023-10-13 09:11:36,826 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1337611.3333333333, ans=0.2 2023-10-13 09:11:44,495 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.00 vs. limit=10.0 2023-10-13 09:11:46,301 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1337658.0, ans=0.125 2023-10-13 09:12:11,640 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.58 vs. limit=15.0 2023-10-13 09:12:34,399 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.483e+02 1.740e+02 1.957e+02 2.153e+02 3.011e+02, threshold=3.914e+02, percent-clipped=0.0 2023-10-13 09:12:45,132 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1337844.6666666667, ans=0.1 2023-10-13 09:12:53,753 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 09:12:57,015 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1337938.0, ans=0.125 2023-10-13 09:13:33,032 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1338078.0, ans=0.1 2023-10-13 09:13:37,098 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.31 vs. limit=15.0 2023-10-13 09:13:44,649 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1338171.3333333333, ans=0.0 2023-10-13 09:14:20,697 INFO [train.py:1031] (3/4) Epoch 22, batch 0, loss[loss=0.1645, simple_loss=0.2585, pruned_loss=0.03523, over 16894.00 frames. ], tot_loss[loss=0.1645, simple_loss=0.2585, pruned_loss=0.03523, over 16894.00 frames. ], batch size: 130, lr: 1.59e-03, grad_scale: 32.0 2023-10-13 09:14:20,698 INFO [train.py:1054] (3/4) Computing validation loss 2023-10-13 09:14:28,991 INFO [train.py:1063] (3/4) Epoch 22, validation: loss=0.2133, simple_loss=0.3005, pruned_loss=0.06308, over 1020973.00 frames. 2023-10-13 09:14:28,992 INFO [train.py:1064] (3/4) Maximum memory allocated so far is 16891MB 2023-10-13 09:14:39,564 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1338241.3333333333, ans=0.0 2023-10-13 09:14:53,679 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.28 vs. 
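At each epoch boundary the loop pauses to compute a validation loss over a fixed dev set (the same 1020973 frames every time) and reports the CUDA high-water mark, which has grown to 16891MB by this point in the run. A sketch of those two measurements; `compute_loss` and `dev_loader` below stand in for the run's actual helpers:

```python
import torch

@torch.no_grad()
def validate(model, dev_loader, compute_loss):
    """Average per-frame validation loss, as in the 'validation:' lines."""
    model.eval()
    loss_sum, frames = 0.0, 0
    for batch in dev_loader:
        loss, num_frames = compute_loss(model, batch)  # per-frame loss, frames
        loss_sum += loss * num_frames
        frames += num_frames
    model.train()
    return loss_sum / max(frames, 1)

# valid_loss = validate(model, dev_loader, compute_loss)
if torch.cuda.is_available():
    mb = torch.cuda.max_memory_allocated() // (1024 * 1024)
    print(f"Maximum memory allocated so far is {mb}MB")
```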
limit=15.0 2023-10-13 09:14:54,227 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1338288.0, ans=0.1 2023-10-13 09:14:58,825 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.435e+02 1.740e+02 1.941e+02 2.149e+02 4.129e+02, threshold=3.883e+02, percent-clipped=1.0 2023-10-13 09:15:06,578 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1338334.6666666667, ans=0.125 2023-10-13 09:15:06,663 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1338334.6666666667, ans=0.1 2023-10-13 09:15:15,722 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.84 vs. limit=15.0 2023-10-13 09:15:37,907 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1338428.0, ans=0.0 2023-10-13 09:16:10,726 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1338568.0, ans=0.125 2023-10-13 09:16:30,548 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1338661.3333333333, ans=0.125 2023-10-13 09:16:33,421 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=10.73 vs. limit=22.5 2023-10-13 09:16:43,399 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1338708.0, ans=0.2 2023-10-13 09:16:44,222 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1338708.0, ans=0.0 2023-10-13 09:16:51,861 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.92 vs. limit=12.0 2023-10-13 09:16:52,627 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1338754.6666666667, ans=0.2 2023-10-13 09:16:54,818 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.85 vs. 
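The learning rate drops from 1.63e-03 inside epoch 21 to 1.59e-03 at the start of epoch 22 because the scheduler decays in both optimizer steps and epochs. A sketch of an Eden-style rule consistent with that behaviour, using the run's base_lr=0.045, lr_batches=7500 and lr_epochs=1; warm-up handling and the exact step accounting of the real scheduler are omitted, so treat the printed value as approximate:

```python
def eden_lr(base_lr: float, step: int, epoch: float,
            lr_batches: float = 7500.0, lr_epochs: float = 1.0) -> float:
    """Eden-style schedule: polynomial decay in steps and in epochs."""
    step_factor = ((step ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * step_factor * epoch_factor

# Roughly 13.6k optimizer steps per epoch in this run; at the start of
# epoch 22 this lands within a few percent of the logged lr of 1.59e-03,
# and the epoch term explains the small drop from 1.63e-03 in epoch 21.
print(f"{eden_lr(0.045, step=21 * 13600, epoch=22):.2e}")
```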
limit=22.5 2023-10-13 09:16:59,260 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.391e+02 1.741e+02 1.848e+02 2.031e+02 2.604e+02, threshold=3.696e+02, percent-clipped=0.0 2023-10-13 09:17:16,673 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1338848.0, ans=0.125 2023-10-13 09:17:17,973 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1338848.0, ans=0.2 2023-10-13 09:17:23,648 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1338848.0, ans=0.125 2023-10-13 09:17:23,725 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1338848.0, ans=0.0 2023-10-13 09:18:00,494 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1338988.0, ans=0.0 2023-10-13 09:18:08,290 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.34 vs. limit=6.0 2023-10-13 09:18:47,438 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.70 vs. limit=15.0 2023-10-13 09:18:50,523 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1339221.3333333333, ans=0.125 2023-10-13 09:18:52,810 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=9.94 vs. limit=22.5 2023-10-13 09:18:56,707 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.465e+02 1.751e+02 1.884e+02 2.078e+02 2.746e+02, threshold=3.769e+02, percent-clipped=0.0 2023-10-13 09:19:13,760 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.49 vs. limit=15.0 2023-10-13 09:19:17,645 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1339314.6666666667, ans=0.0 2023-10-13 09:19:22,681 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.22 vs. limit=15.0 2023-10-13 09:19:27,339 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1339314.6666666667, ans=0.125 2023-10-13 09:19:29,659 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.35 vs. 
limit=15.0 2023-10-13 09:19:44,191 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1339408.0, ans=0.125 2023-10-13 09:19:49,662 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1339408.0, ans=0.125 2023-10-13 09:20:09,794 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1339501.3333333333, ans=0.0 2023-10-13 09:20:31,548 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1339594.6666666667, ans=0.125 2023-10-13 09:20:42,408 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1339641.3333333333, ans=0.125 2023-10-13 09:20:42,521 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1339641.3333333333, ans=0.2 2023-10-13 09:20:53,494 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.792e+02 1.962e+02 2.143e+02 3.064e+02, threshold=3.924e+02, percent-clipped=0.0 2023-10-13 09:21:05,541 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.82 vs. limit=15.0 2023-10-13 09:21:17,129 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1339781.3333333333, ans=0.2 2023-10-13 09:21:18,978 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1339828.0, ans=0.2 2023-10-13 09:21:19,724 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 09:21:50,888 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1339921.3333333333, ans=0.1 2023-10-13 09:21:56,093 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1339968.0, ans=0.1 2023-10-13 09:22:07,241 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1340014.6666666667, ans=0.125 2023-10-13 09:22:25,637 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.05 vs. 
limit=15.0 2023-10-13 09:22:29,438 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1340108.0, ans=0.125 2023-10-13 09:22:35,496 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1340108.0, ans=0.0 2023-10-13 09:22:36,588 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1340108.0, ans=0.125 2023-10-13 09:22:39,647 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1340154.6666666667, ans=0.04949747468305833 2023-10-13 09:22:45,426 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1340154.6666666667, ans=0.04949747468305833 2023-10-13 09:22:46,181 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.438e+02 1.780e+02 1.950e+02 2.263e+02 2.888e+02, threshold=3.900e+02, percent-clipped=0.0 2023-10-13 09:22:48,355 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1340154.6666666667, ans=0.015 2023-10-13 09:22:50,345 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1340201.3333333333, ans=0.0 2023-10-13 09:22:58,493 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1340201.3333333333, ans=0.125 2023-10-13 09:23:06,827 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1340248.0, ans=0.0 2023-10-13 09:23:35,992 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=1340341.3333333333, ans=0.5 2023-10-13 09:23:37,041 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1340341.3333333333, ans=0.125 2023-10-13 09:23:52,880 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1340434.6666666667, ans=0.2 2023-10-13 09:23:54,544 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.41 vs. limit=22.5 2023-10-13 09:24:14,574 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1340481.3333333333, ans=0.0 2023-10-13 09:24:15,726 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1340481.3333333333, ans=0.1 2023-10-13 09:24:17,495 INFO [train.py:1031] (3/4) Epoch 22, batch 500, loss[loss=0.1815, simple_loss=0.2661, pruned_loss=0.04843, over 15248.00 frames. ], tot_loss[loss=0.1893, simple_loss=0.2811, pruned_loss=0.0487, over 7302019.12 frames. ], batch size: 35, lr: 1.59e-03, grad_scale: 32.0 2023-10-13 09:24:25,281 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 09:24:27,729 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 09:24:38,994 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.68 vs. 
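The `batch size` field in the train.py:1031 lines swings from 35 to 296 because batches are packed by total audio duration rather than by a fixed number of cuts: buckets of long utterances yield few cuts per batch, buckets of short ones many. A toy sketch of duration-capped batching in that spirit (the real DynamicBucketingSampler in lhotse shuffles within duration buckets and is considerably smarter):

```python
import random

def duration_batches(durations, max_duration=700.0):
    """Greedy duration-capped batching: visit cuts in a bucket-like order
    and cut a batch whenever the running duration would exceed the cap."""
    batch, total, out = [], 0.0, []
    for d in sorted(durations):          # sorting stands in for bucketing
        if total + d > max_duration and batch:
            out.append(batch)
            batch, total = [], 0.0
        batch.append(d)
        total += d
    if batch:
        out.append(batch)
    return out

cuts = [random.uniform(1.0, 20.0) for _ in range(500)]
batches = duration_batches(cuts)
print([len(b) for b in batches[:5]])     # short cuts -> large batch sizes
```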
limit=6.0 2023-10-13 09:24:48,370 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.454e+02 1.736e+02 1.964e+02 2.240e+02 2.918e+02, threshold=3.927e+02, percent-clipped=0.0 2023-10-13 09:24:51,212 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1340668.0, ans=0.0 2023-10-13 09:24:51,270 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1340668.0, ans=0.07 2023-10-13 09:24:54,501 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=11.80 vs. limit=22.5 2023-10-13 09:24:55,690 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.39 vs. limit=15.0 2023-10-13 09:25:37,967 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1340854.6666666667, ans=0.125 2023-10-13 09:25:53,465 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.14 vs. limit=6.0 2023-10-13 09:26:12,727 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1340994.6666666667, ans=0.125 2023-10-13 09:26:42,925 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.373e+02 1.790e+02 1.970e+02 2.216e+02 4.103e+02, threshold=3.940e+02, percent-clipped=1.0 2023-10-13 09:27:06,156 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1341181.3333333333, ans=0.125 2023-10-13 09:27:09,775 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1341228.0, ans=0.025 2023-10-13 09:27:11,617 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1341228.0, ans=0.0 2023-10-13 09:27:14,313 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1341228.0, ans=0.125 2023-10-13 09:27:22,389 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1341274.6666666667, ans=0.125 2023-10-13 09:27:39,668 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.37 vs. limit=15.0 2023-10-13 09:27:43,849 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1341368.0, ans=0.1 2023-10-13 09:27:46,865 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1341368.0, ans=0.1 2023-10-13 09:28:02,019 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.35 vs. limit=15.0 2023-10-13 09:28:04,057 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.70 vs. limit=15.0 2023-10-13 09:28:04,162 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.98 vs. 
limit=15.0 2023-10-13 09:28:10,312 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1341461.3333333333, ans=0.125 2023-10-13 09:28:35,866 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1341554.6666666667, ans=0.125 2023-10-13 09:28:36,440 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.498e+02 1.833e+02 1.988e+02 2.219e+02 3.076e+02, threshold=3.977e+02, percent-clipped=0.0 2023-10-13 09:29:00,627 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.73 vs. limit=22.5 2023-10-13 09:29:06,984 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1341694.6666666667, ans=0.0 2023-10-13 09:29:09,181 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.33 vs. limit=6.0 2023-10-13 09:29:10,860 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1341694.6666666667, ans=0.125 2023-10-13 09:29:44,320 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1341834.6666666667, ans=0.2 2023-10-13 09:29:45,289 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1341834.6666666667, ans=0.125 2023-10-13 09:30:08,265 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1341928.0, ans=0.04949747468305833 2023-10-13 09:30:11,873 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1341928.0, ans=0.125 2023-10-13 09:30:15,649 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1341928.0, ans=0.07 2023-10-13 09:30:41,131 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.578e+02 1.806e+02 2.005e+02 2.271e+02 2.925e+02, threshold=4.010e+02, percent-clipped=0.0 2023-10-13 09:30:42,431 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 09:31:09,664 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1342114.6666666667, ans=0.1 2023-10-13 09:31:37,874 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=1342254.6666666667, ans=0.95 2023-10-13 09:31:53,065 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.68 vs. limit=22.5 2023-10-13 09:31:54,667 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1342301.3333333333, ans=0.0 2023-10-13 09:32:25,970 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.47 vs. 
limit=6.0 2023-10-13 09:32:26,461 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1342441.3333333333, ans=0.125 2023-10-13 09:32:30,877 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1342441.3333333333, ans=0.1 2023-10-13 09:32:33,759 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.81 vs. limit=15.0 2023-10-13 09:32:46,780 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.578e+02 1.836e+02 2.021e+02 2.191e+02 3.176e+02, threshold=4.042e+02, percent-clipped=0.0 2023-10-13 09:32:58,143 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_na.min_abs, batch_count=1342534.6666666667, ans=0.02 2023-10-13 09:33:08,076 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1342581.3333333333, ans=0.0 2023-10-13 09:33:20,974 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1342628.0, ans=0.0 2023-10-13 09:33:48,097 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1342721.3333333333, ans=0.04949747468305833 2023-10-13 09:34:06,874 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1342814.6666666667, ans=0.2 2023-10-13 09:34:14,727 INFO [train.py:1031] (3/4) Epoch 22, batch 1000, loss[loss=0.1936, simple_loss=0.2757, pruned_loss=0.05581, over 15135.00 frames. ], tot_loss[loss=0.19, simple_loss=0.2819, pruned_loss=0.04906, over 12954593.63 frames. ], batch size: 35, lr: 1.58e-03, grad_scale: 16.0 2023-10-13 09:34:17,155 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.27 vs. limit=15.0 2023-10-13 09:34:19,664 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1342861.3333333333, ans=0.0 2023-10-13 09:34:21,603 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1342861.3333333333, ans=0.0 2023-10-13 09:34:29,843 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1342908.0, ans=0.125 2023-10-13 09:34:35,415 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1342954.6666666667, ans=0.1 2023-10-13 09:34:43,108 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.420e+02 1.681e+02 1.793e+02 1.972e+02 2.395e+02, threshold=3.586e+02, percent-clipped=0.0 2023-10-13 09:34:47,446 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.37 vs. 
limit=15.0 2023-10-13 09:34:51,045 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1343001.3333333333, ans=0.1 2023-10-13 09:34:51,247 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1343001.3333333333, ans=0.2 2023-10-13 09:34:55,683 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1343048.0, ans=10.0 2023-10-13 09:35:37,972 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1343234.6666666667, ans=0.125 2023-10-13 09:36:02,365 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1343328.0, ans=0.125 2023-10-13 09:36:05,420 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=1343328.0, ans=0.95 2023-10-13 09:36:11,821 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1343328.0, ans=0.0 2023-10-13 09:36:15,591 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1343374.6666666667, ans=0.125 2023-10-13 09:36:39,426 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.430e+02 1.824e+02 2.025e+02 2.383e+02 3.530e+02, threshold=4.050e+02, percent-clipped=0.0 2023-10-13 09:36:40,209 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1343421.3333333333, ans=0.125 2023-10-13 09:36:46,566 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1343468.0, ans=0.125 2023-10-13 09:37:22,910 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1343608.0, ans=0.1 2023-10-13 09:37:40,283 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1343654.6666666667, ans=0.125 2023-10-13 09:37:55,148 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1343701.3333333333, ans=0.2 2023-10-13 09:37:55,271 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1343701.3333333333, ans=0.0 2023-10-13 09:38:12,921 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1343748.0, ans=0.1 2023-10-13 09:38:13,806 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1343794.6666666667, ans=0.125 2023-10-13 09:38:45,711 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.35 vs. 
limit=15.0 2023-10-13 09:38:48,436 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.362e+02 1.670e+02 1.795e+02 1.958e+02 3.422e+02, threshold=3.591e+02, percent-clipped=0.0 2023-10-13 09:39:04,173 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1343981.3333333333, ans=0.0 2023-10-13 09:39:15,857 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1343981.3333333333, ans=0.2 2023-10-13 09:39:24,109 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1344028.0, ans=0.2 2023-10-13 09:39:28,206 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.13 vs. limit=15.0 2023-10-13 09:39:33,116 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1344074.6666666667, ans=0.2 2023-10-13 09:39:42,512 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1344121.3333333333, ans=0.125 2023-10-13 09:39:52,118 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1344168.0, ans=0.0 2023-10-13 09:40:21,545 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1344261.3333333333, ans=0.0 2023-10-13 09:40:34,796 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1344308.0, ans=0.125 2023-10-13 09:40:39,438 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1344354.6666666667, ans=0.0 2023-10-13 09:40:47,295 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.338e+02 1.854e+02 2.061e+02 2.387e+02 3.180e+02, threshold=4.123e+02, percent-clipped=0.0 2023-10-13 09:40:55,468 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1344401.3333333333, ans=0.04949747468305833 2023-10-13 09:41:08,573 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1344448.0, ans=0.1 2023-10-13 09:41:12,729 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1344494.6666666667, ans=0.125 2023-10-13 09:42:12,937 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1344728.0, ans=0.0 2023-10-13 09:42:36,722 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.81 vs. limit=15.0 2023-10-13 09:42:45,882 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.407e+02 1.788e+02 1.936e+02 2.112e+02 2.690e+02, threshold=3.872e+02, percent-clipped=0.0 2023-10-13 09:43:36,492 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.03 vs. 
limit=15.0 2023-10-13 09:43:38,138 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1345054.6666666667, ans=0.07 2023-10-13 09:44:01,534 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1345148.0, ans=0.0 2023-10-13 09:44:11,837 INFO [train.py:1031] (3/4) Epoch 22, batch 1500, loss[loss=0.2073, simple_loss=0.2916, pruned_loss=0.06148, over 15976.00 frames. ], tot_loss[loss=0.1886, simple_loss=0.2803, pruned_loss=0.04843, over 17372785.29 frames. ], batch size: 296, lr: 1.58e-03, grad_scale: 32.0 2023-10-13 09:44:36,920 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1345288.0, ans=0.125 2023-10-13 09:44:38,980 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1345288.0, ans=0.2 2023-10-13 09:44:45,635 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.490e+02 1.786e+02 2.057e+02 2.352e+02 3.280e+02, threshold=4.113e+02, percent-clipped=0.0 2023-10-13 09:44:49,360 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.67 vs. limit=15.0 2023-10-13 09:45:02,662 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1345381.3333333333, ans=0.0 2023-10-13 09:45:05,756 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1345381.3333333333, ans=0.0 2023-10-13 09:45:11,714 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1345381.3333333333, ans=0.125 2023-10-13 09:45:15,205 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.22 vs. 
limit=15.0 2023-10-13 09:45:59,775 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1345568.0, ans=0.0 2023-10-13 09:46:17,010 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1345661.3333333333, ans=0.0 2023-10-13 09:46:22,622 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 09:46:26,312 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1345708.0, ans=0.2 2023-10-13 09:46:46,119 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.483e+02 1.771e+02 1.893e+02 2.111e+02 2.847e+02, threshold=3.786e+02, percent-clipped=0.0 2023-10-13 09:46:47,004 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1345754.6666666667, ans=0.1 2023-10-13 09:46:51,310 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1345801.3333333333, ans=0.0 2023-10-13 09:46:58,908 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1345801.3333333333, ans=0.1 2023-10-13 09:47:00,505 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1345801.3333333333, ans=0.2 2023-10-13 09:47:15,256 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1345848.0, ans=0.0 2023-10-13 09:47:16,423 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1345894.6666666667, ans=0.125 2023-10-13 09:47:19,184 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=1345894.6666666667, ans=0.5 2023-10-13 09:47:46,897 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1345988.0, ans=0.0 2023-10-13 09:48:05,443 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1346081.3333333333, ans=0.125 2023-10-13 09:48:10,819 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1346081.3333333333, ans=0.125 2023-10-13 09:48:18,244 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=1346128.0, ans=10.0 2023-10-13 09:48:20,199 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1346128.0, ans=0.1 2023-10-13 09:48:25,100 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1346128.0, ans=0.125 2023-10-13 09:48:28,710 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.32 vs. 
limit=22.5 2023-10-13 09:48:31,496 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1346174.6666666667, ans=0.0 2023-10-13 09:48:36,153 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=1346174.6666666667, ans=15.0 2023-10-13 09:48:45,758 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.525e+02 1.817e+02 1.987e+02 2.211e+02 3.152e+02, threshold=3.975e+02, percent-clipped=0.0 2023-10-13 09:48:48,493 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1346268.0, ans=0.1 2023-10-13 09:48:53,106 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1346268.0, ans=0.0 2023-10-13 09:49:00,816 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1346314.6666666667, ans=0.125 2023-10-13 09:49:06,226 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1346314.6666666667, ans=0.0 2023-10-13 09:49:06,987 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1346314.6666666667, ans=0.125 2023-10-13 09:49:24,778 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1346408.0, ans=0.0 2023-10-13 09:49:46,038 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1346501.3333333333, ans=0.09899494936611666 2023-10-13 09:49:55,491 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.18 vs. limit=22.5 2023-10-13 09:50:01,911 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1346548.0, ans=0.0 2023-10-13 09:50:03,455 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.88 vs. limit=15.0 2023-10-13 09:50:18,264 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1346594.6666666667, ans=0.125 2023-10-13 09:50:48,816 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.457e+02 1.781e+02 1.957e+02 2.197e+02 3.261e+02, threshold=3.914e+02, percent-clipped=0.0 2023-10-13 09:51:26,166 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1346828.0, ans=0.0 2023-10-13 09:51:37,932 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.79 vs. 
limit=6.0 2023-10-13 09:51:53,513 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1346968.0, ans=0.0 2023-10-13 09:52:01,777 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1347014.6666666667, ans=0.0 2023-10-13 09:52:21,743 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1347061.3333333333, ans=0.5 2023-10-13 09:52:37,518 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1347154.6666666667, ans=0.0 2023-10-13 09:52:38,642 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1347154.6666666667, ans=0.0 2023-10-13 09:52:47,824 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.579e+02 1.855e+02 2.017e+02 2.260e+02 4.426e+02, threshold=4.034e+02, percent-clipped=1.0 2023-10-13 09:53:17,897 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1347248.0, ans=0.0 2023-10-13 09:53:51,492 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1347341.3333333333, ans=0.125 2023-10-13 09:53:51,818 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.62 vs. limit=15.0 2023-10-13 09:53:58,150 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1347388.0, ans=10.0 2023-10-13 09:53:59,372 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1347388.0, ans=0.125 2023-10-13 09:54:03,464 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.10 vs. limit=15.0 2023-10-13 09:54:08,594 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1347434.6666666667, ans=0.2 2023-10-13 09:54:15,927 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1347434.6666666667, ans=0.125 2023-10-13 09:54:31,524 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1347481.3333333333, ans=0.1 2023-10-13 09:54:36,078 INFO [train.py:1031] (3/4) Epoch 22, batch 2000, loss[loss=0.1975, simple_loss=0.293, pruned_loss=0.05097, over 16856.00 frames. ], tot_loss[loss=0.1891, simple_loss=0.281, pruned_loss=0.04855, over 20796435.69 frames. 
], batch size: 175, lr: 1.58e-03, grad_scale: 32.0 2023-10-13 09:55:19,300 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.431e+02 1.717e+02 1.874e+02 2.067e+02 3.063e+02, threshold=3.749e+02, percent-clipped=0.0 2023-10-13 09:55:25,991 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1347668.0, ans=0.125 2023-10-13 09:55:28,098 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1347668.0, ans=0.125 2023-10-13 09:55:49,750 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=14.15 vs. limit=15.0 2023-10-13 09:56:32,438 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1347901.3333333333, ans=0.125 2023-10-13 09:56:45,256 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1347948.0, ans=0.1 2023-10-13 09:56:46,339 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1347948.0, ans=0.125 2023-10-13 09:56:49,736 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1347948.0, ans=0.1 2023-10-13 09:56:53,155 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=3.60 vs. limit=15.0 2023-10-13 09:57:09,199 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1347994.6666666667, ans=10.0 2023-10-13 09:57:46,126 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1348088.0, ans=0.0 2023-10-13 09:57:51,764 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.440e+02 1.725e+02 1.936e+02 2.204e+02 2.901e+02, threshold=3.872e+02, percent-clipped=0.0 2023-10-13 09:57:57,883 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.92 vs. limit=15.0 2023-10-13 09:58:14,463 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1348181.3333333333, ans=0.1 2023-10-13 09:58:14,847 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.35 vs. limit=15.0 2023-10-13 09:58:26,309 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.17 vs. 
limit=15.0 2023-10-13 09:59:01,088 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1348368.0, ans=0.125 2023-10-13 09:59:12,149 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1348368.0, ans=0.125 2023-10-13 09:59:13,251 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1348368.0, ans=0.125 2023-10-13 09:59:44,017 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 09:59:50,227 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1348554.6666666667, ans=0.2 2023-10-13 09:59:56,556 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1348554.6666666667, ans=0.125 2023-10-13 10:00:00,890 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.497e+02 1.904e+02 2.114e+02 2.411e+02 3.158e+02, threshold=4.228e+02, percent-clipped=0.0 2023-10-13 10:00:22,253 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1348648.0, ans=0.125 2023-10-13 10:00:44,389 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1348741.3333333333, ans=0.04949747468305833 2023-10-13 10:00:58,467 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.90 vs. limit=12.0 2023-10-13 10:01:02,065 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1348834.6666666667, ans=0.125 2023-10-13 10:01:17,839 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.60 vs. limit=15.0 2023-10-13 10:01:25,308 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.56 vs. limit=12.0 2023-10-13 10:01:51,415 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1349021.3333333333, ans=0.2 2023-10-13 10:02:00,110 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.531e+02 1.823e+02 1.945e+02 2.144e+02 2.654e+02, threshold=3.890e+02, percent-clipped=0.0 2023-10-13 10:02:05,031 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1349068.0, ans=0.95 2023-10-13 10:02:17,548 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1349114.6666666667, ans=0.0 2023-10-13 10:02:17,586 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1349114.6666666667, ans=0.125 2023-10-13 10:02:19,930 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.52 vs. 
limit=15.0 2023-10-13 10:02:29,082 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1349161.3333333333, ans=0.125 2023-10-13 10:02:43,056 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1349208.0, ans=0.025 2023-10-13 10:03:32,338 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1349394.6666666667, ans=0.125 2023-10-13 10:03:33,404 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1349441.3333333333, ans=0.0 2023-10-13 10:03:38,099 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1349441.3333333333, ans=0.125 2023-10-13 10:03:46,083 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=14.81 vs. limit=22.5 2023-10-13 10:03:46,755 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1349488.0, ans=0.0 2023-10-13 10:03:56,645 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.516e+02 1.827e+02 1.997e+02 2.193e+02 3.325e+02, threshold=3.995e+02, percent-clipped=0.0 2023-10-13 10:04:02,070 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1349534.6666666667, ans=0.0 2023-10-13 10:04:07,391 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1349581.3333333333, ans=0.125 2023-10-13 10:04:10,900 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1349581.3333333333, ans=0.125 2023-10-13 10:04:30,369 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1349674.6666666667, ans=0.1 2023-10-13 10:04:31,702 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.whiten.whitening_limit, batch_count=1349674.6666666667, ans=12.0 2023-10-13 10:05:01,971 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1349768.0, ans=10.0 2023-10-13 10:05:06,105 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1349814.6666666667, ans=0.125 2023-10-13 10:05:09,973 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.59 vs. limit=22.5 2023-10-13 10:05:11,878 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1349814.6666666667, ans=0.0 2023-10-13 10:05:15,071 INFO [train.py:1031] (3/4) Epoch 22, batch 2500, loss[loss=0.1874, simple_loss=0.2765, pruned_loss=0.04918, over 16599.00 frames. ], tot_loss[loss=0.1891, simple_loss=0.281, pruned_loss=0.04859, over 23463253.08 frames. ], batch size: 241, lr: 1.58e-03, grad_scale: 32.0 2023-10-13 10:05:26,124 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.51 vs. 
limit=6.0 2023-10-13 10:05:29,150 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1349908.0, ans=0.0 2023-10-13 10:05:48,607 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1349954.6666666667, ans=0.125 2023-10-13 10:05:50,258 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.314e+02 1.795e+02 1.993e+02 2.166e+02 3.553e+02, threshold=3.986e+02, percent-clipped=0.0 2023-10-13 10:05:53,857 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.75 vs. limit=22.5 2023-10-13 10:06:07,014 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1350048.0, ans=0.0 2023-10-13 10:06:49,716 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1350234.6666666667, ans=0.0 2023-10-13 10:06:51,235 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1350234.6666666667, ans=0.125 2023-10-13 10:07:04,071 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1350281.3333333333, ans=0.125 2023-10-13 10:07:12,065 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.95 vs. limit=15.0 2023-10-13 10:07:23,005 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.13 vs. limit=6.0 2023-10-13 10:07:25,348 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1350374.6666666667, ans=0.0 2023-10-13 10:07:35,616 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1350421.3333333333, ans=0.125 2023-10-13 10:07:45,214 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.514e+02 1.743e+02 1.893e+02 2.099e+02 2.638e+02, threshold=3.786e+02, percent-clipped=0.0 2023-10-13 10:07:49,298 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1350468.0, ans=0.125 2023-10-13 10:07:51,233 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1350468.0, ans=0.125 2023-10-13 10:08:02,057 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1350514.6666666667, ans=0.125 2023-10-13 10:08:17,037 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.43 vs. 
limit=10.0 2023-10-13 10:08:32,043 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1350654.6666666667, ans=0.125 2023-10-13 10:08:34,886 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1350654.6666666667, ans=0.0 2023-10-13 10:09:05,559 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1350748.0, ans=0.0 2023-10-13 10:09:10,241 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1350794.6666666667, ans=0.125 2023-10-13 10:09:26,678 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1350841.3333333333, ans=0.0 2023-10-13 10:09:30,245 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1350841.3333333333, ans=0.125 2023-10-13 10:09:34,075 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1350841.3333333333, ans=0.125 2023-10-13 10:09:35,065 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1350841.3333333333, ans=0.0 2023-10-13 10:09:50,869 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.561e+02 1.796e+02 1.918e+02 2.152e+02 2.768e+02, threshold=3.836e+02, percent-clipped=0.0 2023-10-13 10:10:14,132 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1350981.3333333333, ans=0.0 2023-10-13 10:10:25,292 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.46 vs. 
limit=15.0 2023-10-13 10:10:36,685 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1351074.6666666667, ans=0.0 2023-10-13 10:10:56,454 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1351121.3333333333, ans=0.015 2023-10-13 10:10:57,722 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1351121.3333333333, ans=0.125 2023-10-13 10:11:33,658 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1351261.3333333333, ans=0.0 2023-10-13 10:11:53,567 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1351354.6666666667, ans=0.125 2023-10-13 10:12:05,454 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.457e+02 1.770e+02 1.961e+02 2.104e+02 2.795e+02, threshold=3.922e+02, percent-clipped=0.0 2023-10-13 10:12:17,054 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1351401.3333333333, ans=0.0 2023-10-13 10:12:28,696 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.whiten.whitening_limit, batch_count=1351448.0, ans=12.0 2023-10-13 10:12:30,376 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1351448.0, ans=0.125 2023-10-13 10:12:31,209 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1351448.0, ans=0.0 2023-10-13 10:12:37,612 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1351494.6666666667, ans=0.2 2023-10-13 10:12:39,522 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1351494.6666666667, ans=0.2 2023-10-13 10:13:00,803 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1351541.3333333333, ans=0.125 2023-10-13 10:13:03,422 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1351588.0, ans=0.1 2023-10-13 10:13:10,487 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1351588.0, ans=0.125 2023-10-13 10:13:15,064 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1351634.6666666667, ans=0.0 2023-10-13 10:13:27,399 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1351681.3333333333, ans=0.0 2023-10-13 10:14:04,021 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.98 vs. 
limit=22.5 2023-10-13 10:14:13,107 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1351821.3333333333, ans=0.0 2023-10-13 10:14:19,138 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.508e+02 1.747e+02 1.900e+02 2.100e+02 2.578e+02, threshold=3.801e+02, percent-clipped=0.0 2023-10-13 10:14:48,786 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=1351961.3333333333, ans=0.05 2023-10-13 10:14:53,689 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.59 vs. limit=15.0 2023-10-13 10:14:54,279 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 10:15:13,192 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1352054.6666666667, ans=0.125 2023-10-13 10:15:34,264 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1352148.0, ans=0.1 2023-10-13 10:15:38,681 INFO [train.py:1031] (3/4) Epoch 22, batch 3000, loss[loss=0.1986, simple_loss=0.2908, pruned_loss=0.05322, over 16922.00 frames. ], tot_loss[loss=0.1889, simple_loss=0.2805, pruned_loss=0.04862, over 25557362.58 frames. ], batch size: 156, lr: 1.58e-03, grad_scale: 32.0 2023-10-13 10:15:38,965 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1352194.6666666667, ans=0.1 2023-10-13 10:15:51,114 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.20 vs. limit=15.0 2023-10-13 10:16:12,349 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.366e+02 1.777e+02 1.997e+02 2.241e+02 2.858e+02, threshold=3.994e+02, percent-clipped=0.0 2023-10-13 10:16:28,890 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.18 vs. limit=6.0 2023-10-13 10:16:33,429 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1352428.0, ans=0.125 2023-10-13 10:16:37,906 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.49 vs. limit=15.0 2023-10-13 10:17:06,535 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1352521.3333333333, ans=0.0 2023-10-13 10:17:09,036 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1352521.3333333333, ans=0.0 2023-10-13 10:17:29,318 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=1352614.6666666667, ans=10.0 2023-10-13 10:17:33,300 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1352661.3333333333, ans=0.0 2023-10-13 10:17:38,965 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=7.84 vs. 
limit=15.0 2023-10-13 10:17:40,019 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1352661.3333333333, ans=0.5 2023-10-13 10:18:08,372 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1352754.6666666667, ans=0.0 2023-10-13 10:18:15,503 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.512e+02 1.771e+02 1.929e+02 2.164e+02 3.098e+02, threshold=3.858e+02, percent-clipped=0.0 2023-10-13 10:18:15,920 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1352801.3333333333, ans=0.125 2023-10-13 10:18:34,049 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1352848.0, ans=0.0 2023-10-13 10:18:46,889 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1352941.3333333333, ans=0.0 2023-10-13 10:18:50,105 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1352941.3333333333, ans=0.125 2023-10-13 10:18:50,400 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.97 vs. limit=15.0 2023-10-13 10:18:51,558 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.94 vs. limit=10.0 2023-10-13 10:19:07,779 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1352988.0, ans=0.1 2023-10-13 10:19:08,879 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1352988.0, ans=0.0 2023-10-13 10:19:08,898 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1352988.0, ans=0.2 2023-10-13 10:19:11,794 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1353034.6666666667, ans=0.125 2023-10-13 10:19:18,642 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1353034.6666666667, ans=0.125 2023-10-13 10:19:24,882 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.27 vs. limit=6.0 2023-10-13 10:19:30,032 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1353081.3333333333, ans=0.125 2023-10-13 10:19:50,671 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1353174.6666666667, ans=0.1 2023-10-13 10:20:06,759 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1353221.3333333333, ans=0.2 2023-10-13 10:20:07,090 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=11.82 vs. 
limit=15.0 2023-10-13 10:20:14,103 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.410e+02 1.752e+02 1.912e+02 2.105e+02 3.038e+02, threshold=3.825e+02, percent-clipped=0.0 2023-10-13 10:20:31,292 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1353314.6666666667, ans=0.125 2023-10-13 10:20:40,126 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1353361.3333333333, ans=0.0 2023-10-13 10:21:23,708 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1353501.3333333333, ans=0.0 2023-10-13 10:21:26,329 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1353501.3333333333, ans=0.125 2023-10-13 10:21:43,861 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1353594.6666666667, ans=0.0 2023-10-13 10:21:59,181 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1353641.3333333333, ans=0.125 2023-10-13 10:22:00,197 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer_na.min_abs, batch_count=1353641.3333333333, ans=0.02 2023-10-13 10:22:00,272 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1353641.3333333333, ans=0.125 2023-10-13 10:22:14,606 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.59 vs. limit=10.0 2023-10-13 10:22:17,661 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1353688.0, ans=0.1 2023-10-13 10:22:22,316 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1353734.6666666667, ans=0.2 2023-10-13 10:22:22,525 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=4.67 vs. 
limit=12.0 2023-10-13 10:22:24,905 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.596e+02 1.799e+02 1.983e+02 2.286e+02 3.832e+02, threshold=3.967e+02, percent-clipped=1.0 2023-10-13 10:22:37,763 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1353781.3333333333, ans=0.125 2023-10-13 10:22:42,524 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1353781.3333333333, ans=0.125 2023-10-13 10:22:45,922 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1353828.0, ans=0.125 2023-10-13 10:22:46,847 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 10:22:55,335 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1353874.6666666667, ans=0.2 2023-10-13 10:22:57,045 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1353874.6666666667, ans=0.125 2023-10-13 10:22:58,787 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1353874.6666666667, ans=0.1 2023-10-13 10:23:07,314 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1353874.6666666667, ans=0.1 2023-10-13 10:23:40,867 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1353968.0, ans=0.2 2023-10-13 10:24:11,269 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1354108.0, ans=0.125 2023-10-13 10:24:15,041 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1354108.0, ans=0.125 2023-10-13 10:24:19,614 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1354108.0, ans=0.125 2023-10-13 10:24:34,360 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.521e+02 1.789e+02 1.973e+02 2.192e+02 3.127e+02, threshold=3.947e+02, percent-clipped=0.0 2023-10-13 10:24:40,841 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1354201.3333333333, ans=0.125 2023-10-13 10:24:52,127 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.17 vs. limit=12.0 2023-10-13 10:25:19,399 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.10 vs. limit=6.0 2023-10-13 10:25:54,238 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1354528.0, ans=0.125 2023-10-13 10:25:54,908 INFO [train.py:1031] (3/4) Epoch 22, batch 3500, loss[loss=0.2015, simple_loss=0.2929, pruned_loss=0.05504, over 16939.00 frames. ], tot_loss[loss=0.189, simple_loss=0.2804, pruned_loss=0.04879, over 27155286.36 frames. 
], batch size: 123, lr: 1.58e-03, grad_scale: 16.0 2023-10-13 10:26:14,714 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1354574.6666666667, ans=0.0 2023-10-13 10:26:24,241 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1354621.3333333333, ans=0.125 2023-10-13 10:26:26,584 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1354621.3333333333, ans=0.125 2023-10-13 10:26:31,719 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.525e+02 1.775e+02 1.923e+02 2.142e+02 3.464e+02, threshold=3.847e+02, percent-clipped=0.0 2023-10-13 10:26:47,975 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1354668.0, ans=0.2 2023-10-13 10:27:18,860 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1354808.0, ans=0.0 2023-10-13 10:27:51,100 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.83 vs. limit=6.0 2023-10-13 10:27:56,587 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1354901.3333333333, ans=0.1 2023-10-13 10:28:05,416 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.78 vs. limit=15.0 2023-10-13 10:28:25,370 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 10:28:35,162 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1355041.3333333333, ans=0.0 2023-10-13 10:28:51,420 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1355134.6666666667, ans=0.1 2023-10-13 10:28:53,118 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.496e+02 1.838e+02 1.983e+02 2.163e+02 2.976e+02, threshold=3.967e+02, percent-clipped=0.0 2023-10-13 10:28:53,633 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1355134.6666666667, ans=0.0 2023-10-13 10:28:58,285 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1355134.6666666667, ans=0.125 2023-10-13 10:29:06,528 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1355181.3333333333, ans=0.125 2023-10-13 10:29:29,815 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.93 vs. 
limit=15.0 2023-10-13 10:29:40,011 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1355321.3333333333, ans=0.125 2023-10-13 10:29:43,024 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1355321.3333333333, ans=0.125 2023-10-13 10:29:45,788 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1355321.3333333333, ans=0.125 2023-10-13 10:29:50,675 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1355368.0, ans=0.0 2023-10-13 10:29:50,937 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.83 vs. limit=15.0 2023-10-13 10:29:52,354 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1355368.0, ans=0.0 2023-10-13 10:30:01,550 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1355414.6666666667, ans=0.2 2023-10-13 10:30:07,283 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1355414.6666666667, ans=0.1 2023-10-13 10:30:19,049 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1355461.3333333333, ans=0.0 2023-10-13 10:30:20,881 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1355461.3333333333, ans=0.125 2023-10-13 10:30:26,260 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1355508.0, ans=0.125 2023-10-13 10:30:36,425 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1355508.0, ans=0.1 2023-10-13 10:30:59,589 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.487e+02 1.717e+02 1.888e+02 2.057e+02 3.044e+02, threshold=3.776e+02, percent-clipped=0.0 2023-10-13 10:31:05,519 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1355601.3333333333, ans=0.2 2023-10-13 10:31:07,550 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1355648.0, ans=0.1 2023-10-13 10:31:30,234 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=3.83 vs. limit=10.0 2023-10-13 10:31:36,094 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1355741.3333333333, ans=0.125 2023-10-13 10:31:53,013 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1355788.0, ans=0.125 2023-10-13 10:32:40,003 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1355974.6666666667, ans=0.125 2023-10-13 10:32:54,827 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.74 vs. 
limit=15.0 2023-10-13 10:33:01,616 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.502e+02 1.893e+02 2.171e+02 2.320e+02 3.701e+02, threshold=4.342e+02, percent-clipped=0.0 2023-10-13 10:33:06,210 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1356068.0, ans=0.0 2023-10-13 10:33:17,788 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1356114.6666666667, ans=0.125 2023-10-13 10:33:30,761 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1356161.3333333333, ans=0.1 2023-10-13 10:33:55,175 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1356254.6666666667, ans=0.2 2023-10-13 10:34:08,588 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1356348.0, ans=0.125 2023-10-13 10:34:54,652 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1356534.6666666667, ans=0.125 2023-10-13 10:34:58,250 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.468e+02 1.721e+02 1.916e+02 2.047e+02 2.709e+02, threshold=3.833e+02, percent-clipped=0.0 2023-10-13 10:35:13,303 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1356581.3333333333, ans=0.1 2023-10-13 10:35:35,318 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1356674.6666666667, ans=0.2 2023-10-13 10:35:39,380 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1356674.6666666667, ans=0.125 2023-10-13 10:35:47,440 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1356721.3333333333, ans=0.0 2023-10-13 10:35:51,802 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1356721.3333333333, ans=0.125 2023-10-13 10:36:02,389 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1356768.0, ans=0.0 2023-10-13 10:36:08,230 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1356814.6666666667, ans=0.125 2023-10-13 10:36:10,201 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1356814.6666666667, ans=0.0 2023-10-13 10:36:16,442 INFO [train.py:1031] (3/4) Epoch 22, batch 4000, loss[loss=0.1797, simple_loss=0.275, pruned_loss=0.04218, over 16931.00 frames. ], tot_loss[loss=0.189, simple_loss=0.2801, pruned_loss=0.04889, over 28375258.38 frames. 
], batch size: 82, lr: 1.58e-03, grad_scale: 32.0 2023-10-13 10:36:20,679 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1356861.3333333333, ans=0.125 2023-10-13 10:36:26,857 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1356861.3333333333, ans=0.2 2023-10-13 10:36:58,555 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.524e+02 1.864e+02 2.093e+02 2.403e+02 3.130e+02, threshold=4.186e+02, percent-clipped=0.0 2023-10-13 10:37:03,044 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=1357001.3333333333, ans=10.0 2023-10-13 10:37:10,050 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1357048.0, ans=0.1 2023-10-13 10:37:27,852 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-13 10:37:28,872 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1357141.3333333333, ans=0.125 2023-10-13 10:37:31,725 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 10:37:55,478 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1357234.6666666667, ans=0.0 2023-10-13 10:37:57,865 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1357234.6666666667, ans=0.0 2023-10-13 10:38:05,937 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1357281.3333333333, ans=0.125 2023-10-13 10:38:09,679 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1357281.3333333333, ans=0.125 2023-10-13 10:38:10,557 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1357281.3333333333, ans=0.2 2023-10-13 10:38:14,689 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1357328.0, ans=0.0 2023-10-13 10:38:20,181 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1357328.0, ans=0.1 2023-10-13 10:38:39,909 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.46 vs. limit=6.0 2023-10-13 10:38:41,593 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1357421.3333333333, ans=0.125 2023-10-13 10:38:41,857 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.86 vs. 
limit=15.0 2023-10-13 10:38:48,076 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1357421.3333333333, ans=0.125 2023-10-13 10:38:57,316 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.526e+02 1.799e+02 1.969e+02 2.264e+02 3.202e+02, threshold=3.938e+02, percent-clipped=0.0 2023-10-13 10:39:02,137 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1357468.0, ans=0.125 2023-10-13 10:39:07,330 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.40 vs. limit=15.0 2023-10-13 10:39:21,427 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.62 vs. limit=6.0 2023-10-13 10:39:22,977 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1357561.3333333333, ans=0.125 2023-10-13 10:39:41,971 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.52 vs. limit=15.0 2023-10-13 10:39:58,935 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1357654.6666666667, ans=0.125 2023-10-13 10:39:59,023 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1357654.6666666667, ans=0.125 2023-10-13 10:40:13,655 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1357701.3333333333, ans=0.09899494936611666 2023-10-13 10:40:22,178 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1357748.0, ans=0.1 2023-10-13 10:40:28,911 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1357794.6666666667, ans=0.09899494936611666 2023-10-13 10:40:31,471 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1357794.6666666667, ans=0.125 2023-10-13 10:40:42,412 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1357841.3333333333, ans=0.1 2023-10-13 10:40:44,322 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.92 vs. 
limit=6.0 2023-10-13 10:40:59,252 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1357888.0, ans=0.125 2023-10-13 10:41:04,186 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 10:41:14,347 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.491e+02 1.757e+02 1.965e+02 2.112e+02 2.978e+02, threshold=3.930e+02, percent-clipped=0.0 2023-10-13 10:41:18,016 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1357934.6666666667, ans=0.0 2023-10-13 10:41:23,873 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1357981.3333333333, ans=0.0 2023-10-13 10:41:40,953 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1358028.0, ans=0.125 2023-10-13 10:41:56,950 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1358121.3333333333, ans=0.1 2023-10-13 10:41:59,612 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1358121.3333333333, ans=0.125 2023-10-13 10:42:03,168 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1358121.3333333333, ans=0.1 2023-10-13 10:42:07,898 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1358121.3333333333, ans=0.09899494936611666 2023-10-13 10:43:00,456 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1358354.6666666667, ans=0.1 2023-10-13 10:43:05,269 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.32 vs. limit=12.0 2023-10-13 10:43:13,993 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.489e+02 1.783e+02 1.943e+02 2.136e+02 2.979e+02, threshold=3.886e+02, percent-clipped=0.0 2023-10-13 10:43:32,591 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1358448.0, ans=0.125 2023-10-13 10:43:32,939 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.82 vs. limit=15.0 2023-10-13 10:43:55,223 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1358541.3333333333, ans=0.125 2023-10-13 10:44:00,759 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1358588.0, ans=0.1 2023-10-13 10:44:19,686 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1358634.6666666667, ans=0.125 2023-10-13 10:44:20,983 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.36 vs. 
limit=15.0 2023-10-13 10:44:27,419 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1358681.3333333333, ans=0.2 2023-10-13 10:44:51,911 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1358774.6666666667, ans=0.09899494936611666 2023-10-13 10:45:23,093 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.489e+02 1.817e+02 1.978e+02 2.218e+02 3.339e+02, threshold=3.956e+02, percent-clipped=0.0 2023-10-13 10:45:38,933 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1358914.6666666667, ans=0.0 2023-10-13 10:45:45,274 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.39 vs. limit=15.0 2023-10-13 10:46:07,585 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1359008.0, ans=0.0 2023-10-13 10:46:24,089 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1359101.3333333333, ans=10.0 2023-10-13 10:46:26,975 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1359101.3333333333, ans=0.125 2023-10-13 10:46:27,463 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=10.09 vs. limit=10.0 2023-10-13 10:46:29,206 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1359101.3333333333, ans=0.125 2023-10-13 10:46:33,772 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1359148.0, ans=0.125 2023-10-13 10:46:47,897 INFO [train.py:1031] (3/4) Epoch 22, batch 4500, loss[loss=0.1654, simple_loss=0.2666, pruned_loss=0.03209, over 16873.00 frames. ], tot_loss[loss=0.189, simple_loss=0.2805, pruned_loss=0.04877, over 29348106.31 frames. ], batch size: 87, lr: 1.57e-03, grad_scale: 32.0 2023-10-13 10:46:50,490 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1359194.6666666667, ans=0.125 2023-10-13 10:47:01,720 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.41 vs. limit=22.5 2023-10-13 10:47:25,484 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.561e+02 1.737e+02 1.884e+02 2.066e+02 2.936e+02, threshold=3.769e+02, percent-clipped=0.0 2023-10-13 10:47:34,610 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1359381.3333333333, ans=0.125 2023-10-13 10:47:42,325 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1359428.0, ans=0.0 2023-10-13 10:48:00,884 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1359474.6666666667, ans=0.0 2023-10-13 10:48:04,665 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.66 vs. 
limit=6.0 2023-10-13 10:48:07,436 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.75 vs. limit=15.0 2023-10-13 10:48:20,263 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1359568.0, ans=0.1 2023-10-13 10:48:20,643 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.50 vs. limit=6.0 2023-10-13 10:48:44,709 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1359661.3333333333, ans=0.1 2023-10-13 10:48:47,299 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1359661.3333333333, ans=0.125 2023-10-13 10:48:56,907 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1359708.0, ans=0.0 2023-10-13 10:48:58,171 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1359708.0, ans=0.125 2023-10-13 10:49:11,725 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1359754.6666666667, ans=0.1 2023-10-13 10:49:19,997 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.433e+02 1.735e+02 1.959e+02 2.214e+02 3.494e+02, threshold=3.918e+02, percent-clipped=0.0 2023-10-13 10:49:29,689 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1359848.0, ans=0.2 2023-10-13 10:49:35,282 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1359848.0, ans=0.2 2023-10-13 10:49:53,595 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1359941.3333333333, ans=0.1 2023-10-13 10:50:26,067 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1360034.6666666667, ans=0.0 2023-10-13 10:51:18,236 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1360221.3333333333, ans=0.125 2023-10-13 10:51:26,942 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.510e+02 1.788e+02 1.942e+02 2.165e+02 3.503e+02, threshold=3.884e+02, percent-clipped=0.0 2023-10-13 10:51:32,590 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1360314.6666666667, ans=0.05 2023-10-13 10:51:38,474 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1360314.6666666667, ans=0.125 2023-10-13 10:52:07,373 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.14 vs. 
limit=15.0 2023-10-13 10:52:08,168 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1360454.6666666667, ans=0.2 2023-10-13 10:52:10,605 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1360454.6666666667, ans=0.1 2023-10-13 10:52:20,288 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1360501.3333333333, ans=0.0 2023-10-13 10:52:35,208 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1360548.0, ans=0.125 2023-10-13 10:52:55,748 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1360641.3333333333, ans=0.125 2023-10-13 10:53:08,252 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1360688.0, ans=0.125 2023-10-13 10:53:22,566 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.464e+02 1.704e+02 1.836e+02 2.009e+02 2.865e+02, threshold=3.671e+02, percent-clipped=0.0 2023-10-13 10:53:28,342 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1360734.6666666667, ans=0.125 2023-10-13 10:54:22,779 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1360968.0, ans=0.125 2023-10-13 10:54:31,263 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=1361014.6666666667, ans=0.05 2023-10-13 10:54:40,698 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1361061.3333333333, ans=0.1 2023-10-13 10:54:43,590 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.43 vs. 
limit=22.5 2023-10-13 10:54:53,828 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1361108.0, ans=0.125 2023-10-13 10:55:03,459 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.min_positive, batch_count=1361154.6666666667, ans=0.025 2023-10-13 10:55:17,169 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1361201.3333333333, ans=0.0 2023-10-13 10:55:21,965 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1361201.3333333333, ans=0.2 2023-10-13 10:55:22,616 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.522e+02 1.747e+02 1.921e+02 2.111e+02 3.439e+02, threshold=3.843e+02, percent-clipped=0.0 2023-10-13 10:55:47,238 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1361294.6666666667, ans=0.025 2023-10-13 10:55:57,358 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1361341.3333333333, ans=0.125 2023-10-13 10:56:03,830 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=1361388.0, ans=22.5 2023-10-13 10:56:05,195 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.52 vs. limit=12.0 2023-10-13 10:56:05,958 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1361388.0, ans=0.125 2023-10-13 10:56:08,694 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 10:56:18,431 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1361434.6666666667, ans=0.1 2023-10-13 10:56:30,055 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1361481.3333333333, ans=0.125 2023-10-13 10:56:31,824 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1361481.3333333333, ans=0.125 2023-10-13 10:56:37,280 INFO [train.py:1031] (3/4) Epoch 22, batch 5000, loss[loss=0.1679, simple_loss=0.2332, pruned_loss=0.0513, over 12377.00 frames. ], tot_loss[loss=0.189, simple_loss=0.2802, pruned_loss=0.04889, over 30098278.36 frames. ], batch size: 440, lr: 1.57e-03, grad_scale: 16.0 2023-10-13 10:57:01,273 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1361621.3333333333, ans=0.0 2023-10-13 10:57:02,395 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=1361621.3333333333, ans=0.5 2023-10-13 10:57:15,907 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1361668.0, ans=0.035 2023-10-13 10:57:17,481 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.94 vs. 
limit=6.0 2023-10-13 10:57:19,767 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.420e+02 1.831e+02 1.966e+02 2.188e+02 2.975e+02, threshold=3.932e+02, percent-clipped=0.0 2023-10-13 10:57:25,169 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1361714.6666666667, ans=0.125 2023-10-13 10:57:38,334 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1361761.3333333333, ans=0.125 2023-10-13 10:57:46,487 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.27 vs. limit=15.0 2023-10-13 10:57:59,598 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1361854.6666666667, ans=0.1 2023-10-13 10:57:59,701 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1361854.6666666667, ans=0.125 2023-10-13 10:58:23,086 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 10:58:30,417 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1361948.0, ans=0.125 2023-10-13 10:58:36,325 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1361994.6666666667, ans=0.0 2023-10-13 10:58:42,217 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1361994.6666666667, ans=0.125 2023-10-13 10:58:57,003 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=1362041.3333333333, ans=0.5 2023-10-13 10:58:57,138 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.25 vs. limit=15.0 2023-10-13 10:59:17,026 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.437e+02 1.784e+02 2.009e+02 2.283e+02 3.097e+02, threshold=4.018e+02, percent-clipped=0.0 2023-10-13 10:59:32,153 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1362228.0, ans=0.125 2023-10-13 11:00:26,159 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1362414.6666666667, ans=0.1 2023-10-13 11:00:53,338 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1362508.0, ans=0.125 2023-10-13 11:00:56,521 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1362554.6666666667, ans=0.125 2023-10-13 11:01:12,018 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.387e+02 1.774e+02 1.922e+02 2.113e+02 3.095e+02, threshold=3.844e+02, percent-clipped=0.0 2023-10-13 11:01:12,678 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.05 vs. 
limit=15.0 2023-10-13 11:01:57,414 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1362788.0, ans=0.1 2023-10-13 11:02:07,872 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=3.86 vs. limit=10.0 2023-10-13 11:02:08,645 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1362834.6666666667, ans=0.1 2023-10-13 11:02:26,655 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1362881.3333333333, ans=0.0 2023-10-13 11:02:30,902 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 11:03:15,795 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.430e+02 1.775e+02 1.988e+02 2.213e+02 2.967e+02, threshold=3.976e+02, percent-clipped=0.0 2023-10-13 11:03:32,311 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-13 11:03:33,259 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1363161.3333333333, ans=0.0 2023-10-13 11:03:36,597 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1363161.3333333333, ans=0.0 2023-10-13 11:03:36,614 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1363161.3333333333, ans=0.125 2023-10-13 11:03:49,580 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1363208.0, ans=0.125 2023-10-13 11:03:52,928 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1363208.0, ans=0.0 2023-10-13 11:04:00,882 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1363254.6666666667, ans=0.125 2023-10-13 11:04:13,782 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1363301.3333333333, ans=0.125 2023-10-13 11:04:15,598 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1363301.3333333333, ans=0.125 2023-10-13 11:04:21,660 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1363348.0, ans=0.1 2023-10-13 11:04:24,609 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.85 vs. 
limit=22.5 2023-10-13 11:04:29,182 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1363348.0, ans=0.1 2023-10-13 11:04:50,983 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1363441.3333333333, ans=0.07 2023-10-13 11:04:53,560 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1363488.0, ans=0.125 2023-10-13 11:04:56,793 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1363488.0, ans=0.125 2023-10-13 11:04:58,964 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.64 vs. limit=12.0 2023-10-13 11:05:02,364 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1363488.0, ans=0.125 2023-10-13 11:05:10,915 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.525e+02 1.768e+02 2.037e+02 2.248e+02 4.172e+02, threshold=4.075e+02, percent-clipped=1.0 2023-10-13 11:05:13,765 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.24 vs. limit=8.0 2023-10-13 11:05:23,996 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1363581.3333333333, ans=0.125 2023-10-13 11:05:28,186 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 11:05:29,940 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1363628.0, ans=0.0 2023-10-13 11:05:34,119 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.76 vs. limit=6.0 2023-10-13 11:05:55,651 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1363721.3333333333, ans=0.125 2023-10-13 11:06:05,980 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1363768.0, ans=0.1 2023-10-13 11:06:19,652 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1363814.6666666667, ans=0.0 2023-10-13 11:06:31,356 INFO [train.py:1031] (3/4) Epoch 22, batch 5500, loss[loss=0.1866, simple_loss=0.2835, pruned_loss=0.04487, over 16883.00 frames. ], tot_loss[loss=0.1889, simple_loss=0.2802, pruned_loss=0.04886, over 30713913.03 frames. ], batch size: 77, lr: 1.57e-03, grad_scale: 32.0 2023-10-13 11:06:40,737 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1363861.3333333333, ans=0.125 2023-10-13 11:06:48,645 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1363908.0, ans=0.0 2023-10-13 11:06:49,701 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1363908.0, ans=0.0 2023-10-13 11:06:50,997 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.17 vs. 
limit=22.5 2023-10-13 11:07:09,581 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.457e+02 1.700e+02 1.858e+02 1.994e+02 3.264e+02, threshold=3.715e+02, percent-clipped=0.0 2023-10-13 11:07:17,797 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1364048.0, ans=0.2 2023-10-13 11:07:48,500 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1364141.3333333333, ans=0.0 2023-10-13 11:08:21,700 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1364281.3333333333, ans=0.125 2023-10-13 11:08:21,923 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.71 vs. limit=22.5 2023-10-13 11:08:29,985 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1364328.0, ans=0.125 2023-10-13 11:08:35,268 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.93 vs. limit=15.0 2023-10-13 11:09:03,639 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1364468.0, ans=0.0 2023-10-13 11:09:04,800 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1364468.0, ans=0.0 2023-10-13 11:09:05,735 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 11:09:07,963 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.515e+02 1.733e+02 1.854e+02 2.013e+02 2.761e+02, threshold=3.709e+02, percent-clipped=0.0 2023-10-13 11:09:18,018 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1364514.6666666667, ans=0.0 2023-10-13 11:09:29,335 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1364561.3333333333, ans=0.2 2023-10-13 11:09:31,373 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1364561.3333333333, ans=0.125 2023-10-13 11:09:43,780 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1364608.0, ans=0.125 2023-10-13 11:10:09,654 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=23.26 vs. 
limit=22.5 2023-10-13 11:10:13,923 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1364748.0, ans=0.0 2023-10-13 11:10:17,936 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1364748.0, ans=0.09899494936611666 2023-10-13 11:10:44,765 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1364841.3333333333, ans=0.125 2023-10-13 11:10:44,789 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1364841.3333333333, ans=0.125 2023-10-13 11:10:49,760 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.27 vs. limit=15.0 2023-10-13 11:10:52,300 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.85 vs. limit=22.5 2023-10-13 11:11:04,272 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.47 vs. limit=6.0 2023-10-13 11:11:06,338 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.496e+02 1.799e+02 1.977e+02 2.222e+02 3.111e+02, threshold=3.955e+02, percent-clipped=0.0 2023-10-13 11:11:14,838 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1364981.3333333333, ans=0.125 2023-10-13 11:11:16,108 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.22 vs. limit=6.0 2023-10-13 11:11:20,115 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1364981.3333333333, ans=0.0 2023-10-13 11:11:21,118 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1364981.3333333333, ans=0.125 2023-10-13 11:11:25,068 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1365028.0, ans=0.0 2023-10-13 11:11:32,478 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1365028.0, ans=0.1 2023-10-13 11:11:34,974 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1365074.6666666667, ans=0.05 2023-10-13 11:11:41,057 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.49 vs. limit=15.0 2023-10-13 11:11:50,514 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1365121.3333333333, ans=0.125 2023-10-13 11:11:52,196 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1365121.3333333333, ans=0.1 2023-10-13 11:12:00,944 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=3.73 vs. 
limit=15.0 2023-10-13 11:12:23,910 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1365261.3333333333, ans=0.2 2023-10-13 11:12:41,219 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1365308.0, ans=0.1 2023-10-13 11:12:58,335 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1365354.6666666667, ans=0.125 2023-10-13 11:13:04,342 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1365401.3333333333, ans=0.025 2023-10-13 11:13:06,934 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.525e+02 1.783e+02 2.040e+02 2.345e+02 3.210e+02, threshold=4.081e+02, percent-clipped=0.0 2023-10-13 11:13:08,155 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.74 vs. limit=6.0 2023-10-13 11:13:17,416 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1365448.0, ans=0.125 2023-10-13 11:13:32,800 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1365494.6666666667, ans=0.1 2023-10-13 11:13:50,725 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1365588.0, ans=0.125 2023-10-13 11:14:03,958 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1365634.6666666667, ans=0.125 2023-10-13 11:14:09,734 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1365634.6666666667, ans=0.125 2023-10-13 11:14:17,417 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1365681.3333333333, ans=0.1 2023-10-13 11:14:27,975 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1365728.0, ans=0.0 2023-10-13 11:14:29,080 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1365728.0, ans=0.125 2023-10-13 11:14:29,156 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1365728.0, ans=0.125 2023-10-13 11:14:36,947 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1365774.6666666667, ans=0.125 2023-10-13 11:14:49,266 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1365821.3333333333, ans=0.1 2023-10-13 11:15:10,064 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1365868.0, ans=0.125 2023-10-13 11:15:10,646 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.430e+02 1.789e+02 1.891e+02 2.108e+02 2.744e+02, threshold=3.783e+02, percent-clipped=0.0 2023-10-13 11:15:35,160 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.69 
vs. limit=10.0 2023-10-13 11:15:35,984 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1366008.0, ans=0.125 2023-10-13 11:15:37,032 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1366008.0, ans=0.125 2023-10-13 11:15:37,861 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1366008.0, ans=0.0 2023-10-13 11:15:52,171 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1366054.6666666667, ans=0.05 2023-10-13 11:16:08,403 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1366101.3333333333, ans=0.125 2023-10-13 11:16:14,748 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1366148.0, ans=0.2 2023-10-13 11:16:21,911 INFO [train.py:1031] (3/4) Epoch 22, batch 6000, loss[loss=0.1768, simple_loss=0.2404, pruned_loss=0.05666, over 12513.00 frames. ], tot_loss[loss=0.1895, simple_loss=0.2804, pruned_loss=0.04927, over 31129659.75 frames. ], batch size: 440, lr: 1.57e-03, grad_scale: 16.0 2023-10-13 11:17:06,600 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.467e+02 1.799e+02 1.953e+02 2.162e+02 2.821e+02, threshold=3.905e+02, percent-clipped=0.0 2023-10-13 11:17:11,632 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1366381.3333333333, ans=0.125 2023-10-13 11:17:34,438 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1366474.6666666667, ans=0.125 2023-10-13 11:17:50,100 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.09 vs. limit=15.0 2023-10-13 11:18:10,247 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1366614.6666666667, ans=0.125 2023-10-13 11:18:14,231 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.91 vs. 
limit=15.0 2023-10-13 11:18:26,156 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1366661.3333333333, ans=0.125 2023-10-13 11:18:32,038 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1366661.3333333333, ans=0.0 2023-10-13 11:18:35,808 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1366708.0, ans=0.0 2023-10-13 11:18:39,453 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1366708.0, ans=0.125 2023-10-13 11:18:41,499 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1366708.0, ans=0.2 2023-10-13 11:18:46,843 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1366754.6666666667, ans=0.07 2023-10-13 11:18:54,650 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.14 vs. limit=6.0 2023-10-13 11:19:05,393 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.439e+02 1.821e+02 2.015e+02 2.202e+02 2.685e+02, threshold=4.029e+02, percent-clipped=0.0 2023-10-13 11:19:20,703 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1366894.6666666667, ans=0.1 2023-10-13 11:19:32,469 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1366941.3333333333, ans=0.0 2023-10-13 11:19:37,066 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1366941.3333333333, ans=0.125 2023-10-13 11:21:04,722 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1367268.0, ans=0.125 2023-10-13 11:21:09,269 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.00 vs. 
limit=22.5 2023-10-13 11:21:10,642 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.544e+02 1.780e+02 1.932e+02 2.145e+02 2.662e+02, threshold=3.864e+02, percent-clipped=0.0 2023-10-13 11:21:38,797 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1367408.0, ans=0.125 2023-10-13 11:21:42,001 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1367408.0, ans=0.2 2023-10-13 11:21:48,185 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1367408.0, ans=0.0 2023-10-13 11:22:09,909 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1367501.3333333333, ans=0.1 2023-10-13 11:22:10,813 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1367501.3333333333, ans=0.125 2023-10-13 11:22:19,189 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1367548.0, ans=0.0 2023-10-13 11:22:28,993 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1367594.6666666667, ans=0.1 2023-10-13 11:22:33,455 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1367594.6666666667, ans=0.5 2023-10-13 11:22:41,276 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.79 vs. limit=15.0 2023-10-13 11:22:45,452 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1367641.3333333333, ans=0.1 2023-10-13 11:22:57,175 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1367688.0, ans=0.1 2023-10-13 11:23:03,152 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1367688.0, ans=0.2 2023-10-13 11:23:14,873 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.560e+02 1.793e+02 2.039e+02 2.250e+02 3.260e+02, threshold=4.079e+02, percent-clipped=0.0 2023-10-13 11:23:18,156 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1367781.3333333333, ans=0.125 2023-10-13 11:23:48,919 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1367874.6666666667, ans=0.0 2023-10-13 11:24:19,960 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1367968.0, ans=0.125 2023-10-13 11:24:34,450 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1368014.6666666667, ans=0.0 2023-10-13 11:24:40,876 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1368061.3333333333, ans=0.07 2023-10-13 11:25:19,656 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.421e+02 1.699e+02 1.820e+02 1.961e+02 2.790e+02, threshold=3.639e+02, percent-clipped=0.0 2023-10-13 11:25:34,187 INFO [scaling.py:199] (3/4) 
ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1368248.0, ans=0.125 2023-10-13 11:25:46,879 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1368341.3333333333, ans=0.125 2023-10-13 11:25:51,581 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1368341.3333333333, ans=0.0 2023-10-13 11:25:51,733 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.46 vs. limit=10.0 2023-10-13 11:25:54,751 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.52 vs. limit=15.0 2023-10-13 11:26:00,133 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1368388.0, ans=0.0 2023-10-13 11:26:16,753 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1368434.6666666667, ans=0.0 2023-10-13 11:26:17,995 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.53 vs. limit=15.0 2023-10-13 11:26:35,051 INFO [train.py:1031] (3/4) Epoch 22, batch 6500, loss[loss=0.1725, simple_loss=0.2692, pruned_loss=0.03785, over 16889.00 frames. ], tot_loss[loss=0.1899, simple_loss=0.2809, pruned_loss=0.04945, over 31480984.55 frames. ], batch size: 72, lr: 1.57e-03, grad_scale: 16.0 2023-10-13 11:26:45,571 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1368528.0, ans=0.0 2023-10-13 11:26:59,947 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.20 vs. limit=15.0 2023-10-13 11:27:06,083 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1368621.3333333333, ans=0.1 2023-10-13 11:27:10,428 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1368621.3333333333, ans=0.125 2023-10-13 11:27:18,409 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1368668.0, ans=0.125 2023-10-13 11:27:31,226 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.532e+02 1.795e+02 2.010e+02 2.225e+02 2.848e+02, threshold=4.019e+02, percent-clipped=0.0 2023-10-13 11:27:42,566 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1368714.6666666667, ans=0.125 2023-10-13 11:27:42,701 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.93 vs. limit=15.0 2023-10-13 11:27:52,180 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.23 vs. limit=15.0 2023-10-13 11:27:52,408 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.79 vs. 
limit=12.0 2023-10-13 11:28:13,206 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.66 vs. limit=15.0 2023-10-13 11:28:20,922 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1368854.6666666667, ans=0.05 2023-10-13 11:28:35,412 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1368901.3333333333, ans=0.1 2023-10-13 11:28:37,533 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1368948.0, ans=0.125 2023-10-13 11:28:43,645 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1368948.0, ans=0.125 2023-10-13 11:28:58,259 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1368994.6666666667, ans=0.0 2023-10-13 11:29:07,983 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1369041.3333333333, ans=0.125 2023-10-13 11:29:29,168 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1369134.6666666667, ans=0.0 2023-10-13 11:29:36,129 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.501e+02 1.876e+02 2.013e+02 2.178e+02 2.772e+02, threshold=4.026e+02, percent-clipped=0.0 2023-10-13 11:29:47,089 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1369181.3333333333, ans=0.125 2023-10-13 11:29:58,121 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1369228.0, ans=0.0 2023-10-13 11:30:00,510 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.88 vs. limit=15.0 2023-10-13 11:30:37,484 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 11:30:50,098 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1369461.3333333333, ans=0.04949747468305833 2023-10-13 11:30:51,069 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1369461.3333333333, ans=0.125 2023-10-13 11:30:52,310 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.20 vs. limit=6.0 2023-10-13 11:31:03,492 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.48 vs. limit=12.0 2023-10-13 11:31:04,519 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1369508.0, ans=0.0 2023-10-13 11:31:05,585 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.85 vs. 
limit=6.0 2023-10-13 11:31:30,260 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.511e+02 1.820e+02 1.974e+02 2.279e+02 3.045e+02, threshold=3.948e+02, percent-clipped=0.0 2023-10-13 11:31:36,590 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1369648.0, ans=0.125 2023-10-13 11:31:50,230 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1369694.6666666667, ans=0.125 2023-10-13 11:32:14,683 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1369788.0, ans=0.0 2023-10-13 11:32:15,106 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.55 vs. limit=5.0 2023-10-13 11:32:19,776 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1369788.0, ans=0.0 2023-10-13 11:32:20,787 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1369834.6666666667, ans=0.125 2023-10-13 11:32:29,341 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1369834.6666666667, ans=10.0 2023-10-13 11:32:29,483 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1369834.6666666667, ans=0.1 2023-10-13 11:32:42,539 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.96 vs. limit=15.0 2023-10-13 11:32:57,573 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1369928.0, ans=0.0 2023-10-13 11:33:47,435 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1370068.0, ans=0.05 2023-10-13 11:33:48,596 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.401e+02 1.701e+02 1.855e+02 2.110e+02 2.887e+02, threshold=3.710e+02, percent-clipped=0.0 2023-10-13 11:33:50,113 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1370114.6666666667, ans=0.2 2023-10-13 11:33:52,296 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1370114.6666666667, ans=0.125 2023-10-13 11:34:04,473 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1370161.3333333333, ans=0.2 2023-10-13 11:34:30,011 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1370254.6666666667, ans=0.125 2023-10-13 11:35:00,478 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1370348.0, ans=0.2 2023-10-13 11:35:07,820 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1370394.6666666667, ans=0.1 2023-10-13 11:35:19,621 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1370441.3333333333, ans=0.125 2023-10-13 11:35:32,885 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1370488.0, ans=0.125 2023-10-13 11:35:37,450 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1370488.0, ans=0.125 2023-10-13 11:35:42,340 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1370534.6666666667, ans=0.0 2023-10-13 11:35:51,504 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.382e+02 1.692e+02 1.895e+02 2.144e+02 2.983e+02, threshold=3.790e+02, percent-clipped=0.0 2023-10-13 11:36:20,260 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.34 vs. limit=15.0 2023-10-13 11:36:22,557 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1370674.6666666667, ans=0.0 2023-10-13 11:36:29,730 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 11:36:52,334 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1370814.6666666667, ans=0.0 2023-10-13 11:36:53,377 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1370814.6666666667, ans=0.0 2023-10-13 11:37:03,933 INFO [train.py:1031] (3/4) Epoch 22, batch 7000, loss[loss=0.1952, simple_loss=0.2902, pruned_loss=0.05007, over 16547.00 frames. ], tot_loss[loss=0.1901, simple_loss=0.2815, pruned_loss=0.0493, over 31813384.96 frames. ], batch size: 61, lr: 1.57e-03, grad_scale: 32.0 2023-10-13 11:37:16,787 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1370861.3333333333, ans=0.05 2023-10-13 11:37:25,182 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1370908.0, ans=0.0 2023-10-13 11:37:44,706 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.05 vs. limit=22.5 2023-10-13 11:37:50,318 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.63 vs. limit=22.5 2023-10-13 11:37:54,097 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.501e+02 1.949e+02 2.157e+02 2.372e+02 3.551e+02, threshold=4.314e+02, percent-clipped=0.0 2023-10-13 11:37:55,814 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1371048.0, ans=0.0 2023-10-13 11:37:56,639 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1371048.0, ans=0.1 2023-10-13 11:37:58,668 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1371048.0, ans=0.125 2023-10-13 11:38:13,868 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1371094.6666666667, ans=0.2 2023-10-13 11:38:14,094 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.67 vs. 
limit=15.0 2023-10-13 11:38:14,929 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1371094.6666666667, ans=0.125 2023-10-13 11:38:16,914 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1371094.6666666667, ans=0.1 2023-10-13 11:38:40,494 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-13 11:38:57,320 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 11:38:58,157 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1371281.3333333333, ans=0.0 2023-10-13 11:38:59,796 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1371281.3333333333, ans=0.0 2023-10-13 11:39:26,063 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1371374.6666666667, ans=0.0 2023-10-13 11:39:27,654 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1371421.3333333333, ans=0.125 2023-10-13 11:39:39,214 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1371421.3333333333, ans=0.0 2023-10-13 11:39:53,454 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.549e+02 1.914e+02 2.082e+02 2.363e+02 3.072e+02, threshold=4.164e+02, percent-clipped=0.0 2023-10-13 11:40:01,098 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 11:40:12,915 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1371561.3333333333, ans=0.125 2023-10-13 11:40:13,746 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1371561.3333333333, ans=0.1 2023-10-13 11:40:18,025 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=4.28 vs. limit=12.0 2023-10-13 11:40:19,069 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.06 vs. 
limit=22.5 2023-10-13 11:40:35,821 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1371654.6666666667, ans=0.0 2023-10-13 11:40:40,810 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-13 11:40:40,844 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1371701.3333333333, ans=0.07 2023-10-13 11:40:46,134 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1371701.3333333333, ans=0.125 2023-10-13 11:40:47,197 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1371701.3333333333, ans=0.0 2023-10-13 11:41:10,134 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1371794.6666666667, ans=0.125 2023-10-13 11:41:27,070 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1371841.3333333333, ans=0.125 2023-10-13 11:41:45,011 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1371888.0, ans=0.125 2023-10-13 11:41:46,231 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1371888.0, ans=0.125 2023-10-13 11:42:02,035 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1371934.6666666667, ans=0.2 2023-10-13 11:42:07,521 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.382e+02 1.738e+02 1.893e+02 2.098e+02 2.690e+02, threshold=3.786e+02, percent-clipped=0.0 2023-10-13 11:42:15,236 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1371981.3333333333, ans=0.1 2023-10-13 11:42:36,612 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1372074.6666666667, ans=0.125 2023-10-13 11:42:39,044 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.76 vs. 
limit=15.0 2023-10-13 11:42:50,074 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1372121.3333333333, ans=0.125 2023-10-13 11:42:55,663 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1372121.3333333333, ans=0.125 2023-10-13 11:42:59,317 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1372168.0, ans=0.2 2023-10-13 11:43:04,046 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1372168.0, ans=0.125 2023-10-13 11:43:15,656 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1372214.6666666667, ans=0.125 2023-10-13 11:43:18,605 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1372214.6666666667, ans=0.125 2023-10-13 11:43:37,959 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1372261.3333333333, ans=0.2 2023-10-13 11:43:49,406 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1372308.0, ans=0.07 2023-10-13 11:44:06,562 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.34 vs. limit=15.0 2023-10-13 11:44:18,741 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.42 vs. limit=12.0 2023-10-13 11:44:20,356 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.423e+02 1.774e+02 1.976e+02 2.251e+02 3.269e+02, threshold=3.951e+02, percent-clipped=0.0 2023-10-13 11:44:22,549 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1372448.0, ans=0.125 2023-10-13 11:44:33,949 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1372494.6666666667, ans=0.125 2023-10-13 11:44:34,978 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1372494.6666666667, ans=0.125 2023-10-13 11:44:36,118 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1372494.6666666667, ans=0.0 2023-10-13 11:44:48,008 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1372541.3333333333, ans=0.125 2023-10-13 11:44:54,575 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.24 vs. 
limit=15.0 2023-10-13 11:45:09,932 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1372588.0, ans=0.04949747468305833 2023-10-13 11:45:12,193 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1372588.0, ans=0.125 2023-10-13 11:45:27,742 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1372681.3333333333, ans=0.125 2023-10-13 11:45:43,394 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1372728.0, ans=0.125 2023-10-13 11:46:06,414 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1372821.3333333333, ans=0.0 2023-10-13 11:46:07,618 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1372821.3333333333, ans=0.1 2023-10-13 11:46:14,522 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1372868.0, ans=0.125 2023-10-13 11:46:17,592 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1372868.0, ans=0.1 2023-10-13 11:46:24,636 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.494e+02 1.847e+02 2.094e+02 2.480e+02 3.545e+02, threshold=4.188e+02, percent-clipped=0.0 2023-10-13 11:46:25,965 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1372914.6666666667, ans=0.0 2023-10-13 11:46:29,713 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1372914.6666666667, ans=0.2 2023-10-13 11:46:57,345 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1373008.0, ans=0.0 2023-10-13 11:47:02,573 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1373054.6666666667, ans=0.0 2023-10-13 11:47:02,593 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1373054.6666666667, ans=0.0 2023-10-13 11:47:12,614 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1373101.3333333333, ans=0.0 2023-10-13 11:47:23,035 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1373148.0, ans=0.0 2023-10-13 11:47:35,103 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.43 vs. limit=15.0 2023-10-13 11:47:37,293 INFO [train.py:1031] (3/4) Epoch 22, batch 7500, loss[loss=0.1889, simple_loss=0.2766, pruned_loss=0.05058, over 16527.00 frames. ], tot_loss[loss=0.19, simple_loss=0.2814, pruned_loss=0.04928, over 32020431.91 frames. 
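The train.py:1031 record above reports two loss tuples per logging interval: loss[...] for the current batch and tot_loss[...] over a frame count that keeps growing (here 32,020,431.91 frames), i.e. a frame-weighted running average across the epoch so far. A minimal sketch of that bookkeeping, with illustrative names (icefall aggregates these stats in train.py via a MetricsTracker, whose exact fields may differ):

class LossTrackerSketch:
    # Frame-weighted aggregation of loss components across batches,
    # mirroring the shape of tot_loss[loss=..., simple_loss=...,
    # pruned_loss=..., over N frames] in the log.
    def __init__(self):
        self.frames = 0.0
        self.sums = {"loss": 0.0, "simple_loss": 0.0, "pruned_loss": 0.0}

    def update(self, frames: float, **losses: float) -> None:
        # Weight each batch by its frame count, so long utterances
        # contribute proportionally more to the running average.
        self.frames += frames
        for name, value in losses.items():
            self.sums[name] += value * frames

    def averages(self) -> dict:
        return {name: s / self.frames for name, s in self.sums.items()}

tracker = LossTrackerSketch()
# Values taken from the batch 7500 record above:
tracker.update(16527.0, loss=0.1889, simple_loss=0.2766, pruned_loss=0.05058)
print(tracker.averages())  # per-frame averages "over 16527.00 frames"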
], batch size: 241, lr: 1.57e-03, grad_scale: 16.0 2023-10-13 11:47:46,498 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1373194.6666666667, ans=0.07 2023-10-13 11:47:51,144 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1373241.3333333333, ans=0.125 2023-10-13 11:48:10,602 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1373334.6666666667, ans=0.1 2023-10-13 11:48:21,668 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1373381.3333333333, ans=0.2 2023-10-13 11:48:22,294 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.325e+02 1.778e+02 1.957e+02 2.210e+02 3.249e+02, threshold=3.914e+02, percent-clipped=0.0 2023-10-13 11:48:51,930 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1373474.6666666667, ans=0.0 2023-10-13 11:49:15,739 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1373568.0, ans=0.0 2023-10-13 11:49:33,247 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.85 vs. limit=15.0 2023-10-13 11:49:37,166 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1373661.3333333333, ans=0.0 2023-10-13 11:49:56,443 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 11:50:12,105 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1373801.3333333333, ans=0.125 2023-10-13 11:50:30,353 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.507e+02 1.791e+02 1.951e+02 2.143e+02 2.932e+02, threshold=3.902e+02, percent-clipped=0.0 2023-10-13 11:50:32,996 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 11:50:56,459 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1373941.3333333333, ans=0.1 2023-10-13 11:51:03,573 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1373941.3333333333, ans=0.125 2023-10-13 11:51:36,572 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1374081.3333333333, ans=0.0 2023-10-13 11:51:52,021 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.90 vs. limit=15.0 2023-10-13 11:51:59,055 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.77 vs. 
limit=15.0 2023-10-13 11:52:01,091 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1374174.6666666667, ans=0.07 2023-10-13 11:52:01,992 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1374174.6666666667, ans=0.2 2023-10-13 11:52:25,141 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1374268.0, ans=0.0 2023-10-13 11:52:28,028 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1374268.0, ans=0.125 2023-10-13 11:52:29,705 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.380e+02 1.716e+02 1.956e+02 2.273e+02 4.144e+02, threshold=3.912e+02, percent-clipped=1.0 2023-10-13 11:52:33,611 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1374314.6666666667, ans=0.2 2023-10-13 11:52:35,664 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 11:52:43,508 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1374361.3333333333, ans=0.0 2023-10-13 11:52:51,398 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1374361.3333333333, ans=0.125 2023-10-13 11:53:51,953 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.32 vs. limit=22.5 2023-10-13 11:54:11,970 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.17 vs. limit=22.5 2023-10-13 11:54:31,547 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=1374734.6666666667, ans=0.95 2023-10-13 11:54:36,171 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.491e+02 1.775e+02 1.947e+02 2.158e+02 2.896e+02, threshold=3.893e+02, percent-clipped=0.0 2023-10-13 11:54:45,149 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1374781.3333333333, ans=0.125 2023-10-13 11:54:56,408 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.33 vs. 
limit=15.0 2023-10-13 11:54:57,149 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1374828.0, ans=0.125 2023-10-13 11:55:11,316 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1374921.3333333333, ans=0.125 2023-10-13 11:55:12,357 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1374921.3333333333, ans=0.1 2023-10-13 11:55:32,520 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1374968.0, ans=0.125 2023-10-13 11:55:37,141 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1375014.6666666667, ans=0.125 2023-10-13 11:55:52,728 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1375061.3333333333, ans=0.035 2023-10-13 11:56:07,255 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1375108.0, ans=0.125 2023-10-13 11:56:14,195 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.33 vs. limit=15.0 2023-10-13 11:56:38,863 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1375248.0, ans=0.0 2023-10-13 11:56:39,477 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.468e+02 1.764e+02 1.965e+02 2.377e+02 3.465e+02, threshold=3.929e+02, percent-clipped=0.0 2023-10-13 11:56:42,021 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1375248.0, ans=0.0 2023-10-13 11:56:50,254 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1375248.0, ans=0.09899494936611666 2023-10-13 11:56:58,001 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.25 vs. limit=15.0 2023-10-13 11:57:18,422 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1375341.3333333333, ans=0.125 2023-10-13 11:57:21,705 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.92 vs. limit=12.0 2023-10-13 11:57:29,987 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1375388.0, ans=0.125 2023-10-13 11:57:50,530 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.27 vs. limit=22.5 2023-10-13 11:57:55,399 INFO [train.py:1031] (3/4) Epoch 22, batch 8000, loss[loss=0.1659, simple_loss=0.2635, pruned_loss=0.03413, over 16811.00 frames. ], tot_loss[loss=0.189, simple_loss=0.2807, pruned_loss=0.04872, over 32162521.94 frames. 
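The optim.py:471 records interleaved above summarize recent gradient norms as five quartile values (min, 25%, median, 75%, max). Note that the reported threshold consistently equals Clipping_scale times the median quartile (e.g. 2.0 x 1.965e+02 ~ 3.929e+02 just above), so the clipping cap adapts to the recent norm distribution instead of being fixed. A sketch of that scheme follows; the window size and exact quartile handling are assumptions, not optim.py's internals.

from collections import deque

import torch

def clip_by_median_sketch(model, norm_history: deque,
                          clipping_scale: float = 2.0,
                          window: int = 128):
    # Collect the overall gradient norm of this batch.
    params = [p for p in model.parameters() if p.grad is not None]
    norm = torch.norm(torch.stack([p.grad.norm() for p in params])).item()
    norm_history.append(norm)
    while len(norm_history) > window:
        norm_history.popleft()

    # Quartiles of the recent norms, as printed in the log.
    t = torch.tensor(sorted(norm_history))
    q = [t[int(f * (len(t) - 1))].item() for f in (0.0, 0.25, 0.5, 0.75, 1.0)]
    threshold = clipping_scale * q[2]  # Clipping_scale times the median

    # Rescale this batch's gradients if they exceed the threshold.
    if norm > threshold:
        for p in params:
            p.grad.mul_(threshold / norm)

    percent_clipped = 100.0 * (t > threshold).float().mean().item()
    return q, threshold, percent_clipped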
], batch size: 188, lr: 1.57e-03, grad_scale: 32.0 2023-10-13 11:58:06,295 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=1375574.6666666667, ans=10.0 2023-10-13 11:58:15,439 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.88 vs. limit=15.0 2023-10-13 11:58:16,116 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1375574.6666666667, ans=0.125 2023-10-13 11:58:30,885 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.96 vs. limit=15.0 2023-10-13 11:58:35,300 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1375668.0, ans=0.2 2023-10-13 11:58:42,142 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.425e+02 1.728e+02 1.885e+02 2.076e+02 3.337e+02, threshold=3.769e+02, percent-clipped=0.0 2023-10-13 11:58:42,728 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.69 vs. limit=22.5 2023-10-13 11:59:16,443 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.30 vs. limit=15.0 2023-10-13 11:59:17,350 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.43 vs. limit=15.0 2023-10-13 11:59:33,721 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.89 vs. limit=6.0 2023-10-13 11:59:35,246 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1375901.3333333333, ans=0.125 2023-10-13 11:59:43,433 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.82 vs. limit=15.0 2023-10-13 11:59:50,789 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1375994.6666666667, ans=0.0 2023-10-13 11:59:56,303 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1375994.6666666667, ans=0.0 2023-10-13 12:00:11,947 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.78 vs. 
limit=15.0 2023-10-13 12:00:12,933 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1376088.0, ans=0.09899494936611666 2023-10-13 12:00:13,861 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1376088.0, ans=0.1 2023-10-13 12:00:29,044 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1376134.6666666667, ans=0.1 2023-10-13 12:00:29,120 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1376134.6666666667, ans=0.125 2023-10-13 12:00:35,732 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.484e+02 1.832e+02 2.043e+02 2.283e+02 3.630e+02, threshold=4.087e+02, percent-clipped=0.0 2023-10-13 12:00:52,515 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1376228.0, ans=0.125 2023-10-13 12:00:52,530 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1376228.0, ans=0.2 2023-10-13 12:00:59,711 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.74 vs. limit=22.5 2023-10-13 12:01:17,122 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1376274.6666666667, ans=0.125 2023-10-13 12:01:46,614 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1376368.0, ans=0.125 2023-10-13 12:01:57,325 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1376414.6666666667, ans=0.1 2023-10-13 12:02:22,678 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1376508.0, ans=0.1 2023-10-13 12:02:38,609 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.11 vs. limit=15.0 2023-10-13 12:02:57,830 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.568e+02 1.775e+02 1.877e+02 2.043e+02 3.329e+02, threshold=3.754e+02, percent-clipped=0.0 2023-10-13 12:03:17,996 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1376741.3333333333, ans=0.1 2023-10-13 12:03:36,059 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1376788.0, ans=0.05 2023-10-13 12:03:42,365 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.16 vs. 
limit=15.0 2023-10-13 12:04:09,549 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1376928.0, ans=0.125 2023-10-13 12:04:09,556 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1376928.0, ans=0.2 2023-10-13 12:04:10,545 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1376928.0, ans=0.125 2023-10-13 12:04:38,265 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1377021.3333333333, ans=0.125 2023-10-13 12:04:40,815 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1377021.3333333333, ans=0.125 2023-10-13 12:04:45,790 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1377068.0, ans=0.0 2023-10-13 12:04:46,098 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.06 vs. limit=22.5 2023-10-13 12:04:58,286 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.531e+02 1.743e+02 1.964e+02 2.153e+02 4.153e+02, threshold=3.927e+02, percent-clipped=1.0 2023-10-13 12:04:58,736 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1377114.6666666667, ans=0.0 2023-10-13 12:05:03,094 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1377114.6666666667, ans=0.125 2023-10-13 12:05:19,887 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1377208.0, ans=0.125 2023-10-13 12:05:36,925 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1377254.6666666667, ans=0.125 2023-10-13 12:05:54,906 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1377348.0, ans=0.125 2023-10-13 12:06:01,698 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.01 vs. 
limit=6.0 2023-10-13 12:06:12,355 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1377394.6666666667, ans=0.1 2023-10-13 12:06:29,062 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1377441.3333333333, ans=0.1 2023-10-13 12:06:51,307 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1377534.6666666667, ans=0.125 2023-10-13 12:06:51,404 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1377534.6666666667, ans=0.0 2023-10-13 12:06:55,002 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1377534.6666666667, ans=0.125 2023-10-13 12:06:56,952 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1377534.6666666667, ans=0.0 2023-10-13 12:07:02,285 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.533e+02 1.875e+02 2.037e+02 2.268e+02 3.290e+02, threshold=4.074e+02, percent-clipped=0.0 2023-10-13 12:07:27,962 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1377674.6666666667, ans=0.0 2023-10-13 12:07:32,488 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=3.91 vs. limit=15.0 2023-10-13 12:07:33,106 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1377674.6666666667, ans=0.0 2023-10-13 12:07:56,005 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1377768.0, ans=0.125 2023-10-13 12:08:00,138 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1377768.0, ans=0.2 2023-10-13 12:08:20,434 INFO [train.py:1031] (3/4) Epoch 22, batch 8500, loss[loss=0.1973, simple_loss=0.2893, pruned_loss=0.05267, over 16420.00 frames. ], tot_loss[loss=0.1891, simple_loss=0.281, pruned_loss=0.04862, over 32319670.77 frames. ], batch size: 44, lr: 1.56e-03, grad_scale: 16.0 2023-10-13 12:08:25,396 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.15 vs. limit=15.0 2023-10-13 12:08:30,395 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1377861.3333333333, ans=0.125 2023-10-13 12:09:10,788 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.507e+02 1.798e+02 1.973e+02 2.208e+02 3.254e+02, threshold=3.945e+02, percent-clipped=0.0 2023-10-13 12:09:13,170 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1378048.0, ans=0.125 2023-10-13 12:09:13,637 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.35 vs. 
limit=22.5 2023-10-13 12:09:29,548 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1378094.6666666667, ans=0.125 2023-10-13 12:09:32,789 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.31 vs. limit=15.0 2023-10-13 12:10:20,436 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1378281.3333333333, ans=0.0 2023-10-13 12:10:21,940 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1378281.3333333333, ans=0.125 2023-10-13 12:10:28,978 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1378328.0, ans=0.125 2023-10-13 12:10:30,598 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1378328.0, ans=0.1 2023-10-13 12:10:33,337 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 12:10:54,330 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1378421.3333333333, ans=0.1 2023-10-13 12:11:09,445 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1378468.0, ans=0.0 2023-10-13 12:11:20,076 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.384e+02 1.730e+02 1.932e+02 2.176e+02 3.111e+02, threshold=3.863e+02, percent-clipped=0.0 2023-10-13 12:11:20,441 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1378514.6666666667, ans=0.07 2023-10-13 12:11:55,439 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.53 vs. limit=6.0 2023-10-13 12:12:08,202 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.30 vs. limit=15.0 2023-10-13 12:12:36,893 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 12:12:48,419 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.71 vs. limit=15.0 2023-10-13 12:12:56,338 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1378841.3333333333, ans=0.2 2023-10-13 12:13:24,322 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.18 vs. limit=15.0 2023-10-13 12:13:33,358 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.420e+02 1.704e+02 1.930e+02 2.181e+02 3.007e+02, threshold=3.861e+02, percent-clipped=0.0 2023-10-13 12:13:43,700 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1379028.0, ans=0.125 2023-10-13 12:13:47,106 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=3.91 vs. 
limit=10.0 2023-10-13 12:13:49,149 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1379028.0, ans=0.0 2023-10-13 12:13:49,227 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1379028.0, ans=0.125 2023-10-13 12:14:09,002 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1379074.6666666667, ans=0.125 2023-10-13 12:14:12,454 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1379074.6666666667, ans=0.1 2023-10-13 12:14:14,621 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1379121.3333333333, ans=0.0 2023-10-13 12:14:15,793 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1379121.3333333333, ans=0.125 2023-10-13 12:14:31,084 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten.whitening_limit, batch_count=1379168.0, ans=15.0 2023-10-13 12:14:42,119 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1379168.0, ans=0.0 2023-10-13 12:15:02,912 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1379261.3333333333, ans=0.0 2023-10-13 12:15:03,122 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.72 vs. limit=6.0 2023-10-13 12:15:09,037 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1379308.0, ans=0.125 2023-10-13 12:15:22,578 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1379354.6666666667, ans=0.2 2023-10-13 12:15:31,363 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=1379354.6666666667, ans=15.0 2023-10-13 12:15:34,186 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.70 vs. 
limit=15.0 2023-10-13 12:15:43,209 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1379401.3333333333, ans=0.125 2023-10-13 12:15:45,856 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1379401.3333333333, ans=0.0 2023-10-13 12:15:52,067 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.436e+02 1.663e+02 1.862e+02 2.041e+02 2.848e+02, threshold=3.723e+02, percent-clipped=0.0 2023-10-13 12:16:00,962 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1379494.6666666667, ans=0.125 2023-10-13 12:16:13,768 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1379541.3333333333, ans=0.125 2023-10-13 12:16:14,634 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1379541.3333333333, ans=0.0 2023-10-13 12:16:16,595 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1379541.3333333333, ans=0.125 2023-10-13 12:16:18,795 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.84 vs. limit=15.0 2023-10-13 12:16:34,196 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.84 vs. limit=15.0 2023-10-13 12:16:39,080 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1379634.6666666667, ans=0.2 2023-10-13 12:17:49,503 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.494e+02 1.786e+02 1.951e+02 2.121e+02 2.852e+02, threshold=3.902e+02, percent-clipped=0.0 2023-10-13 12:18:10,795 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1380008.0, ans=0.125 2023-10-13 12:18:22,569 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.62 vs. limit=15.0 2023-10-13 12:18:52,222 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1380148.0, ans=0.125 2023-10-13 12:18:56,731 INFO [train.py:1031] (3/4) Epoch 22, batch 9000, loss[loss=0.2024, simple_loss=0.2912, pruned_loss=0.05686, over 15988.00 frames. ], tot_loss[loss=0.1888, simple_loss=0.2806, pruned_loss=0.04847, over 32434819.24 frames. ], batch size: 296, lr: 1.56e-03, grad_scale: 32.0 2023-10-13 12:19:01,304 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1380194.6666666667, ans=0.0 2023-10-13 12:19:03,677 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.48 vs. 
limit=22.5 2023-10-13 12:19:18,903 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1380241.3333333333, ans=0.125 2023-10-13 12:19:28,286 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1380288.0, ans=0.2 2023-10-13 12:19:30,700 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.52 vs. limit=5.0 2023-10-13 12:19:31,673 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1380334.6666666667, ans=0.015 2023-10-13 12:19:44,782 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1380381.3333333333, ans=0.2 2023-10-13 12:19:47,028 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.466e+02 1.791e+02 1.958e+02 2.140e+02 2.839e+02, threshold=3.917e+02, percent-clipped=0.0 2023-10-13 12:20:12,835 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1380474.6666666667, ans=0.125 2023-10-13 12:20:16,148 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1380474.6666666667, ans=0.2 2023-10-13 12:20:25,130 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.42 vs. limit=12.0 2023-10-13 12:20:29,706 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1380568.0, ans=0.0 2023-10-13 12:20:37,441 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1380568.0, ans=0.125 2023-10-13 12:20:43,632 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1380614.6666666667, ans=0.1 2023-10-13 12:21:09,410 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1380708.0, ans=0.1 2023-10-13 12:21:14,669 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1380754.6666666667, ans=0.0 2023-10-13 12:21:14,733 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1380754.6666666667, ans=0.125 2023-10-13 12:21:21,807 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1380754.6666666667, ans=0.1 2023-10-13 12:21:27,540 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1380801.3333333333, ans=0.1 2023-10-13 12:21:31,081 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1380801.3333333333, ans=0.0 2023-10-13 12:21:33,745 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1380801.3333333333, ans=0.2 2023-10-13 12:21:38,837 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.380e+02 1.726e+02 1.899e+02 2.091e+02 2.641e+02, threshold=3.798e+02, percent-clipped=0.0 2023-10-13 12:21:57,509 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1380941.3333333333, ans=0.1 2023-10-13 12:21:57,673 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1380941.3333333333, ans=0.125 2023-10-13 12:22:08,527 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1380988.0, ans=0.125 2023-10-13 12:22:25,686 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1381034.6666666667, ans=0.125 2023-10-13 12:22:29,072 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1381081.3333333333, ans=0.2 2023-10-13 12:22:31,156 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1381081.3333333333, ans=0.5 2023-10-13 12:22:36,012 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1381081.3333333333, ans=0.125 2023-10-13 12:22:49,321 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1381128.0, ans=0.1 2023-10-13 12:23:07,542 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1381221.3333333333, ans=0.04949747468305833 2023-10-13 12:23:08,244 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1381221.3333333333, ans=0.04949747468305833 2023-10-13 12:23:15,462 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1381268.0, ans=0.125 2023-10-13 12:23:23,409 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1381314.6666666667, ans=0.2 2023-10-13 12:23:25,981 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.426e+02 1.811e+02 2.014e+02 2.178e+02 3.000e+02, threshold=4.029e+02, percent-clipped=0.0 2023-10-13 12:23:38,425 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1381361.3333333333, ans=0.125 2023-10-13 12:23:40,318 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1381361.3333333333, ans=0.125 2023-10-13 12:23:44,613 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1381361.3333333333, ans=0.0 2023-10-13 12:23:50,161 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.34 vs. limit=6.0 2023-10-13 12:24:02,192 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.37 vs. 
limit=10.0 2023-10-13 12:24:18,197 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 12:24:22,902 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1381548.0, ans=0.0 2023-10-13 12:24:44,551 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1381641.3333333333, ans=0.0 2023-10-13 12:24:57,439 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.74 vs. limit=15.0 2023-10-13 12:25:00,795 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1381688.0, ans=0.125 2023-10-13 12:25:09,187 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.51 vs. limit=15.0 2023-10-13 12:25:16,479 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.59 vs. limit=15.0 2023-10-13 12:25:27,546 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.573e+02 1.777e+02 1.930e+02 2.177e+02 2.869e+02, threshold=3.860e+02, percent-clipped=0.0 2023-10-13 12:25:36,458 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1381828.0, ans=0.125 2023-10-13 12:26:07,634 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1381921.3333333333, ans=0.2 2023-10-13 12:26:12,507 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.85 vs. limit=6.0 2023-10-13 12:26:28,397 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1381968.0, ans=0.2 2023-10-13 12:26:52,165 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1382061.3333333333, ans=0.125 2023-10-13 12:27:00,804 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.57 vs. limit=22.5 2023-10-13 12:27:07,598 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1382108.0, ans=0.1 2023-10-13 12:27:12,049 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1382108.0, ans=0.0 2023-10-13 12:27:23,925 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1382154.6666666667, ans=0.2 2023-10-13 12:27:42,921 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1382248.0, ans=0.2 2023-10-13 12:27:44,834 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.445e+02 1.828e+02 1.987e+02 2.304e+02 3.669e+02, threshold=3.974e+02, percent-clipped=0.0 2023-10-13 12:27:50,548 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.46 vs. 
limit=10.0 2023-10-13 12:28:53,783 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1382481.3333333333, ans=0.0 2023-10-13 12:28:55,521 INFO [train.py:1031] (3/4) Epoch 22, batch 9500, loss[loss=0.1713, simple_loss=0.277, pruned_loss=0.03285, over 16884.00 frames. ], tot_loss[loss=0.1893, simple_loss=0.2811, pruned_loss=0.04874, over 32484833.71 frames. ], batch size: 104, lr: 1.56e-03, grad_scale: 32.0 2023-10-13 12:28:59,288 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.71 vs. limit=10.0 2023-10-13 12:28:59,881 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 12:29:01,008 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1382528.0, ans=0.025 2023-10-13 12:29:06,063 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.06 vs. limit=15.0 2023-10-13 12:29:17,188 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1382574.6666666667, ans=0.0 2023-10-13 12:29:20,895 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.03 vs. limit=15.0 2023-10-13 12:29:34,809 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1382668.0, ans=0.125 2023-10-13 12:29:35,805 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1382668.0, ans=0.0 2023-10-13 12:29:47,731 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.403e+02 1.862e+02 2.055e+02 2.288e+02 2.911e+02, threshold=4.109e+02, percent-clipped=0.0 2023-10-13 12:30:16,770 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1382854.6666666667, ans=0.09899494936611666 2023-10-13 12:30:28,412 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1382901.3333333333, ans=0.125 2023-10-13 12:30:37,310 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1382901.3333333333, ans=0.125 2023-10-13 12:30:38,457 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1382901.3333333333, ans=0.0 2023-10-13 12:30:51,892 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1382994.6666666667, ans=0.125 2023-10-13 12:31:05,732 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1383041.3333333333, ans=0.1 2023-10-13 12:31:25,296 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1383088.0, ans=0.0 2023-10-13 12:31:30,067 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1383134.6666666667, ans=0.0 2023-10-13 12:31:32,469 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, 
num_groups=1, num_channels=256, metric=6.12 vs. limit=15.0 2023-10-13 12:31:35,848 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1383134.6666666667, ans=0.125 2023-10-13 12:31:35,851 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1383134.6666666667, ans=0.125 2023-10-13 12:31:37,045 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=10.30 vs. limit=22.5 2023-10-13 12:31:37,780 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1383134.6666666667, ans=0.1 2023-10-13 12:31:37,784 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1383134.6666666667, ans=0.1 2023-10-13 12:31:46,827 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.503e+02 1.794e+02 1.928e+02 2.166e+02 3.044e+02, threshold=3.855e+02, percent-clipped=0.0 2023-10-13 12:32:25,111 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1383321.3333333333, ans=0.0 2023-10-13 12:32:27,579 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.71 vs. limit=22.5 2023-10-13 12:33:35,497 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1383601.3333333333, ans=0.0 2023-10-13 12:33:37,993 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1383601.3333333333, ans=0.1 2023-10-13 12:33:51,965 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.434e+02 1.758e+02 1.946e+02 2.173e+02 3.220e+02, threshold=3.892e+02, percent-clipped=0.0 2023-10-13 12:33:55,291 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1383648.0, ans=0.07 2023-10-13 12:34:01,061 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1383694.6666666667, ans=0.125 2023-10-13 12:34:01,139 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 12:34:41,102 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1383834.6666666667, ans=0.125 2023-10-13 12:34:56,665 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1383928.0, ans=0.125 2023-10-13 12:34:58,802 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1383928.0, ans=0.035 2023-10-13 12:35:01,664 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1383928.0, ans=0.125 2023-10-13 12:35:24,070 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.10 vs. 
limit=15.0 2023-10-13 12:35:49,127 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.435e+02 1.746e+02 1.893e+02 2.072e+02 2.744e+02, threshold=3.785e+02, percent-clipped=0.0 2023-10-13 12:36:13,576 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1384208.0, ans=0.125 2023-10-13 12:36:44,652 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1384348.0, ans=0.125 2023-10-13 12:36:53,992 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1384394.6666666667, ans=0.0 2023-10-13 12:37:02,931 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1384394.6666666667, ans=0.2 2023-10-13 12:37:18,574 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1384488.0, ans=0.0 2023-10-13 12:37:29,622 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1384534.6666666667, ans=0.125 2023-10-13 12:37:30,743 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.54 vs. limit=22.5 2023-10-13 12:37:30,929 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.72 vs. limit=15.0 2023-10-13 12:37:32,476 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1384534.6666666667, ans=0.025 2023-10-13 12:37:40,116 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1384581.3333333333, ans=0.125 2023-10-13 12:37:41,622 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.475e+02 1.715e+02 1.819e+02 2.000e+02 2.625e+02, threshold=3.637e+02, percent-clipped=0.0 2023-10-13 12:38:30,765 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1384814.6666666667, ans=0.0 2023-10-13 12:38:30,767 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1384814.6666666667, ans=0.0 2023-10-13 12:38:42,969 INFO [train.py:1031] (3/4) Epoch 22, batch 10000, loss[loss=0.1801, simple_loss=0.2737, pruned_loss=0.04323, over 16819.00 frames. ], tot_loss[loss=0.1885, simple_loss=0.2802, pruned_loss=0.04843, over 32523409.38 frames. ], batch size: 87, lr: 1.56e-03, grad_scale: 16.0 2023-10-13 12:39:15,371 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.92 vs. 
limit=15.0 2023-10-13 12:39:32,252 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1385048.0, ans=0.2 2023-10-13 12:39:32,450 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 12:39:36,185 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.546e+02 1.783e+02 1.937e+02 2.133e+02 3.265e+02, threshold=3.874e+02, percent-clipped=0.0 2023-10-13 12:40:01,858 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1385141.3333333333, ans=0.04949747468305833 2023-10-13 12:40:05,581 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.96 vs. limit=15.0 2023-10-13 12:40:15,380 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1385188.0, ans=0.125 2023-10-13 12:40:33,555 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1385234.6666666667, ans=0.0 2023-10-13 12:40:43,740 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1385281.3333333333, ans=0.125 2023-10-13 12:40:58,759 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1385328.0, ans=0.125 2023-10-13 12:41:06,561 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1385374.6666666667, ans=0.125 2023-10-13 12:41:15,268 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1385421.3333333333, ans=0.2 2023-10-13 12:41:20,245 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1385421.3333333333, ans=0.125 2023-10-13 12:41:23,212 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=10.68 vs. limit=22.5 2023-10-13 12:41:45,251 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.465e+02 1.791e+02 1.981e+02 2.268e+02 3.504e+02, threshold=3.962e+02, percent-clipped=0.0 2023-10-13 12:41:51,873 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1385561.3333333333, ans=0.1 2023-10-13 12:41:59,164 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1385561.3333333333, ans=0.125 2023-10-13 12:42:00,115 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1385561.3333333333, ans=0.125 2023-10-13 12:42:07,188 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.77 vs. 
2023-10-13 12:42:15,923 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1385654.6666666667, ans=0.125
2023-10-13 12:42:25,509 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1385701.3333333333, ans=0.1
2023-10-13 12:42:29,946 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1385701.3333333333, ans=0.125
2023-10-13 12:42:34,883 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1385748.0, ans=0.125
2023-10-13 12:42:45,222 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1385748.0, ans=0.0
2023-10-13 12:42:55,734 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=1385794.6666666667, ans=10.0
2023-10-13 12:43:08,663 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1385841.3333333333, ans=0.2
2023-10-13 12:43:23,289 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.76 vs. limit=6.0
2023-10-13 12:43:37,422 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1385934.6666666667, ans=0.0
2023-10-13 12:43:49,447 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.482e+02 1.850e+02 1.977e+02 2.167e+02 1.134e+03, threshold=3.954e+02, percent-clipped=1.0
2023-10-13 12:43:49,925 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1385981.3333333333, ans=0.2
2023-10-13 12:44:10,649 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1386074.6666666667, ans=0.2
2023-10-13 12:44:15,140 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1386074.6666666667, ans=0.1
2023-10-13 12:44:36,220 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1386168.0, ans=0.0
2023-10-13 12:44:51,291 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1386214.6666666667, ans=0.125
2023-10-13 12:44:54,708 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1386214.6666666667, ans=0.2
2023-10-13 12:45:17,217 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.44 vs. limit=15.0
2023-10-13 12:45:21,624 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=11.95 vs.
limit=22.5 2023-10-13 12:45:56,347 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.562e+02 1.723e+02 1.882e+02 2.099e+02 3.050e+02, threshold=3.764e+02, percent-clipped=0.0 2023-10-13 12:46:02,044 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1386494.6666666667, ans=0.2 2023-10-13 12:46:03,128 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1386494.6666666667, ans=0.125 2023-10-13 12:46:05,441 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.21 vs. limit=15.0 2023-10-13 12:46:07,944 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 12:46:25,083 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.84 vs. limit=6.0 2023-10-13 12:47:01,104 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1386681.3333333333, ans=0.0 2023-10-13 12:47:07,020 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1386728.0, ans=0.1 2023-10-13 12:47:08,782 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1386728.0, ans=0.125 2023-10-13 12:47:11,230 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.04 vs. limit=10.0 2023-10-13 12:47:19,950 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1386774.6666666667, ans=0.125 2023-10-13 12:47:34,936 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1386821.3333333333, ans=0.0 2023-10-13 12:47:57,908 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=10.67 vs. 
limit=12.0 2023-10-13 12:48:05,625 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1386914.6666666667, ans=0.035 2023-10-13 12:48:07,396 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.431e+02 1.727e+02 1.862e+02 2.093e+02 2.878e+02, threshold=3.725e+02, percent-clipped=0.0 2023-10-13 12:48:14,466 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 12:48:21,967 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1387008.0, ans=0.0 2023-10-13 12:48:34,062 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1387054.6666666667, ans=0.125 2023-10-13 12:48:39,565 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1387054.6666666667, ans=0.0 2023-10-13 12:48:51,596 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1387101.3333333333, ans=0.07 2023-10-13 12:49:06,502 INFO [train.py:1031] (3/4) Epoch 22, batch 10500, loss[loss=0.1809, simple_loss=0.2803, pruned_loss=0.04072, over 16856.00 frames. ], tot_loss[loss=0.1888, simple_loss=0.2807, pruned_loss=0.04847, over 32591855.91 frames. ], batch size: 98, lr: 1.56e-03, grad_scale: 16.0 2023-10-13 12:49:20,827 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1387241.3333333333, ans=0.0 2023-10-13 12:49:25,032 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1387241.3333333333, ans=0.125 2023-10-13 12:49:37,995 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.25 vs. limit=10.0 2023-10-13 12:49:42,510 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1387334.6666666667, ans=0.125 2023-10-13 12:49:50,644 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1387381.3333333333, ans=0.125 2023-10-13 12:49:54,291 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.73 vs. 
limit=15.0 2023-10-13 12:49:57,456 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.442e+02 1.727e+02 1.871e+02 2.053e+02 2.736e+02, threshold=3.741e+02, percent-clipped=0.0 2023-10-13 12:49:59,685 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1387381.3333333333, ans=0.125 2023-10-13 12:50:29,891 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1387474.6666666667, ans=0.1 2023-10-13 12:50:36,477 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1387521.3333333333, ans=0.0 2023-10-13 12:50:57,610 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1387568.0, ans=0.2 2023-10-13 12:50:58,690 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1387614.6666666667, ans=0.125 2023-10-13 12:51:01,915 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1387614.6666666667, ans=0.0 2023-10-13 12:51:13,317 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1387661.3333333333, ans=0.0 2023-10-13 12:51:16,595 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1387661.3333333333, ans=0.2 2023-10-13 12:51:18,481 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.84 vs. limit=15.0 2023-10-13 12:51:26,521 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1387708.0, ans=0.0 2023-10-13 12:51:30,167 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1387708.0, ans=0.0 2023-10-13 12:51:39,642 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1387754.6666666667, ans=0.0 2023-10-13 12:51:41,000 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1387754.6666666667, ans=0.125 2023-10-13 12:51:45,412 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1387754.6666666667, ans=0.125 2023-10-13 12:52:00,755 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.67 vs. 
limit=10.0 2023-10-13 12:52:04,610 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1387848.0, ans=0.125 2023-10-13 12:52:06,437 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.477e+02 1.784e+02 1.885e+02 2.137e+02 2.709e+02, threshold=3.770e+02, percent-clipped=0.0 2023-10-13 12:52:12,545 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1387894.6666666667, ans=0.5 2023-10-13 12:52:29,919 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 12:52:48,018 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1388034.6666666667, ans=0.125 2023-10-13 12:52:50,025 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.99 vs. limit=22.5 2023-10-13 12:53:04,803 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 12:53:04,918 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1388081.3333333333, ans=0.1 2023-10-13 12:53:59,397 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1388268.0, ans=0.125 2023-10-13 12:54:01,091 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1388268.0, ans=0.125 2023-10-13 12:54:10,683 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.388e+02 1.818e+02 2.018e+02 2.307e+02 3.203e+02, threshold=4.035e+02, percent-clipped=0.0 2023-10-13 12:54:15,207 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1388361.3333333333, ans=0.1 2023-10-13 12:54:15,254 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1388361.3333333333, ans=0.125 2023-10-13 12:55:10,114 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1388548.0, ans=0.125 2023-10-13 12:55:21,656 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1388594.6666666667, ans=0.1 2023-10-13 12:55:26,960 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer_ff3.min_abs, batch_count=1388641.3333333333, ans=0.2 2023-10-13 12:55:27,870 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1388641.3333333333, ans=0.0 2023-10-13 12:55:31,607 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1388641.3333333333, ans=0.2 2023-10-13 12:55:31,678 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1388641.3333333333, ans=0.125 2023-10-13 12:55:38,394 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1388688.0, ans=0.125 2023-10-13 12:55:48,911 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, 
batch_count=1388734.6666666667, ans=0.125
2023-10-13 12:56:08,661 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.472e+02 1.796e+02 2.061e+02 2.325e+02 3.179e+02, threshold=4.122e+02, percent-clipped=0.0
2023-10-13 12:56:12,247 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.28 vs. limit=15.0
2023-10-13 12:56:49,809 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1388968.0, ans=0.1
2023-10-13 12:56:52,810 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1388968.0, ans=0.025
2023-10-13 12:57:05,656 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-13 12:57:14,148 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1389061.3333333333, ans=0.125
2023-10-13 12:57:23,500 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1389108.0, ans=0.125
2023-10-13 12:57:30,549 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.28 vs. limit=12.0
2023-10-13 12:57:36,291 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1389154.6666666667, ans=0.125
2023-10-13 12:58:03,754 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=1389248.0, ans=10.0
2023-10-13 12:58:05,742 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.372e+02 1.684e+02 1.875e+02 2.138e+02 2.961e+02, threshold=3.749e+02, percent-clipped=0.0
2023-10-13 12:58:09,616 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.34 vs. limit=6.0
2023-10-13 12:58:14,922 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1389294.6666666667, ans=0.0
2023-10-13 12:58:27,093 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1389341.3333333333, ans=0.125
2023-10-13 12:58:50,283 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1389434.6666666667, ans=0.0
2023-10-13 12:59:01,635 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.70 vs. limit=15.0
2023-10-13 12:59:09,920 INFO [train.py:1031] (3/4) Epoch 22, batch 11000, loss[loss=0.2058, simple_loss=0.3011, pruned_loss=0.05521, over 16821.00 frames. ], tot_loss[loss=0.189, simple_loss=0.2808, pruned_loss=0.04859, over 32621342.41 frames. ], batch size: 155, lr: 1.56e-03, grad_scale: 8.0
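In the train.py:1031 entries, the per-batch loss is a weighted sum of the two transducer terms: every line in this section satisfies loss = 0.5 * simple_loss + pruned_loss (e.g. 0.5 * 0.3011 + 0.05521 = 0.2058 in the batch 11000 entry just above), while tot_loss[... over N frames] is a running aggregate whose frame count grows batch to batch. The sketch below reproduces that arithmetic; the exact averaging window train.py uses for tot_loss is an assumption here (a plain frame-weighted running mean).

```python
class LossTracker:
    """Frame-weighted running average, a guess at the tot_loss[...] fields."""
    def __init__(self) -> None:
        self.weighted_sum = 0.0
        self.frames = 0.0

    def update(self, loss: float, num_frames: float) -> float:
        self.weighted_sum += loss * num_frames
        self.frames += num_frames
        return self.weighted_sum / self.frames

def combined_loss(simple_loss: float, pruned_loss: float,
                  simple_loss_scale: float = 0.5) -> float:
    # Weighting recovered from the logged batches themselves.
    return simple_loss_scale * simple_loss + pruned_loss

print(combined_loss(0.3011, 0.05521))   # ~0.2058, as in the batch 11000 line
tracker = LossTracker()
print(tracker.update(0.2058, 16821.0))  # running average seeded by one batch
```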
2023-10-13 12:59:21,531 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1389574.6666666667, ans=0.2
2023-10-13 12:59:37,347 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1389621.3333333333, ans=0.2
2023-10-13 13:00:08,099 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.797e+02 1.991e+02 2.175e+02 2.656e+02, threshold=3.981e+02, percent-clipped=0.0
2023-10-13 13:00:32,674 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1389808.0, ans=0.125
2023-10-13 13:00:50,600 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1389901.3333333333, ans=0.125
2023-10-13 13:01:09,381 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.81 vs. limit=15.0
2023-10-13 13:01:26,552 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1389994.6666666667, ans=0.1
2023-10-13 13:01:27,853 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1389994.6666666667, ans=0.1
2023-10-13 13:01:37,123 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1390041.3333333333, ans=0.1
2023-10-13 13:01:46,692 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.83 vs. limit=15.0
2023-10-13 13:01:56,854 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.76 vs. limit=15.0
2023-10-13 13:02:06,963 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1390134.6666666667, ans=0.2
2023-10-13 13:02:08,898 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1390181.3333333333, ans=0.125
2023-10-13 13:02:15,611 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.77 vs. limit=10.0
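The scaling.py:199 ScheduledFloat entries record named scalars that are functions of batch_count: the various *_skip_rate values have decayed to ans=0.0 by this point in training, dropout_p is held at 0.1, balancer probabilities at 0.125, bypass scale_min at 0.2, and so on. A piecewise-linear schedule over batch count reproduces that behaviour; the sketch below is illustrative, and the breakpoints in the example are made-up values, not the recipe's actual schedule.

```python
from bisect import bisect_right

class ScheduledFloat:
    """Piecewise-linear scalar schedule over batch_count (illustrative)."""
    def __init__(self, *points):
        self.points = sorted(points)        # (batch_count, value) pairs

    def value(self, batch_count: float) -> float:
        pts = self.points
        xs = [p[0] for p in pts]
        i = bisect_right(xs, batch_count)
        if i == 0:                          # before the first breakpoint
            return pts[0][1]
        if i == len(pts):                   # past the last breakpoint
            return pts[-1][1]
        (x0, y0), (x1, y1) = pts[i - 1], pts[i]
        t = (batch_count - x0) / (x1 - x0)  # linear interpolation
        return y0 + t * (y1 - y0)

# A skip-rate that decays to zero, like the *_skip_rate entries nearby
# (hypothetical breakpoints):
skip_rate = ScheduledFloat((0.0, 0.5), (4000.0, 0.05), (16000.0, 0.0))
print(skip_rate.value(1389621.0))   # 0.0 this deep into training (ans=0.0)
```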
2023-10-13 13:02:19,101 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.451e+02 1.746e+02 1.993e+02 2.338e+02 3.731e+02, threshold=3.986e+02, percent-clipped=0.0
2023-10-13 13:02:24,039 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1390228.0, ans=0.0
2023-10-13 13:02:24,095 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1390228.0, ans=0.125
2023-10-13 13:02:26,502 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1390228.0, ans=0.125
2023-10-13 13:02:53,009 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=2.594e-03
2023-10-13 13:02:53,014 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1390321.3333333333, ans=0.5
2023-10-13 13:03:24,856 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.21 vs. limit=12.0
2023-10-13 13:03:25,567 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1390414.6666666667, ans=0.0
2023-10-13 13:03:32,722 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=1390461.3333333333, ans=0.5
2023-10-13 13:03:43,943 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1390508.0, ans=0.0
2023-10-13 13:04:04,432 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1390601.3333333333, ans=0.125
2023-10-13 13:04:22,613 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.413e+02 1.706e+02 1.875e+02 2.085e+02 2.870e+02, threshold=3.749e+02, percent-clipped=0.0
2023-10-13 13:04:32,774 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.76 vs. limit=15.0
2023-10-13 13:04:36,865 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1390741.3333333333, ans=0.1
2023-10-13 13:04:41,232 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=11.89 vs. limit=22.5
2023-10-13 13:05:03,670 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.60 vs.
limit=15.0 2023-10-13 13:05:14,073 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1390834.6666666667, ans=0.125 2023-10-13 13:05:36,000 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1390881.3333333333, ans=0.2 2023-10-13 13:05:36,055 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1390881.3333333333, ans=0.125 2023-10-13 13:05:37,183 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1390928.0, ans=0.125 2023-10-13 13:06:04,773 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1391021.3333333333, ans=10.0 2023-10-13 13:06:05,096 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.45 vs. limit=12.0 2023-10-13 13:06:22,678 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1391068.0, ans=0.2 2023-10-13 13:06:37,368 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1391114.6666666667, ans=0.0 2023-10-13 13:06:37,969 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.464e+02 1.712e+02 1.890e+02 2.236e+02 3.127e+02, threshold=3.781e+02, percent-clipped=0.0 2023-10-13 13:06:55,771 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1391208.0, ans=0.0 2023-10-13 13:06:55,784 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1391208.0, ans=0.0 2023-10-13 13:07:06,377 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.22 vs. limit=22.5 2023-10-13 13:07:09,288 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1391254.6666666667, ans=0.125 2023-10-13 13:07:22,554 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.39 vs. 
limit=6.0 2023-10-13 13:07:47,008 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1391394.6666666667, ans=0.0 2023-10-13 13:08:07,967 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1391488.0, ans=0.1 2023-10-13 13:08:30,624 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1391581.3333333333, ans=0.0 2023-10-13 13:08:35,176 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1391581.3333333333, ans=0.0 2023-10-13 13:08:41,590 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1391581.3333333333, ans=0.125 2023-10-13 13:08:44,158 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.518e+02 1.794e+02 1.956e+02 2.153e+02 2.781e+02, threshold=3.912e+02, percent-clipped=0.0 2023-10-13 13:09:00,227 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1391674.6666666667, ans=0.0 2023-10-13 13:09:04,958 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1391674.6666666667, ans=0.1 2023-10-13 13:09:10,588 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1391721.3333333333, ans=0.125 2023-10-13 13:09:13,985 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1391721.3333333333, ans=0.125 2023-10-13 13:09:32,542 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.44 vs. limit=12.0 2023-10-13 13:09:47,534 INFO [train.py:1031] (3/4) Epoch 22, batch 11500, loss[loss=0.1967, simple_loss=0.2991, pruned_loss=0.04713, over 16923.00 frames. ], tot_loss[loss=0.1888, simple_loss=0.2806, pruned_loss=0.0485, over 32664956.16 frames. ], batch size: 138, lr: 1.56e-03, grad_scale: 16.0 2023-10-13 13:10:04,896 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1391908.0, ans=0.1 2023-10-13 13:10:11,366 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1391954.6666666667, ans=0.125 2023-10-13 13:10:15,460 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.97 vs. limit=15.0 2023-10-13 13:10:25,781 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1392001.3333333333, ans=0.2 2023-10-13 13:10:35,632 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1392048.0, ans=0.125 2023-10-13 13:10:38,117 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=10.32 vs. 
limit=12.0 2023-10-13 13:10:44,263 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.514e+02 1.768e+02 1.954e+02 2.200e+02 3.243e+02, threshold=3.907e+02, percent-clipped=0.0 2023-10-13 13:11:07,644 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1392141.3333333333, ans=0.125 2023-10-13 13:11:08,510 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 13:11:08,654 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.18 vs. limit=22.5 2023-10-13 13:11:54,131 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1392328.0, ans=0.125 2023-10-13 13:12:14,474 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1392374.6666666667, ans=0.125 2023-10-13 13:12:49,527 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.429e+02 1.763e+02 2.002e+02 2.307e+02 3.774e+02, threshold=4.003e+02, percent-clipped=0.0 2023-10-13 13:12:59,172 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1392561.3333333333, ans=0.125 2023-10-13 13:13:10,732 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.66 vs. limit=22.5 2023-10-13 13:13:18,384 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1392654.6666666667, ans=0.125 2023-10-13 13:13:27,203 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1392701.3333333333, ans=0.125 2023-10-13 13:13:32,616 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.97 vs. limit=15.0 2023-10-13 13:13:46,678 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1392794.6666666667, ans=0.125 2023-10-13 13:14:01,395 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1392841.3333333333, ans=0.125 2023-10-13 13:14:45,969 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1392981.3333333333, ans=0.125 2023-10-13 13:14:48,139 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.343e+02 1.870e+02 2.107e+02 2.456e+02 3.402e+02, threshold=4.215e+02, percent-clipped=0.0 2023-10-13 13:15:38,591 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=14.11 vs. limit=22.5 2023-10-13 13:15:45,385 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1393168.0, ans=0.125 2023-10-13 13:15:52,260 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.84 vs. 
limit=15.0 2023-10-13 13:16:17,884 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1393308.0, ans=0.0 2023-10-13 13:16:28,955 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1393308.0, ans=0.125 2023-10-13 13:16:43,577 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1393401.3333333333, ans=0.125 2023-10-13 13:16:44,873 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1393401.3333333333, ans=0.0 2023-10-13 13:17:01,787 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1393448.0, ans=0.2 2023-10-13 13:17:05,587 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.491e+02 1.746e+02 1.911e+02 2.144e+02 3.347e+02, threshold=3.821e+02, percent-clipped=0.0 2023-10-13 13:17:17,139 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1393494.6666666667, ans=0.125 2023-10-13 13:17:19,599 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.32 vs. limit=22.5 2023-10-13 13:17:26,719 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.26 vs. limit=22.5 2023-10-13 13:18:01,305 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1393681.3333333333, ans=0.125 2023-10-13 13:18:01,341 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1393681.3333333333, ans=0.1 2023-10-13 13:18:02,803 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.28 vs. limit=6.0 2023-10-13 13:18:19,297 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.75 vs. limit=15.0 2023-10-13 13:18:26,909 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1393774.6666666667, ans=0.1 2023-10-13 13:18:38,251 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1393821.3333333333, ans=0.1 2023-10-13 13:18:47,060 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.57 vs. 
limit=22.5 2023-10-13 13:18:52,927 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1393868.0, ans=0.125 2023-10-13 13:19:00,349 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1393868.0, ans=0.0 2023-10-13 13:19:14,924 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.424e+02 1.854e+02 2.030e+02 2.223e+02 3.205e+02, threshold=4.059e+02, percent-clipped=0.0 2023-10-13 13:19:20,959 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1393961.3333333333, ans=0.125 2023-10-13 13:19:31,066 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1394008.0, ans=0.125 2023-10-13 13:19:32,187 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1394008.0, ans=0.1 2023-10-13 13:19:35,883 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1394008.0, ans=0.1 2023-10-13 13:19:38,742 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1394008.0, ans=0.2 2023-10-13 13:20:01,867 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1394101.3333333333, ans=0.0 2023-10-13 13:20:02,761 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1394101.3333333333, ans=0.1 2023-10-13 13:20:04,718 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1394148.0, ans=0.0 2023-10-13 13:20:16,321 INFO [train.py:1031] (3/4) Epoch 22, batch 12000, loss[loss=0.1868, simple_loss=0.2786, pruned_loss=0.04751, over 16009.00 frames. ], tot_loss[loss=0.1886, simple_loss=0.2807, pruned_loss=0.04827, over 32678823.04 frames. ], batch size: 43, lr: 1.55e-03, grad_scale: 32.0 2023-10-13 13:20:23,546 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1394194.6666666667, ans=0.0 2023-10-13 13:21:07,797 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 13:21:08,136 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.96 vs. limit=15.0 2023-10-13 13:21:14,849 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.413e+02 1.790e+02 1.932e+02 2.111e+02 2.818e+02, threshold=3.864e+02, percent-clipped=0.0 2023-10-13 13:21:26,173 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1394428.0, ans=0.0 2023-10-13 13:21:30,180 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1394474.6666666667, ans=0.0 2023-10-13 13:22:05,365 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1394614.6666666667, ans=0.0 2023-10-13 13:22:10,846 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.60 vs. 
limit=15.0 2023-10-13 13:22:12,228 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1394614.6666666667, ans=0.0 2023-10-13 13:22:21,455 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1394661.3333333333, ans=0.125 2023-10-13 13:22:40,209 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1394754.6666666667, ans=0.0 2023-10-13 13:22:45,643 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1394754.6666666667, ans=0.125 2023-10-13 13:23:10,205 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.422e+02 1.722e+02 1.945e+02 2.246e+02 3.327e+02, threshold=3.890e+02, percent-clipped=0.0 2023-10-13 13:23:19,744 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1394894.6666666667, ans=0.1 2023-10-13 13:23:23,246 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1394894.6666666667, ans=0.0 2023-10-13 13:23:35,530 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1394941.3333333333, ans=0.2 2023-10-13 13:23:46,325 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1394988.0, ans=0.125 2023-10-13 13:23:47,265 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1394988.0, ans=0.125 2023-10-13 13:23:52,858 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.67 vs. limit=10.0 2023-10-13 13:23:55,411 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1395034.6666666667, ans=0.1 2023-10-13 13:23:55,794 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.55 vs. limit=15.0 2023-10-13 13:23:58,698 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.20 vs. 
limit=10.0 2023-10-13 13:24:09,620 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1395081.3333333333, ans=0.025 2023-10-13 13:24:53,091 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1395268.0, ans=0.125 2023-10-13 13:24:55,937 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1395268.0, ans=0.2 2023-10-13 13:25:05,094 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1395314.6666666667, ans=0.1 2023-10-13 13:25:05,099 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1395314.6666666667, ans=0.0 2023-10-13 13:25:06,777 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1395314.6666666667, ans=0.125 2023-10-13 13:25:13,842 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.559e+02 1.920e+02 2.061e+02 2.293e+02 5.987e+02, threshold=4.122e+02, percent-clipped=1.0 2023-10-13 13:25:18,931 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1395361.3333333333, ans=0.125 2023-10-13 13:25:35,383 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1395454.6666666667, ans=0.125 2023-10-13 13:25:44,244 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1395454.6666666667, ans=0.0 2023-10-13 13:25:48,420 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1395501.3333333333, ans=0.0 2023-10-13 13:25:58,172 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1395501.3333333333, ans=0.1 2023-10-13 13:26:31,913 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1395641.3333333333, ans=0.125 2023-10-13 13:26:32,365 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.32 vs. limit=15.0 2023-10-13 13:27:05,487 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.33 vs. 
limit=22.5 2023-10-13 13:27:07,968 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.556e+02 1.786e+02 2.021e+02 2.384e+02 3.580e+02, threshold=4.042e+02, percent-clipped=0.0 2023-10-13 13:27:32,282 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1395874.6666666667, ans=0.125 2023-10-13 13:28:26,936 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1396108.0, ans=0.07 2023-10-13 13:28:36,551 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1396154.6666666667, ans=0.0 2023-10-13 13:28:39,597 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1396154.6666666667, ans=0.2 2023-10-13 13:28:56,844 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1396201.3333333333, ans=0.1 2023-10-13 13:29:02,093 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1396248.0, ans=0.1 2023-10-13 13:29:10,606 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1396248.0, ans=0.1 2023-10-13 13:29:14,348 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.523e+02 1.823e+02 2.014e+02 2.328e+02 3.441e+02, threshold=4.027e+02, percent-clipped=0.0 2023-10-13 13:29:17,256 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.59 vs. limit=6.0 2023-10-13 13:29:24,565 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1396294.6666666667, ans=0.1 2023-10-13 13:29:24,613 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1396294.6666666667, ans=0.0 2023-10-13 13:29:25,558 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1396341.3333333333, ans=0.125 2023-10-13 13:30:00,713 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1396434.6666666667, ans=0.125 2023-10-13 13:30:16,243 INFO [train.py:1031] (3/4) Epoch 22, batch 12500, loss[loss=0.1864, simple_loss=0.2805, pruned_loss=0.04611, over 16847.00 frames. ], tot_loss[loss=0.1884, simple_loss=0.2803, pruned_loss=0.04826, over 32712030.59 frames. ], batch size: 87, lr: 1.55e-03, grad_scale: 32.0 2023-10-13 13:30:16,972 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=13.55 vs. 
limit=15.0 2023-10-13 13:30:48,144 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1396621.3333333333, ans=0.125 2023-10-13 13:31:13,886 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1396714.6666666667, ans=0.2 2023-10-13 13:31:17,120 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.506e+02 1.752e+02 1.864e+02 2.137e+02 2.634e+02, threshold=3.727e+02, percent-clipped=0.0 2023-10-13 13:31:31,776 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1396808.0, ans=0.125 2023-10-13 13:31:40,590 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1396854.6666666667, ans=0.0 2023-10-13 13:31:57,406 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=1396901.3333333333, ans=0.5 2023-10-13 13:32:18,141 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.49 vs. limit=15.0 2023-10-13 13:32:18,954 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1396994.6666666667, ans=0.1 2023-10-13 13:32:34,467 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1397041.3333333333, ans=0.1 2023-10-13 13:33:03,547 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.87 vs. limit=15.0 2023-10-13 13:33:10,020 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1397181.3333333333, ans=0.5 2023-10-13 13:33:16,870 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.409e+02 1.753e+02 1.913e+02 2.099e+02 2.465e+02, threshold=3.827e+02, percent-clipped=0.0 2023-10-13 13:33:25,078 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1397228.0, ans=0.0 2023-10-13 13:33:26,592 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1397228.0, ans=0.125 2023-10-13 13:33:41,394 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.40 vs. limit=6.0 2023-10-13 13:33:56,724 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1397368.0, ans=0.0 2023-10-13 13:34:12,421 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1397414.6666666667, ans=0.125 2023-10-13 13:34:32,500 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.83 vs. 
limit=15.0 2023-10-13 13:34:36,280 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1397508.0, ans=0.125 2023-10-13 13:34:37,256 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1397508.0, ans=0.0 2023-10-13 13:34:44,899 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.01 vs. limit=15.0 2023-10-13 13:34:49,156 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1397554.6666666667, ans=0.125 2023-10-13 13:35:01,740 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1397601.3333333333, ans=0.0 2023-10-13 13:35:08,777 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.67 vs. limit=15.0 2023-10-13 13:35:20,132 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.516e+02 1.748e+02 1.921e+02 2.149e+02 3.259e+02, threshold=3.841e+02, percent-clipped=0.0 2023-10-13 13:35:26,082 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1397694.6666666667, ans=0.0 2023-10-13 13:35:37,619 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1397741.3333333333, ans=0.125 2023-10-13 13:35:50,030 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1397788.0, ans=0.125 2023-10-13 13:36:08,570 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1397881.3333333333, ans=0.125 2023-10-13 13:36:16,935 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1397881.3333333333, ans=0.0 2023-10-13 13:36:21,004 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1397928.0, ans=0.1 2023-10-13 13:36:23,274 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1397928.0, ans=0.0 2023-10-13 13:36:33,133 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1397974.6666666667, ans=0.0 2023-10-13 13:37:04,383 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1398068.0, ans=0.1 2023-10-13 13:37:13,131 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1398114.6666666667, ans=0.0 2023-10-13 13:37:13,313 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1398114.6666666667, ans=0.1 2023-10-13 13:37:20,846 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.468e+02 1.805e+02 2.046e+02 2.246e+02 3.044e+02, threshold=4.092e+02, percent-clipped=0.0 2023-10-13 13:37:31,731 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.25 vs. 
limit=6.0 2023-10-13 13:38:01,172 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1398301.3333333333, ans=0.2 2023-10-13 13:38:10,672 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1398348.0, ans=0.2 2023-10-13 13:38:14,951 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1398348.0, ans=0.125 2023-10-13 13:38:30,786 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1398394.6666666667, ans=0.125 2023-10-13 13:38:35,122 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.76 vs. limit=15.0 2023-10-13 13:38:59,546 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff3.min_abs, batch_count=1398534.6666666667, ans=0.2 2023-10-13 13:39:09,182 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1398581.3333333333, ans=0.1 2023-10-13 13:39:20,654 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.434e+02 1.727e+02 1.864e+02 2.167e+02 2.849e+02, threshold=3.729e+02, percent-clipped=0.0 2023-10-13 13:39:30,034 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1398674.6666666667, ans=0.0 2023-10-13 13:39:49,237 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1398721.3333333333, ans=0.125 2023-10-13 13:40:08,881 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1398814.6666666667, ans=0.125 2023-10-13 13:40:17,200 INFO [train.py:1031] (3/4) Epoch 22, batch 13000, loss[loss=0.1867, simple_loss=0.2765, pruned_loss=0.04841, over 16553.00 frames. ], tot_loss[loss=0.1889, simple_loss=0.281, pruned_loss=0.04844, over 32745244.31 frames. ], batch size: 241, lr: 1.55e-03, grad_scale: 16.0 2023-10-13 13:40:17,366 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1398861.3333333333, ans=0.125 2023-10-13 13:40:47,497 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1398954.6666666667, ans=0.1 2023-10-13 13:40:49,064 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1398954.6666666667, ans=0.125 2023-10-13 13:41:08,022 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1399001.3333333333, ans=0.09899494936611666 2023-10-13 13:41:10,842 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 13:41:11,055 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.08 vs. 
limit=15.0 2023-10-13 13:41:17,178 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1399048.0, ans=0.2 2023-10-13 13:41:30,143 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.414e+02 1.800e+02 1.954e+02 2.071e+02 3.098e+02, threshold=3.907e+02, percent-clipped=0.0 2023-10-13 13:41:48,718 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=1399141.3333333333, ans=0.95 2023-10-13 13:41:56,434 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1399188.0, ans=0.1 2023-10-13 13:42:17,317 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1399234.6666666667, ans=0.1 2023-10-13 13:42:39,980 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1399328.0, ans=0.125 2023-10-13 13:43:13,559 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.61 vs. limit=15.0 2023-10-13 13:43:25,491 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=1399514.6666666667, ans=0.025 2023-10-13 13:43:27,697 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1399561.3333333333, ans=0.0 2023-10-13 13:43:30,739 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.518e+02 1.806e+02 1.982e+02 2.182e+02 3.367e+02, threshold=3.964e+02, percent-clipped=0.0 2023-10-13 13:43:31,151 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1399561.3333333333, ans=0.0 2023-10-13 13:43:37,793 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1399561.3333333333, ans=0.125 2023-10-13 13:43:40,884 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.62 vs. limit=15.0 2023-10-13 13:43:46,155 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1399608.0, ans=0.125 2023-10-13 13:43:50,588 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1399608.0, ans=0.125 2023-10-13 13:44:11,150 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.35 vs. limit=10.0 2023-10-13 13:44:22,645 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1399748.0, ans=0.125 2023-10-13 13:44:25,354 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1399748.0, ans=0.125 2023-10-13 13:44:26,382 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1399748.0, ans=0.2 2023-10-13 13:44:55,381 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.60 vs. 
limit=15.0 2023-10-13 13:45:00,645 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1399888.0, ans=0.0 2023-10-13 13:45:11,290 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1399934.6666666667, ans=0.125 2023-10-13 13:45:17,321 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1399934.6666666667, ans=0.5 2023-10-13 13:45:20,172 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1399934.6666666667, ans=0.125 2023-10-13 13:45:33,686 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1399981.3333333333, ans=0.1 2023-10-13 13:45:35,072 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.04 vs. limit=15.0 2023-10-13 13:45:37,443 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.452e+02 1.755e+02 1.956e+02 2.155e+02 2.813e+02, threshold=3.912e+02, percent-clipped=0.0 2023-10-13 13:45:39,860 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1400028.0, ans=0.125 2023-10-13 13:45:43,711 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.73 vs. limit=10.0 2023-10-13 13:46:21,421 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1400168.0, ans=0.1 2023-10-13 13:46:47,428 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1400308.0, ans=0.1 2023-10-13 13:46:59,634 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1400354.6666666667, ans=0.125 2023-10-13 13:47:08,591 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.40 vs. limit=15.0 2023-10-13 13:47:14,405 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1400401.3333333333, ans=0.0 2023-10-13 13:47:24,330 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.76 vs. limit=15.0 2023-10-13 13:47:31,262 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1400448.0, ans=0.125 2023-10-13 13:47:33,433 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.51 vs. 
limit=15.0 2023-10-13 13:47:35,504 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.518e+02 1.785e+02 1.961e+02 2.123e+02 2.663e+02, threshold=3.922e+02, percent-clipped=0.0 2023-10-13 13:47:53,738 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1400541.3333333333, ans=0.1 2023-10-13 13:47:54,132 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.73 vs. limit=6.0 2023-10-13 13:47:57,525 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.87 vs. limit=15.0 2023-10-13 13:48:04,860 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1400588.0, ans=0.2 2023-10-13 13:48:22,821 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1400681.3333333333, ans=0.1 2023-10-13 13:48:28,331 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1400681.3333333333, ans=0.1 2023-10-13 13:48:30,176 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1400728.0, ans=0.125 2023-10-13 13:49:00,584 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.34 vs. limit=10.0 2023-10-13 13:49:30,912 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.427e+02 1.708e+02 1.866e+02 2.086e+02 2.738e+02, threshold=3.731e+02, percent-clipped=0.0 2023-10-13 13:49:54,463 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1401054.6666666667, ans=0.0 2023-10-13 13:49:58,840 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1401054.6666666667, ans=0.0 2023-10-13 13:49:59,271 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.74 vs. limit=15.0 2023-10-13 13:50:10,296 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1401101.3333333333, ans=0.125 2023-10-13 13:50:23,988 INFO [train.py:1031] (3/4) Epoch 22, batch 13500, loss[loss=0.1802, simple_loss=0.2721, pruned_loss=0.04418, over 15914.00 frames. ], tot_loss[loss=0.1883, simple_loss=0.2803, pruned_loss=0.04818, over 32766339.14 frames. ], batch size: 43, lr: 1.55e-03, grad_scale: 32.0 2023-10-13 13:50:26,345 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.01 vs. limit=22.5 2023-10-13 13:50:33,915 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.06 vs. 
limit=15.0 2023-10-13 13:50:39,979 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1401241.3333333333, ans=0.125 2023-10-13 13:50:39,995 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1401241.3333333333, ans=0.125 2023-10-13 13:50:46,300 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1401288.0, ans=0.07 2023-10-13 13:50:47,737 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=1401288.0, ans=0.05 2023-10-13 13:50:55,534 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.34 vs. limit=15.0 2023-10-13 13:51:22,639 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.472e+02 1.796e+02 1.980e+02 2.159e+02 2.705e+02, threshold=3.959e+02, percent-clipped=0.0 2023-10-13 13:51:43,421 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1401521.3333333333, ans=0.0 2023-10-13 13:51:57,947 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1401568.0, ans=0.0 2023-10-13 13:52:01,231 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1401568.0, ans=0.125 2023-10-13 13:52:07,385 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.22 vs. limit=10.0 2023-10-13 13:52:10,245 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1401614.6666666667, ans=0.125 2023-10-13 13:52:19,528 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.68 vs. limit=15.0 2023-10-13 13:53:12,572 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.514e+02 1.755e+02 1.915e+02 2.199e+02 3.264e+02, threshold=3.831e+02, percent-clipped=0.0 2023-10-13 13:53:55,808 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.69 vs. limit=15.0 2023-10-13 13:53:56,019 INFO [train.py:1031] (3/4) Epoch 23, batch 0, loss[loss=0.1579, simple_loss=0.2294, pruned_loss=0.04316, over 12613.00 frames. ], tot_loss[loss=0.1579, simple_loss=0.2294, pruned_loss=0.04316, over 12613.00 frames. ], batch size: 440, lr: 1.51e-03, grad_scale: 32.0 2023-10-13 13:53:56,020 INFO [train.py:1054] (3/4) Computing validation loss 2023-10-13 13:54:05,479 INFO [train.py:1063] (3/4) Epoch 23, validation: loss=0.2135, simple_loss=0.3003, pruned_loss=0.06333, over 1020973.00 frames. 
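The validation entries just above are per-frame averages: each figure is an accumulated total divided by the number of acoustic frames it covers, and the combined loss agrees with 0.5 * simple_loss + pruned_loss (0.5 * 0.3003 + 0.06333 is about 0.2135). A minimal sketch of that bookkeeping, with `model`, `dev_loader`, and `compute_loss` as hypothetical stand-ins rather than the actual train.py helpers:

```python
# Minimal sketch of frame-weighted validation-loss aggregation (assumed
# reconstruction, not the real train.py code). `compute_loss` is a
# hypothetical helper returning a batch's summed loss and its frame count.
import torch

def validation_loss(model, dev_loader, compute_loss):
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    with torch.no_grad():
        for batch in dev_loader:
            loss_sum, num_frames = compute_loss(model, batch)
            tot_loss += float(loss_sum)
            tot_frames += num_frames
    model.train()
    # Reported as "validation: loss=<tot_loss/tot_frames> over <tot_frames> frames."
    return tot_loss / tot_frames
```

Weighting by frames rather than averaging per-batch values is what keeps the "over 1020973.00 frames" figure comparable across buckets with very different utterance lengths.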
2023-10-13 13:54:05,480 INFO [train.py:1064] (3/4) Maximum memory allocated so far is 16953MB 2023-10-13 13:54:08,874 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1401918.0, ans=0.1 2023-10-13 13:54:09,838 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1401918.0, ans=0.04949747468305833 2023-10-13 13:54:28,320 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1401964.6666666667, ans=0.0 2023-10-13 13:54:32,510 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.61 vs. limit=15.0 2023-10-13 13:54:33,624 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.60 vs. limit=15.0 2023-10-13 13:54:50,112 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1402058.0, ans=0.125 2023-10-13 13:55:14,628 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1402151.3333333333, ans=0.015 2023-10-13 13:55:23,343 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1402198.0, ans=0.125 2023-10-13 13:55:28,511 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1402198.0, ans=0.125 2023-10-13 13:56:00,707 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=1402338.0, ans=0.05 2023-10-13 13:56:01,690 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1402338.0, ans=0.125 2023-10-13 13:56:05,091 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.439e+02 1.713e+02 1.876e+02 2.108e+02 3.132e+02, threshold=3.751e+02, percent-clipped=0.0 2023-10-13 13:56:20,458 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=1402431.3333333333, ans=0.05 2023-10-13 13:56:21,196 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1402431.3333333333, ans=0.125 2023-10-13 13:57:04,867 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1402571.3333333333, ans=0.1 2023-10-13 13:57:08,089 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.90 vs. limit=12.0 2023-10-13 13:57:15,789 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1402618.0, ans=0.2 2023-10-13 13:57:19,555 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1402664.6666666667, ans=0.1 2023-10-13 13:57:28,140 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.19 vs. 
limit=15.0 2023-10-13 13:57:47,018 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1402758.0, ans=0.0 2023-10-13 13:57:47,048 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1402758.0, ans=0.125 2023-10-13 13:57:57,123 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.458e+02 1.828e+02 1.987e+02 2.140e+02 3.875e+02, threshold=3.974e+02, percent-clipped=1.0 2023-10-13 13:58:02,595 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1402851.3333333333, ans=0.0 2023-10-13 13:58:08,331 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.72 vs. limit=6.0 2023-10-13 13:58:10,507 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1402898.0, ans=0.0 2023-10-13 13:58:32,660 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1402991.3333333333, ans=0.125 2023-10-13 13:58:38,233 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1402991.3333333333, ans=0.015 2023-10-13 13:58:54,813 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1403038.0, ans=0.2 2023-10-13 13:59:06,956 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.79 vs. limit=15.0 2023-10-13 13:59:13,333 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.97 vs. limit=15.0 2023-10-13 13:59:21,165 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1403178.0, ans=0.2 2023-10-13 13:59:36,969 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1403224.6666666667, ans=0.125 2023-10-13 13:59:38,404 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.27 vs. 
limit=22.5 2023-10-13 13:59:53,261 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.484e+02 1.756e+02 1.940e+02 2.111e+02 2.645e+02, threshold=3.880e+02, percent-clipped=0.0 2023-10-13 14:00:38,346 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1403458.0, ans=0.1 2023-10-13 14:00:41,183 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1403458.0, ans=0.125 2023-10-13 14:01:00,423 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1403551.3333333333, ans=0.0 2023-10-13 14:01:00,476 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1403551.3333333333, ans=0.0 2023-10-13 14:01:04,260 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1403551.3333333333, ans=0.2 2023-10-13 14:01:14,744 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1403598.0, ans=0.125 2023-10-13 14:01:27,202 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1403691.3333333333, ans=0.1 2023-10-13 14:01:30,026 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1403691.3333333333, ans=0.1 2023-10-13 14:01:38,371 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1403738.0, ans=0.2 2023-10-13 14:01:45,687 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.458e+02 1.881e+02 2.038e+02 2.202e+02 2.818e+02, threshold=4.075e+02, percent-clipped=0.0 2023-10-13 14:02:02,916 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1403831.3333333333, ans=10.0 2023-10-13 14:02:05,693 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.50 vs. limit=10.0 2023-10-13 14:02:07,451 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1403831.3333333333, ans=0.1 2023-10-13 14:02:22,282 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1403878.0, ans=0.125 2023-10-13 14:03:00,434 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.68 vs. limit=15.0 2023-10-13 14:03:01,269 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1404018.0, ans=0.09899494936611666 2023-10-13 14:03:18,277 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.84 vs. 
limit=12.0 2023-10-13 14:03:41,457 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1404204.6666666667, ans=0.125 2023-10-13 14:03:41,474 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1404204.6666666667, ans=0.125 2023-10-13 14:03:48,940 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.561e+02 1.807e+02 1.926e+02 2.078e+02 3.017e+02, threshold=3.851e+02, percent-clipped=0.0 2023-10-13 14:03:50,708 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1404204.6666666667, ans=0.0 2023-10-13 14:03:53,059 INFO [train.py:1031] (3/4) Epoch 23, batch 500, loss[loss=0.1827, simple_loss=0.2757, pruned_loss=0.04482, over 16849.00 frames. ], tot_loss[loss=0.1892, simple_loss=0.281, pruned_loss=0.04876, over 7269210.36 frames. ], batch size: 72, lr: 1.51e-03, grad_scale: 32.0 2023-10-13 14:03:56,733 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.80 vs. limit=15.0 2023-10-13 14:04:22,104 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1404344.6666666667, ans=0.09899494936611666 2023-10-13 14:04:55,855 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1404484.6666666667, ans=0.125 2023-10-13 14:05:05,314 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.31 vs. limit=15.0 2023-10-13 14:05:13,244 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1404531.3333333333, ans=0.125 2023-10-13 14:05:17,954 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.38 vs. limit=6.0 2023-10-13 14:05:22,206 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.85 vs. 
limit=10.0 2023-10-13 14:05:33,377 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1404624.6666666667, ans=0.2 2023-10-13 14:05:33,401 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1404624.6666666667, ans=0.2 2023-10-13 14:05:48,554 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.392e+02 1.771e+02 1.937e+02 2.125e+02 2.637e+02, threshold=3.874e+02, percent-clipped=0.0 2023-10-13 14:06:04,207 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1404764.6666666667, ans=0.1 2023-10-13 14:06:06,056 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1404764.6666666667, ans=0.125 2023-10-13 14:06:10,131 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1404764.6666666667, ans=0.125 2023-10-13 14:06:23,666 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=1404858.0, ans=0.95 2023-10-13 14:06:51,219 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1404951.3333333333, ans=0.1 2023-10-13 14:06:55,847 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1404951.3333333333, ans=0.1 2023-10-13 14:07:36,822 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=9.95 vs. limit=22.5 2023-10-13 14:07:40,188 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.461e+02 1.801e+02 2.009e+02 2.314e+02 3.512e+02, threshold=4.019e+02, percent-clipped=0.0 2023-10-13 14:07:40,817 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.89 vs. limit=12.0 2023-10-13 14:07:48,271 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.28 vs. limit=12.0 2023-10-13 14:07:54,357 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.37 vs. limit=15.0 2023-10-13 14:08:03,591 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.86 vs. limit=6.0 2023-10-13 14:08:05,747 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1405278.0, ans=0.2 2023-10-13 14:08:24,347 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1405324.6666666667, ans=0.0 2023-10-13 14:08:24,355 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1405324.6666666667, ans=0.1 2023-10-13 14:08:39,859 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1405371.3333333333, ans=0.125 2023-10-13 14:08:44,869 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.46 vs. 
limit=12.0 2023-10-13 14:09:15,400 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.44 vs. limit=15.0 2023-10-13 14:09:23,164 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1405558.0, ans=0.0 2023-10-13 14:09:31,948 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.09 vs. limit=15.0 2023-10-13 14:09:34,864 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.461e+02 1.728e+02 1.872e+02 2.111e+02 2.709e+02, threshold=3.745e+02, percent-clipped=0.0 2023-10-13 14:09:50,288 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1405698.0, ans=0.0 2023-10-13 14:10:05,030 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1405744.6666666667, ans=0.125 2023-10-13 14:10:05,968 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1405744.6666666667, ans=0.07 2023-10-13 14:11:05,036 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1405931.3333333333, ans=0.1 2023-10-13 14:11:11,488 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1405978.0, ans=0.1 2023-10-13 14:11:22,279 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.41 vs. limit=12.0 2023-10-13 14:11:41,561 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.549e+02 1.793e+02 1.943e+02 2.050e+02 2.533e+02, threshold=3.887e+02, percent-clipped=0.0 2023-10-13 14:11:50,774 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1406118.0, ans=0.0 2023-10-13 14:12:01,096 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1406164.6666666667, ans=0.125 2023-10-13 14:12:14,619 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1406211.3333333333, ans=0.125 2023-10-13 14:12:29,050 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1406258.0, ans=0.2 2023-10-13 14:12:31,759 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1406304.6666666667, ans=0.125 2023-10-13 14:12:52,550 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1406351.3333333333, ans=0.125 2023-10-13 14:12:58,857 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1406398.0, ans=0.0 2023-10-13 14:12:58,967 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1406398.0, ans=0.1 2023-10-13 14:13:20,807 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1406444.6666666667, ans=0.0 2023-10-13 14:13:44,082 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.358e+02 1.685e+02 1.837e+02 2.058e+02 2.519e+02, threshold=3.674e+02, percent-clipped=0.0
2023-10-13 14:13:45,971 INFO [train.py:1031] (3/4) Epoch 23, batch 1000, loss[loss=0.2014, simple_loss=0.2973, pruned_loss=0.05273, over 16843.00 frames. ], tot_loss[loss=0.1893, simple_loss=0.2812, pruned_loss=0.0487, over 12935140.46 frames. ], batch size: 175, lr: 1.51e-03, grad_scale: 16.0 2023-10-13 14:13:59,684 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1406631.3333333333, ans=0.1 2023-10-13 14:14:01,869 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1406631.3333333333, ans=0.0 2023-10-13 14:14:17,065 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1406724.6666666667, ans=0.1 2023-10-13 14:14:34,204 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1406771.3333333333, ans=0.0 2023-10-13 14:14:47,520 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1406818.0, ans=0.0 2023-10-13 14:14:49,437 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1406864.6666666667, ans=0.0 2023-10-13 14:14:51,434 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1406864.6666666667, ans=0.125 2023-10-13 14:15:04,067 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1406911.3333333333, ans=0.0 2023-10-13 14:15:27,950 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.36 vs. limit=15.0 2023-10-13 14:15:31,351 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1407004.6666666667, ans=0.125 2023-10-13 14:15:32,058 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.438e+02 1.755e+02 1.983e+02 2.176e+02 2.671e+02, threshold=3.966e+02, percent-clipped=0.0 2023-10-13 14:15:46,106 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 14:16:22,730 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1407191.3333333333, ans=0.125 2023-10-13 14:16:29,360 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1407238.0, ans=0.125 2023-10-13 14:16:29,464 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1407238.0, ans=0.125 2023-10-13 14:16:31,409 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1407238.0, ans=0.125 2023-10-13 14:16:33,707 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.43 vs.
limit=15.0 2023-10-13 14:16:43,483 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten.whitening_limit, batch_count=1407284.6666666667, ans=15.0 2023-10-13 14:16:55,439 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=1407331.3333333333, ans=0.025 2023-10-13 14:16:57,474 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1407331.3333333333, ans=0.125 2023-10-13 14:17:21,782 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.83 vs. limit=15.0 2023-10-13 14:17:22,276 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1407424.6666666667, ans=0.05 2023-10-13 14:17:22,312 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1407424.6666666667, ans=0.0 2023-10-13 14:17:32,794 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1407471.3333333333, ans=0.0 2023-10-13 14:17:37,972 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1407471.3333333333, ans=0.125 2023-10-13 14:17:38,045 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1407471.3333333333, ans=0.125 2023-10-13 14:17:41,383 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.428e+02 1.722e+02 1.946e+02 2.218e+02 2.802e+02, threshold=3.893e+02, percent-clipped=0.0 2023-10-13 14:17:46,284 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.51 vs. limit=15.0 2023-10-13 14:17:47,265 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.46 vs. limit=22.5 2023-10-13 14:18:07,952 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1407611.3333333333, ans=0.125 2023-10-13 14:18:12,742 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.72 vs. limit=15.0 2023-10-13 14:18:15,177 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1407611.3333333333, ans=0.125 2023-10-13 14:18:25,511 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.92 vs. limit=15.0 2023-10-13 14:18:50,595 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1407798.0, ans=0.125 2023-10-13 14:18:51,896 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.71 vs. 
limit=15.0 2023-10-13 14:18:54,412 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=1407798.0, ans=0.025 2023-10-13 14:19:03,584 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1407844.6666666667, ans=0.125 2023-10-13 14:19:10,709 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1407844.6666666667, ans=0.125 2023-10-13 14:19:34,012 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.738e+02 1.885e+02 2.137e+02 3.505e+02, threshold=3.770e+02, percent-clipped=0.0 2023-10-13 14:19:37,855 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1407984.6666666667, ans=0.125 2023-10-13 14:20:11,212 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1408124.6666666667, ans=0.125 2023-10-13 14:20:14,332 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.73 vs. limit=15.0 2023-10-13 14:20:18,620 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.28 vs. limit=15.0 2023-10-13 14:20:21,920 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.58 vs. limit=6.0 2023-10-13 14:20:49,583 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1408264.6666666667, ans=0.125 2023-10-13 14:21:11,548 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1408358.0, ans=0.2 2023-10-13 14:21:13,507 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.28 vs. limit=15.0 2023-10-13 14:21:24,610 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.473e+02 1.766e+02 1.943e+02 2.105e+02 3.134e+02, threshold=3.886e+02, percent-clipped=0.0 2023-10-13 14:21:31,905 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1408451.3333333333, ans=0.125 2023-10-13 14:21:40,390 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.10 vs. limit=10.0 2023-10-13 14:21:46,038 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1408498.0, ans=0.0 2023-10-13 14:21:46,180 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.30 vs. limit=15.0 2023-10-13 14:21:48,122 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.29 vs. 
limit=15.0 2023-10-13 14:21:49,048 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1408544.6666666667, ans=0.125 2023-10-13 14:22:02,798 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1408591.3333333333, ans=0.0 2023-10-13 14:22:05,296 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1408591.3333333333, ans=0.125 2023-10-13 14:22:11,691 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1408591.3333333333, ans=0.1 2023-10-13 14:22:27,621 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.06 vs. limit=15.0 2023-10-13 14:22:53,896 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_na.min_abs, batch_count=1408778.0, ans=0.02 2023-10-13 14:22:56,049 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 14:23:04,054 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1408824.6666666667, ans=0.125 2023-10-13 14:23:16,236 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1408824.6666666667, ans=0.125 2023-10-13 14:23:21,402 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1408871.3333333333, ans=0.2 2023-10-13 14:23:28,156 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1408871.3333333333, ans=0.1 2023-10-13 14:23:29,253 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1408871.3333333333, ans=0.07 2023-10-13 14:23:29,851 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.402e+02 1.687e+02 1.879e+02 2.091e+02 2.743e+02, threshold=3.757e+02, percent-clipped=0.0 2023-10-13 14:23:30,972 INFO [train.py:1031] (3/4) Epoch 23, batch 1500, loss[loss=0.1907, simple_loss=0.285, pruned_loss=0.04823, over 16913.00 frames. ], tot_loss[loss=0.1877, simple_loss=0.2794, pruned_loss=0.048, over 17305966.52 frames. ], batch size: 165, lr: 1.51e-03, grad_scale: 16.0 2023-10-13 14:23:31,131 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1408918.0, ans=0.125 2023-10-13 14:23:42,297 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.78 vs. 
limit=15.0 2023-10-13 14:23:50,378 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1408964.6666666667, ans=0.125 2023-10-13 14:24:00,152 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1409011.3333333333, ans=0.125 2023-10-13 14:24:07,593 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1409058.0, ans=0.125 2023-10-13 14:24:16,094 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1409104.6666666667, ans=0.125 2023-10-13 14:24:25,113 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1409104.6666666667, ans=0.1 2023-10-13 14:24:27,526 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1409151.3333333333, ans=0.0 2023-10-13 14:24:39,633 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1409198.0, ans=0.0 2023-10-13 14:24:43,587 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.84 vs. limit=15.0 2023-10-13 14:25:30,083 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.485e+02 1.734e+02 1.871e+02 2.078e+02 2.945e+02, threshold=3.741e+02, percent-clipped=0.0 2023-10-13 14:25:30,686 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.71 vs. limit=15.0 2023-10-13 14:25:41,038 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=7.54 vs. limit=15.0 2023-10-13 14:26:05,837 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1409524.6666666667, ans=0.1 2023-10-13 14:26:12,747 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1409524.6666666667, ans=0.125 2023-10-13 14:26:54,732 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1409664.6666666667, ans=0.2 2023-10-13 14:27:13,717 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1409758.0, ans=0.125 2023-10-13 14:27:29,262 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.499e+02 1.858e+02 2.034e+02 2.316e+02 3.427e+02, threshold=4.069e+02, percent-clipped=0.0 2023-10-13 14:27:54,039 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1409944.6666666667, ans=0.125 2023-10-13 14:28:10,142 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.65 vs. 
limit=6.0 2023-10-13 14:28:18,466 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1410038.0, ans=0.0 2023-10-13 14:28:34,798 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1410084.6666666667, ans=0.125 2023-10-13 14:29:01,116 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1410224.6666666667, ans=0.125 2023-10-13 14:29:05,271 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1410224.6666666667, ans=0.0 2023-10-13 14:29:22,079 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.61 vs. limit=15.0 2023-10-13 14:29:27,311 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.527e+02 1.797e+02 1.978e+02 2.189e+02 3.307e+02, threshold=3.956e+02, percent-clipped=0.0 2023-10-13 14:29:35,023 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1410318.0, ans=0.0 2023-10-13 14:29:41,725 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1410364.6666666667, ans=0.2 2023-10-13 14:30:02,493 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1410458.0, ans=0.0 2023-10-13 14:30:21,393 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.84 vs. limit=22.5 2023-10-13 14:30:22,579 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.18 vs. limit=12.0 2023-10-13 14:30:41,369 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1410598.0, ans=0.1 2023-10-13 14:30:46,121 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.03 vs. limit=15.0 2023-10-13 14:30:54,493 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1410644.6666666667, ans=0.0 2023-10-13 14:30:56,629 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.83 vs. 
limit=6.0 2023-10-13 14:31:11,197 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1410738.0, ans=0.125 2023-10-13 14:31:13,407 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1410738.0, ans=0.1 2023-10-13 14:31:19,786 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.484e+02 1.721e+02 1.939e+02 2.209e+02 3.132e+02, threshold=3.878e+02, percent-clipped=0.0 2023-10-13 14:31:22,905 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1410784.6666666667, ans=0.125 2023-10-13 14:31:29,020 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1410784.6666666667, ans=0.0 2023-10-13 14:31:38,689 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1410831.3333333333, ans=0.125 2023-10-13 14:31:51,203 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1410924.6666666667, ans=0.125 2023-10-13 14:32:36,151 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=6.13 vs. limit=15.0 2023-10-13 14:32:45,756 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.68 vs. limit=15.0 2023-10-13 14:32:48,676 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1411064.6666666667, ans=0.0 2023-10-13 14:32:58,336 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1411111.3333333333, ans=0.125 2023-10-13 14:33:02,551 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1411158.0, ans=0.125 2023-10-13 14:33:05,229 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1411158.0, ans=0.125 2023-10-13 14:33:05,254 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1411158.0, ans=0.05 2023-10-13 14:33:10,180 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1411158.0, ans=0.125 2023-10-13 14:33:22,101 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.36 vs. limit=15.0 2023-10-13 14:33:23,666 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1411204.6666666667, ans=0.0 2023-10-13 14:33:31,325 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.497e+02 1.768e+02 1.960e+02 2.183e+02 2.979e+02, threshold=3.920e+02, percent-clipped=0.0 2023-10-13 14:33:31,353 INFO [train.py:1031] (3/4) Epoch 23, batch 2000, loss[loss=0.2131, simple_loss=0.3028, pruned_loss=0.06167, over 16089.00 frames. ], tot_loss[loss=0.1885, simple_loss=0.2806, pruned_loss=0.04824, over 20782468.46 frames. 
], batch size: 296, lr: 1.51e-03, grad_scale: 32.0 2023-10-13 14:33:58,267 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1411298.0, ans=0.0 2023-10-13 14:34:13,072 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1411344.6666666667, ans=0.1 2023-10-13 14:34:17,993 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1411344.6666666667, ans=0.0 2023-10-13 14:34:41,119 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1411438.0, ans=0.125 2023-10-13 14:34:48,449 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=3.73 vs. limit=15.0 2023-10-13 14:34:58,675 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 14:35:23,752 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.76 vs. limit=6.0 2023-10-13 14:35:40,231 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1411671.3333333333, ans=0.0 2023-10-13 14:35:40,339 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1411671.3333333333, ans=0.125 2023-10-13 14:35:53,556 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.460e+02 1.763e+02 1.946e+02 2.151e+02 2.790e+02, threshold=3.893e+02, percent-clipped=0.0 2023-10-13 14:36:37,544 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=1411811.3333333333, ans=15.0 2023-10-13 14:36:39,006 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1411811.3333333333, ans=0.0 2023-10-13 14:37:16,631 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1411951.3333333333, ans=0.125 2023-10-13 14:37:26,104 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1411998.0, ans=0.125 2023-10-13 14:37:37,363 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1412044.6666666667, ans=0.2 2023-10-13 14:37:59,574 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.60 vs. 
limit=22.5 2023-10-13 14:38:08,094 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1412138.0, ans=0.0 2023-10-13 14:38:13,998 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.517e+02 1.852e+02 1.976e+02 2.212e+02 3.114e+02, threshold=3.951e+02, percent-clipped=0.0 2023-10-13 14:39:12,009 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1412418.0, ans=0.0 2023-10-13 14:39:57,437 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1412558.0, ans=0.0 2023-10-13 14:39:58,551 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1412558.0, ans=0.2 2023-10-13 14:40:05,055 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1412604.6666666667, ans=0.0 2023-10-13 14:40:14,186 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.497e+02 1.788e+02 1.970e+02 2.119e+02 3.374e+02, threshold=3.939e+02, percent-clipped=0.0 2023-10-13 14:40:22,523 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1412651.3333333333, ans=0.125 2023-10-13 14:40:25,852 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1412698.0, ans=0.035 2023-10-13 14:40:41,153 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 14:40:52,668 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.17 vs. limit=12.0 2023-10-13 14:40:53,922 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1412791.3333333333, ans=0.0 2023-10-13 14:41:05,979 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.16 vs. limit=15.0 2023-10-13 14:41:28,269 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 14:41:42,740 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1413024.6666666667, ans=0.2 2023-10-13 14:41:46,558 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1413024.6666666667, ans=0.125 2023-10-13 14:41:47,946 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=3.75 vs. limit=15.0 2023-10-13 14:42:05,928 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.533e+02 1.816e+02 2.001e+02 2.316e+02 2.970e+02, threshold=4.002e+02, percent-clipped=0.0 2023-10-13 14:42:15,252 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1413164.6666666667, ans=0.0 2023-10-13 14:43:19,965 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1413398.0, ans=0.125 2023-10-13 14:43:54,032 INFO [train.py:1031] (3/4) Epoch 23, batch 2500, loss[loss=0.1752, simple_loss=0.2643, pruned_loss=0.0431, over 16141.00 frames. ], tot_loss[loss=0.1886, simple_loss=0.2808, pruned_loss=0.04821, over 23489427.45 frames. ], batch size: 43, lr: 1.51e-03, grad_scale: 32.0
2023-10-13 14:43:54,933 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.497e+02 1.813e+02 1.957e+02 2.130e+02 3.349e+02, threshold=3.914e+02, percent-clipped=0.0 2023-10-13 14:43:59,366 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1413584.6666666667, ans=0.125 2023-10-13 14:44:17,860 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1413678.0, ans=0.125 2023-10-13 14:44:20,591 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1413678.0, ans=0.2 2023-10-13 14:44:24,769 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-13 14:44:31,094 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1413724.6666666667, ans=0.125 2023-10-13 14:44:44,740 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 14:44:52,450 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1413818.0, ans=0.0 2023-10-13 14:44:52,482 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1413818.0, ans=0.125 2023-10-13 14:44:54,512 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1413818.0, ans=0.05 2023-10-13 14:44:54,624 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1413818.0, ans=0.125 2023-10-13 14:45:07,483 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1413911.3333333333, ans=0.125 2023-10-13 14:45:43,502 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.515e+02 1.827e+02 1.995e+02 2.238e+02 2.966e+02, threshold=3.989e+02, percent-clipped=0.0 2023-10-13 14:46:05,932 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1414144.6666666667, ans=0.2 2023-10-13 14:46:17,001 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1414191.3333333333, ans=0.1 2023-10-13 14:46:38,360 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.min_positive, batch_count=1414284.6666666667, ans=0.05 2023-10-13 14:47:16,832 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1414424.6666666667, ans=0.125 2023-10-13 14:47:34,451 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1414518.0, ans=0.2 2023-10-13 14:47:36,290 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.517e+02 1.791e+02 1.910e+02 2.109e+02 2.876e+02, threshold=3.819e+02, percent-clipped=0.0 2023-10-13 14:47:56,869 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.05 vs. limit=15.0
2023-10-13 14:47:59,684 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.69 vs. limit=22.5 2023-10-13 14:48:10,814 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1414658.0, ans=0.0 2023-10-13 14:48:17,990 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1414658.0, ans=0.1 2023-10-13 14:48:33,214 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1414704.6666666667, ans=0.125 2023-10-13 14:48:35,216 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.09 vs. limit=15.0 2023-10-13 14:48:58,109 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1414798.0, ans=0.2 2023-10-13 14:49:02,645 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1414798.0, ans=0.125 2023-10-13 14:49:13,494 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1414844.6666666667, ans=0.0 2023-10-13 14:49:14,501 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=1414844.6666666667, ans=0.05 2023-10-13 14:49:27,871 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1414891.3333333333, ans=0.125 2023-10-13 14:49:44,045 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.456e+02 1.750e+02 1.869e+02 2.071e+02 2.548e+02, threshold=3.738e+02, percent-clipped=0.0 2023-10-13 14:49:53,743 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.33 vs. limit=22.5 2023-10-13 14:50:02,719 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1415031.3333333333, ans=0.1 2023-10-13 14:50:05,441 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=3.92 vs.
limit=10.0 2023-10-13 14:50:38,704 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1415171.3333333333, ans=0.125 2023-10-13 14:50:43,686 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1415171.3333333333, ans=0.0 2023-10-13 14:50:52,605 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-13 14:51:03,306 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 14:51:25,928 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1415358.0, ans=0.125 2023-10-13 14:51:35,132 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1415358.0, ans=0.125 2023-10-13 14:51:40,735 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.86 vs. limit=6.0 2023-10-13 14:51:46,330 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1415404.6666666667, ans=0.0 2023-10-13 14:51:54,564 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1415451.3333333333, ans=0.125 2023-10-13 14:51:55,182 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.482e+02 1.787e+02 1.966e+02 2.219e+02 3.154e+02, threshold=3.932e+02, percent-clipped=0.0 2023-10-13 14:52:11,478 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1415498.0, ans=0.125 2023-10-13 14:52:34,990 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1415591.3333333333, ans=0.125 2023-10-13 14:52:49,282 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1415638.0, ans=0.125 2023-10-13 14:53:03,143 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1415731.3333333333, ans=10.0 2023-10-13 14:53:04,581 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1415731.3333333333, ans=0.125 2023-10-13 14:53:07,822 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1415731.3333333333, ans=0.125 2023-10-13 14:53:12,681 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1415731.3333333333, ans=0.0 2023-10-13 14:53:35,052 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1415871.3333333333, ans=0.2 2023-10-13 14:53:42,809 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1415871.3333333333, ans=0.125 2023-10-13 14:53:45,553 INFO [train.py:1031] (3/4) Epoch 23, batch 3000, loss[loss=0.1792, simple_loss=0.2781, pruned_loss=0.04011, over 16834.00 frames. ], tot_loss[loss=0.1882, simple_loss=0.2799, pruned_loss=0.04829, over 25520522.12 frames. 
], batch size: 98, lr: 1.51e-03, grad_scale: 32.0 2023-10-13 14:53:46,471 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.466e+02 1.789e+02 1.927e+02 2.117e+02 2.963e+02, threshold=3.853e+02, percent-clipped=0.0 2023-10-13 14:53:53,228 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1415918.0, ans=0.0 2023-10-13 14:53:59,551 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=10.24 vs. limit=22.5 2023-10-13 14:54:00,233 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1415964.6666666667, ans=0.0 2023-10-13 14:54:00,436 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.22 vs. limit=15.0 2023-10-13 14:54:19,533 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1416058.0, ans=0.1 2023-10-13 14:54:22,712 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1416058.0, ans=0.2 2023-10-13 14:54:26,983 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1416058.0, ans=0.125 2023-10-13 14:54:33,188 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1416104.6666666667, ans=0.04949747468305833 2023-10-13 14:54:39,133 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1416104.6666666667, ans=0.1 2023-10-13 14:54:55,457 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=4.46 vs. limit=15.0 2023-10-13 14:55:04,439 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1416244.6666666667, ans=0.5 2023-10-13 14:55:14,965 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1416291.3333333333, ans=0.1 2023-10-13 14:55:21,945 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1416291.3333333333, ans=0.0 2023-10-13 14:55:25,839 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1416291.3333333333, ans=0.0 2023-10-13 14:55:43,773 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.580e+02 1.951e+02 2.200e+02 2.557e+02 3.361e+02, threshold=4.399e+02, percent-clipped=0.0 2023-10-13 14:55:51,536 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1416384.6666666667, ans=0.125 2023-10-13 14:56:02,793 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.05 vs. limit=15.0 2023-10-13 14:56:02,832 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.28 vs. 
limit=15.0 2023-10-13 14:56:08,618 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.95 vs. limit=22.5 2023-10-13 14:56:10,940 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.95 vs. limit=15.0 2023-10-13 14:56:18,397 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=1416478.0, ans=15.0 2023-10-13 14:56:24,335 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.34 vs. limit=15.0 2023-10-13 14:56:31,278 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1416571.3333333333, ans=0.125 2023-10-13 14:56:56,077 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1416664.6666666667, ans=0.0 2023-10-13 14:57:14,829 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1416758.0, ans=0.125 2023-10-13 14:57:21,611 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.48 vs. limit=22.5 2023-10-13 14:57:23,157 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1416758.0, ans=0.125 2023-10-13 14:57:36,159 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1416851.3333333333, ans=0.125 2023-10-13 14:57:36,647 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.554e+02 1.744e+02 1.939e+02 2.110e+02 2.800e+02, threshold=3.877e+02, percent-clipped=0.0 2023-10-13 14:57:38,563 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1416851.3333333333, ans=0.0 2023-10-13 14:58:03,323 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1416944.6666666667, ans=0.125 2023-10-13 14:58:06,618 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.73 vs. limit=6.0 2023-10-13 14:58:28,852 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1416991.3333333333, ans=0.125 2023-10-13 14:59:12,425 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1417178.0, ans=0.07 2023-10-13 14:59:21,127 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1417224.6666666667, ans=0.0 2023-10-13 14:59:30,109 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.80 vs. limit=15.0 2023-10-13 14:59:37,080 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.19 vs. 
limit=15.0 2023-10-13 14:59:46,045 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.436e+02 1.821e+02 1.994e+02 2.176e+02 2.867e+02, threshold=3.988e+02, percent-clipped=0.0 2023-10-13 15:00:05,416 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1417364.6666666667, ans=0.125 2023-10-13 15:00:29,206 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1417504.6666666667, ans=0.125 2023-10-13 15:00:43,554 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.94 vs. limit=15.0 2023-10-13 15:01:11,133 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.58 vs. limit=15.0 2023-10-13 15:01:43,242 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.588e+02 1.801e+02 1.965e+02 2.181e+02 3.016e+02, threshold=3.929e+02, percent-clipped=0.0 2023-10-13 15:02:04,201 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1417831.3333333333, ans=0.2 2023-10-13 15:02:10,206 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1417878.0, ans=0.5 2023-10-13 15:02:26,883 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1417924.6666666667, ans=0.125 2023-10-13 15:02:26,886 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1417924.6666666667, ans=0.04949747468305833 2023-10-13 15:02:28,136 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.73 vs. limit=12.0 2023-10-13 15:02:29,585 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1417971.3333333333, ans=0.025 2023-10-13 15:02:31,302 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1417971.3333333333, ans=0.2 2023-10-13 15:02:43,526 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1418018.0, ans=0.125 2023-10-13 15:02:49,871 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 15:02:53,665 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1418064.6666666667, ans=0.125 2023-10-13 15:02:56,157 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1418064.6666666667, ans=0.125 2023-10-13 15:03:31,994 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.17 vs. limit=10.0 2023-10-13 15:03:36,928 INFO [train.py:1031] (3/4) Epoch 23, batch 3500, loss[loss=0.1893, simple_loss=0.2801, pruned_loss=0.04923, over 16637.00 frames. ], tot_loss[loss=0.1884, simple_loss=0.2799, pruned_loss=0.04844, over 27124394.96 frames. 
], batch size: 56, lr: 1.51e-03, grad_scale: 32.0 2023-10-13 15:03:39,024 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.533e+02 1.867e+02 2.006e+02 2.213e+02 3.297e+02, threshold=4.011e+02, percent-clipped=0.0 2023-10-13 15:03:39,362 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1418251.3333333333, ans=0.125 2023-10-13 15:03:56,245 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1418298.0, ans=0.125 2023-10-13 15:04:22,721 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1418438.0, ans=0.125 2023-10-13 15:04:29,501 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1418438.0, ans=0.025 2023-10-13 15:04:41,061 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1418484.6666666667, ans=0.2 2023-10-13 15:04:56,340 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=4.66 vs. limit=15.0 2023-10-13 15:04:57,640 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1418531.3333333333, ans=0.2 2023-10-13 15:05:01,353 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1418531.3333333333, ans=0.1 2023-10-13 15:05:20,705 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1418624.6666666667, ans=0.125 2023-10-13 15:05:21,409 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1418624.6666666667, ans=0.2 2023-10-13 15:05:26,759 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 15:05:27,596 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1418624.6666666667, ans=0.125 2023-10-13 15:05:46,060 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1418718.0, ans=0.125 2023-10-13 15:05:46,750 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.549e+02 1.771e+02 1.925e+02 2.185e+02 3.228e+02, threshold=3.850e+02, percent-clipped=0.0 2023-10-13 15:05:47,261 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=12.83 vs. limit=15.0 2023-10-13 15:05:48,102 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1418718.0, ans=0.125 2023-10-13 15:06:30,411 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.29 vs. limit=15.0 2023-10-13 15:06:50,751 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.25 vs. limit=15.0 2023-10-13 15:07:02,030 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.24 vs. 
limit=15.0 2023-10-13 15:07:12,082 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1419044.6666666667, ans=0.0 2023-10-13 15:07:17,299 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1419044.6666666667, ans=0.0 2023-10-13 15:07:24,789 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1419091.3333333333, ans=0.1 2023-10-13 15:07:29,878 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1419091.3333333333, ans=0.1 2023-10-13 15:07:38,319 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1419138.0, ans=0.125 2023-10-13 15:07:44,943 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.453e+02 1.705e+02 1.877e+02 2.098e+02 2.446e+02, threshold=3.754e+02, percent-clipped=0.0 2023-10-13 15:07:45,281 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1419184.6666666667, ans=0.125 2023-10-13 15:07:48,420 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1419184.6666666667, ans=0.0 2023-10-13 15:07:48,658 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.20 vs. limit=22.5 2023-10-13 15:07:57,534 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1419231.3333333333, ans=0.125 2023-10-13 15:07:57,681 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1419231.3333333333, ans=0.0 2023-10-13 15:08:06,974 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1419278.0, ans=0.2 2023-10-13 15:08:13,686 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1419278.0, ans=0.0 2023-10-13 15:08:39,931 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.35 vs. limit=15.0 2023-10-13 15:08:40,745 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1419371.3333333333, ans=0.125 2023-10-13 15:08:44,808 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.35 vs. limit=22.5 2023-10-13 15:08:56,761 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1419464.6666666667, ans=0.2 2023-10-13 15:09:01,245 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.83 vs. 
limit=12.0 2023-10-13 15:09:12,953 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1419511.3333333333, ans=0.1 2023-10-13 15:09:28,181 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1419558.0, ans=0.125 2023-10-13 15:09:30,333 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1419558.0, ans=0.2 2023-10-13 15:09:31,575 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1419604.6666666667, ans=0.125 2023-10-13 15:09:44,205 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1419651.3333333333, ans=0.0 2023-10-13 15:09:46,814 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.551e+02 1.758e+02 1.871e+02 2.118e+02 2.968e+02, threshold=3.743e+02, percent-clipped=0.0 2023-10-13 15:09:49,233 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1419651.3333333333, ans=0.125 2023-10-13 15:10:09,273 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1419744.6666666667, ans=0.0 2023-10-13 15:10:10,669 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 15:10:21,803 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1419791.3333333333, ans=0.09899494936611666 2023-10-13 15:10:22,785 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1419791.3333333333, ans=0.0 2023-10-13 15:10:46,616 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.69 vs. limit=6.0 2023-10-13 15:10:57,319 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1419931.3333333333, ans=0.2 2023-10-13 15:10:57,662 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.69 vs. limit=6.0 2023-10-13 15:11:00,926 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.23 vs. limit=6.0 2023-10-13 15:11:01,781 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1419931.3333333333, ans=0.0 2023-10-13 15:11:14,672 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1420024.6666666667, ans=0.1 2023-10-13 15:11:21,034 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.40 vs. limit=15.0 2023-10-13 15:11:28,368 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.05 vs. 
limit=12.0 2023-10-13 15:11:33,522 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1420071.3333333333, ans=0.125 2023-10-13 15:11:36,560 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1420071.3333333333, ans=0.1 2023-10-13 15:11:42,615 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.415e+02 1.737e+02 1.956e+02 2.141e+02 2.948e+02, threshold=3.913e+02, percent-clipped=0.0 2023-10-13 15:12:03,964 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1420211.3333333333, ans=0.125 2023-10-13 15:12:12,522 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1420258.0, ans=0.0 2023-10-13 15:12:17,892 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1420258.0, ans=0.0 2023-10-13 15:12:38,841 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1420351.3333333333, ans=0.125 2023-10-13 15:12:52,273 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1420398.0, ans=0.125 2023-10-13 15:12:57,475 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.78 vs. limit=15.0 2023-10-13 15:13:03,361 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1420444.6666666667, ans=0.125 2023-10-13 15:13:06,444 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 15:13:19,765 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1420491.3333333333, ans=0.0 2023-10-13 15:13:27,941 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1420538.0, ans=0.2 2023-10-13 15:13:32,658 INFO [train.py:1031] (3/4) Epoch 23, batch 4000, loss[loss=0.1874, simple_loss=0.2811, pruned_loss=0.04683, over 16936.00 frames. ], tot_loss[loss=0.1884, simple_loss=0.2796, pruned_loss=0.04857, over 28367208.41 frames. ], batch size: 138, lr: 1.50e-03, grad_scale: 32.0 2023-10-13 15:13:35,078 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.435e+02 1.782e+02 1.996e+02 2.148e+02 3.690e+02, threshold=3.992e+02, percent-clipped=0.0 2023-10-13 15:13:42,973 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1420584.6666666667, ans=0.125 2023-10-13 15:13:44,077 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1420584.6666666667, ans=0.1 2023-10-13 15:14:10,380 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.48 vs. 
limit=15.0 2023-10-13 15:14:15,419 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1420724.6666666667, ans=0.04949747468305833 2023-10-13 15:14:25,691 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.79 vs. limit=22.5 2023-10-13 15:14:32,157 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1420771.3333333333, ans=0.05 2023-10-13 15:14:37,733 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.40 vs. limit=6.0 2023-10-13 15:14:38,292 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1420818.0, ans=0.2 2023-10-13 15:14:39,217 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1420818.0, ans=0.125 2023-10-13 15:14:42,321 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1420818.0, ans=0.125 2023-10-13 15:14:43,326 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1420864.6666666667, ans=0.0 2023-10-13 15:14:59,054 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1420911.3333333333, ans=0.1 2023-10-13 15:15:03,847 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.27 vs. limit=22.5 2023-10-13 15:15:04,605 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1420911.3333333333, ans=0.125 2023-10-13 15:15:32,711 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.477e+02 1.815e+02 1.954e+02 2.238e+02 2.920e+02, threshold=3.907e+02, percent-clipped=0.0 2023-10-13 15:15:36,107 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1421051.3333333333, ans=0.95 2023-10-13 15:15:45,323 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 15:15:56,053 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1421144.6666666667, ans=0.125 2023-10-13 15:16:01,229 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.93 vs. 
limit=15.0 2023-10-13 15:16:04,171 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1421144.6666666667, ans=0.125 2023-10-13 15:16:12,725 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1421191.3333333333, ans=0.0 2023-10-13 15:16:20,329 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1421238.0, ans=0.0 2023-10-13 15:16:49,624 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=1421331.3333333333, ans=0.025 2023-10-13 15:17:01,596 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.29 vs. limit=10.0 2023-10-13 15:17:08,581 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1421378.0, ans=0.1 2023-10-13 15:17:33,217 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1421471.3333333333, ans=10.0 2023-10-13 15:17:40,908 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.531e+02 1.828e+02 2.022e+02 2.247e+02 3.126e+02, threshold=4.045e+02, percent-clipped=0.0 2023-10-13 15:17:53,540 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1421564.6666666667, ans=0.09899494936611666 2023-10-13 15:18:22,873 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1421658.0, ans=0.1 2023-10-13 15:19:19,688 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1421891.3333333333, ans=0.125 2023-10-13 15:19:28,847 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1421938.0, ans=0.125 2023-10-13 15:19:32,721 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1421938.0, ans=0.125 2023-10-13 15:19:42,191 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.511e+02 1.775e+02 1.955e+02 2.168e+02 3.437e+02, threshold=3.910e+02, percent-clipped=0.0 2023-10-13 15:20:12,620 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1422078.0, ans=0.035 2023-10-13 15:20:14,793 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1422124.6666666667, ans=0.125 2023-10-13 15:20:15,956 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1422124.6666666667, ans=0.125 2023-10-13 15:20:27,085 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1422171.3333333333, ans=0.1 2023-10-13 15:20:28,062 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1422171.3333333333, ans=0.2 2023-10-13 15:20:30,108 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1422171.3333333333, 
ans=0.0 2023-10-13 15:20:56,849 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1422264.6666666667, ans=0.1 2023-10-13 15:21:00,664 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1422264.6666666667, ans=0.125 2023-10-13 15:21:05,302 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1422311.3333333333, ans=0.1 2023-10-13 15:21:13,945 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.87 vs. limit=22.5 2023-10-13 15:21:19,736 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.07 vs. limit=22.5 2023-10-13 15:21:25,086 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1422358.0, ans=0.0 2023-10-13 15:21:30,955 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1422404.6666666667, ans=0.0 2023-10-13 15:21:38,836 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1422404.6666666667, ans=0.1 2023-10-13 15:21:45,112 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.684e+02 1.982e+02 2.189e+02 2.540e+02 3.990e+02, threshold=4.379e+02, percent-clipped=1.0 2023-10-13 15:21:45,580 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1422451.3333333333, ans=0.1 2023-10-13 15:21:50,952 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1422451.3333333333, ans=0.0 2023-10-13 15:22:09,736 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1422544.6666666667, ans=0.0 2023-10-13 15:22:19,443 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1422591.3333333333, ans=0.125 2023-10-13 15:22:45,669 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 15:23:10,250 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1422731.3333333333, ans=0.125 2023-10-13 15:23:28,908 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.70 vs. limit=6.0 2023-10-13 15:23:29,001 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.22 vs. limit=22.5 2023-10-13 15:23:54,734 INFO [train.py:1031] (3/4) Epoch 23, batch 4500, loss[loss=0.1547, simple_loss=0.253, pruned_loss=0.0282, over 16803.00 frames. ], tot_loss[loss=0.1881, simple_loss=0.2798, pruned_loss=0.04823, over 29366933.86 frames. 
], batch size: 98, lr: 1.50e-03, grad_scale: 32.0 2023-10-13 15:23:59,045 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.399e+02 1.821e+02 1.951e+02 2.134e+02 2.875e+02, threshold=3.901e+02, percent-clipped=0.0 2023-10-13 15:24:01,145 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1422918.0, ans=0.0 2023-10-13 15:24:02,075 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1422918.0, ans=0.1 2023-10-13 15:24:09,011 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1422964.6666666667, ans=0.125 2023-10-13 15:24:26,014 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1423011.3333333333, ans=0.1 2023-10-13 15:24:53,230 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.13 vs. limit=15.0 2023-10-13 15:24:53,811 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1423151.3333333333, ans=0.0 2023-10-13 15:25:21,257 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1423244.6666666667, ans=0.1 2023-10-13 15:25:24,252 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1423291.3333333333, ans=0.2 2023-10-13 15:25:26,185 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1423291.3333333333, ans=0.0 2023-10-13 15:25:35,997 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1423338.0, ans=0.0 2023-10-13 15:25:39,968 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 15:25:43,106 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1423338.0, ans=0.1 2023-10-13 15:25:46,705 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1423384.6666666667, ans=0.2 2023-10-13 15:25:46,758 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1423384.6666666667, ans=0.09899494936611666 2023-10-13 15:25:47,635 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1423384.6666666667, ans=0.125 2023-10-13 15:25:49,294 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.552e+02 1.807e+02 1.963e+02 2.156e+02 3.122e+02, threshold=3.925e+02, percent-clipped=0.0 2023-10-13 15:26:03,216 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.44 vs. 
limit=15.0 2023-10-13 15:26:09,851 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1423478.0, ans=0.0 2023-10-13 15:26:10,825 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1423478.0, ans=0.125 2023-10-13 15:26:27,558 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1423571.3333333333, ans=0.125 2023-10-13 15:26:32,877 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1423571.3333333333, ans=0.125 2023-10-13 15:26:39,342 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.99 vs. limit=12.0 2023-10-13 15:26:39,384 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.45 vs. limit=22.5 2023-10-13 15:26:48,603 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1423618.0, ans=0.125 2023-10-13 15:26:55,365 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1423664.6666666667, ans=0.125 2023-10-13 15:27:06,638 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1423711.3333333333, ans=0.0 2023-10-13 15:27:19,083 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1423758.0, ans=0.07 2023-10-13 15:27:30,057 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1423804.6666666667, ans=0.1 2023-10-13 15:27:39,129 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.387e+02 1.885e+02 2.091e+02 2.439e+02 4.031e+02, threshold=4.183e+02, percent-clipped=1.0 2023-10-13 15:27:51,103 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1423898.0, ans=0.1 2023-10-13 15:27:56,756 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1423944.6666666667, ans=0.0 2023-10-13 15:27:59,221 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1423944.6666666667, ans=0.1 2023-10-13 15:28:11,434 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.24 vs. limit=12.0 2023-10-13 15:28:15,731 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.74 vs. 
limit=15.0 2023-10-13 15:28:18,299 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1424038.0, ans=0.1 2023-10-13 15:28:32,784 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1424084.6666666667, ans=0.1 2023-10-13 15:28:37,646 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1424084.6666666667, ans=0.125 2023-10-13 15:28:42,807 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1424131.3333333333, ans=0.2 2023-10-13 15:28:56,984 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1424178.0, ans=0.125 2023-10-13 15:28:57,063 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1424178.0, ans=0.0 2023-10-13 15:29:09,785 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.07 vs. limit=15.0 2023-10-13 15:29:11,075 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1424224.6666666667, ans=0.1 2023-10-13 15:29:14,557 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 15:29:25,386 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1424318.0, ans=0.2 2023-10-13 15:29:26,559 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.555e+02 1.829e+02 2.045e+02 2.258e+02 3.006e+02, threshold=4.090e+02, percent-clipped=0.0 2023-10-13 15:29:35,520 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1424364.6666666667, ans=0.07 2023-10-13 15:30:17,385 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1424504.6666666667, ans=0.0 2023-10-13 15:30:50,777 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1424644.6666666667, ans=0.0 2023-10-13 15:30:53,195 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1424644.6666666667, ans=0.0 2023-10-13 15:31:16,593 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 15:31:17,722 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1424738.0, ans=0.125 2023-10-13 15:31:20,913 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.11 vs. 
limit=15.0 2023-10-13 15:31:26,675 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1424784.6666666667, ans=0.2 2023-10-13 15:31:27,272 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.470e+02 1.727e+02 1.911e+02 2.120e+02 2.693e+02, threshold=3.823e+02, percent-clipped=0.0 2023-10-13 15:31:43,956 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1424831.3333333333, ans=0.125 2023-10-13 15:32:01,658 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1424924.6666666667, ans=0.125 2023-10-13 15:32:20,539 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1424971.3333333333, ans=0.125 2023-10-13 15:32:25,705 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1425018.0, ans=0.125 2023-10-13 15:32:30,918 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1425018.0, ans=0.1 2023-10-13 15:32:38,646 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1425064.6666666667, ans=0.1 2023-10-13 15:32:58,879 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1425111.3333333333, ans=0.125 2023-10-13 15:33:25,059 INFO [train.py:1031] (3/4) Epoch 23, batch 5000, loss[loss=0.1708, simple_loss=0.2629, pruned_loss=0.03931, over 16898.00 frames. ], tot_loss[loss=0.1882, simple_loss=0.2797, pruned_loss=0.04839, over 30134257.90 frames. ], batch size: 72, lr: 1.50e-03, grad_scale: 32.0 2023-10-13 15:33:26,150 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1425251.3333333333, ans=0.0 2023-10-13 15:33:27,710 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.564e+02 1.874e+02 2.072e+02 2.282e+02 2.927e+02, threshold=4.144e+02, percent-clipped=0.0 2023-10-13 15:33:28,066 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1425251.3333333333, ans=0.1 2023-10-13 15:33:43,033 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1425298.0, ans=0.125 2023-10-13 15:33:43,074 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1425298.0, ans=0.0 2023-10-13 15:34:24,473 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1425438.0, ans=0.2 2023-10-13 15:34:31,168 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1425484.6666666667, ans=0.05 2023-10-13 15:34:34,159 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1425484.6666666667, ans=0.125 2023-10-13 15:34:39,270 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.94 vs. 
limit=15.0 2023-10-13 15:35:22,125 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=3.83 vs. limit=15.0 2023-10-13 15:35:36,228 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.529e+02 1.789e+02 1.982e+02 2.203e+02 2.970e+02, threshold=3.964e+02, percent-clipped=0.0 2023-10-13 15:36:02,961 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1425811.3333333333, ans=0.0 2023-10-13 15:36:14,193 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1425858.0, ans=0.125 2023-10-13 15:36:23,085 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1425858.0, ans=0.1 2023-10-13 15:36:28,322 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.97 vs. limit=15.0 2023-10-13 15:36:31,996 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1425904.6666666667, ans=0.09899494936611666 2023-10-13 15:36:50,342 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1425951.3333333333, ans=0.0 2023-10-13 15:36:51,432 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=10.63 vs. limit=22.5 2023-10-13 15:37:02,090 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1425998.0, ans=0.125 2023-10-13 15:37:34,510 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1426138.0, ans=0.0 2023-10-13 15:37:36,446 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1426138.0, ans=0.125 2023-10-13 15:37:36,584 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1426138.0, ans=0.0 2023-10-13 15:37:45,037 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.76 vs. limit=6.0 2023-10-13 15:37:46,178 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.556e+02 1.865e+02 2.029e+02 2.239e+02 3.831e+02, threshold=4.057e+02, percent-clipped=0.0 2023-10-13 15:37:53,242 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1426231.3333333333, ans=0.125 2023-10-13 15:37:56,379 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1426231.3333333333, ans=0.1 2023-10-13 15:38:04,250 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1426231.3333333333, ans=0.125 2023-10-13 15:39:00,362 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1426464.6666666667, ans=0.0 2023-10-13 15:39:08,092 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.69 vs. 
limit=15.0 2023-10-13 15:39:17,114 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1426558.0, ans=0.05 2023-10-13 15:39:34,727 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1426604.6666666667, ans=0.125 2023-10-13 15:39:44,522 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.37 vs. limit=22.5 2023-10-13 15:39:47,860 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.455e+02 1.694e+02 1.917e+02 2.185e+02 2.847e+02, threshold=3.834e+02, percent-clipped=0.0 2023-10-13 15:39:53,034 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1426651.3333333333, ans=10.0 2023-10-13 15:40:04,127 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=1426698.0, ans=0.05 2023-10-13 15:40:21,610 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1426791.3333333333, ans=0.2 2023-10-13 15:40:37,064 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1426838.0, ans=0.125 2023-10-13 15:40:38,223 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1426838.0, ans=0.0 2023-10-13 15:40:46,040 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1426884.6666666667, ans=0.0 2023-10-13 15:40:48,609 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1426884.6666666667, ans=0.0 2023-10-13 15:40:51,330 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1426884.6666666667, ans=0.125 2023-10-13 15:41:24,234 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1427024.6666666667, ans=0.125 2023-10-13 15:41:26,091 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1427024.6666666667, ans=0.125 2023-10-13 15:41:50,187 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.437e+02 1.662e+02 1.859e+02 2.111e+02 2.676e+02, threshold=3.717e+02, percent-clipped=0.0 2023-10-13 15:41:55,796 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=1427118.0, ans=0.5 2023-10-13 15:42:08,840 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1427164.6666666667, ans=0.0 2023-10-13 15:42:15,930 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1427211.3333333333, ans=0.0 2023-10-13 15:42:17,982 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1427211.3333333333, ans=0.125 2023-10-13 15:42:18,220 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=6.64 vs. 
limit=15.0 2023-10-13 15:42:35,475 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1427304.6666666667, ans=0.125 2023-10-13 15:42:43,770 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1427304.6666666667, ans=0.0 2023-10-13 15:42:55,726 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.20 vs. limit=22.5 2023-10-13 15:43:07,806 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.91 vs. limit=15.0 2023-10-13 15:43:12,354 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1427398.0, ans=0.125 2023-10-13 15:43:37,057 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=1427538.0, ans=0.07 2023-10-13 15:43:42,344 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1427538.0, ans=0.125 2023-10-13 15:43:46,618 INFO [train.py:1031] (3/4) Epoch 23, batch 5500, loss[loss=0.1733, simple_loss=0.2619, pruned_loss=0.04236, over 16599.00 frames. ], tot_loss[loss=0.1881, simple_loss=0.2794, pruned_loss=0.04835, over 30699675.92 frames. ], batch size: 56, lr: 1.50e-03, grad_scale: 32.0 2023-10-13 15:43:50,928 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.496e+02 1.795e+02 1.911e+02 2.137e+02 2.758e+02, threshold=3.822e+02, percent-clipped=0.0 2023-10-13 15:44:00,020 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-13 15:44:17,589 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1427678.0, ans=0.125 2023-10-13 15:44:35,912 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1427771.3333333333, ans=0.04949747468305833 2023-10-13 15:44:44,659 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1427818.0, ans=0.125 2023-10-13 15:44:47,994 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff3.min_abs, batch_count=1427818.0, ans=0.2 2023-10-13 15:44:53,450 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1427864.6666666667, ans=0.1 2023-10-13 15:45:02,988 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1427864.6666666667, ans=0.125 2023-10-13 15:45:07,311 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1427911.3333333333, ans=0.125 2023-10-13 15:45:11,509 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.62 vs. limit=15.0 2023-10-13 15:45:23,529 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.13 vs. 
limit=15.0 2023-10-13 15:45:30,558 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1428004.6666666667, ans=0.0 2023-10-13 15:45:34,539 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1428004.6666666667, ans=0.2 2023-10-13 15:45:37,425 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=7.97 vs. limit=15.0 2023-10-13 15:45:43,677 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1428051.3333333333, ans=0.0 2023-10-13 15:45:46,260 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.459e+02 1.780e+02 1.897e+02 2.131e+02 2.790e+02, threshold=3.794e+02, percent-clipped=0.0 2023-10-13 15:45:51,081 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1428098.0, ans=0.125 2023-10-13 15:45:59,380 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.79 vs. limit=15.0 2023-10-13 15:46:22,920 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1428191.3333333333, ans=0.0 2023-10-13 15:46:24,038 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1428191.3333333333, ans=0.125 2023-10-13 15:46:30,279 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1428238.0, ans=0.1 2023-10-13 15:46:33,789 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.13 vs. limit=15.0 2023-10-13 15:46:40,060 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1428284.6666666667, ans=0.125 2023-10-13 15:46:51,805 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1428331.3333333333, ans=0.125 2023-10-13 15:47:01,089 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1428378.0, ans=0.125 2023-10-13 15:47:03,859 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1428378.0, ans=0.125 2023-10-13 15:47:07,399 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1428378.0, ans=0.125 2023-10-13 15:47:13,555 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1428424.6666666667, ans=0.1 2023-10-13 15:47:17,046 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 15:47:25,292 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.47 vs. 
limit=15.0 2023-10-13 15:47:45,848 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.601e+02 1.862e+02 2.120e+02 2.487e+02 3.238e+02, threshold=4.239e+02, percent-clipped=0.0 2023-10-13 15:47:54,981 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.39 vs. limit=15.0 2023-10-13 15:48:04,726 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1428611.3333333333, ans=0.0 2023-10-13 15:48:08,495 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.91 vs. limit=15.0 2023-10-13 15:48:13,085 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1428658.0, ans=0.125 2023-10-13 15:48:32,808 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1428704.6666666667, ans=0.125 2023-10-13 15:48:35,197 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1428751.3333333333, ans=0.0 2023-10-13 15:48:49,415 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1428798.0, ans=0.1 2023-10-13 15:49:06,335 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1428844.6666666667, ans=0.025 2023-10-13 15:49:16,158 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.22 vs. limit=22.5 2023-10-13 15:49:17,652 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.29 vs. limit=15.0 2023-10-13 15:49:39,013 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.489e+02 1.783e+02 1.977e+02 2.195e+02 3.165e+02, threshold=3.955e+02, percent-clipped=0.0 2023-10-13 15:49:39,354 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1428984.6666666667, ans=0.125 2023-10-13 15:50:10,661 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1429124.6666666667, ans=0.125 2023-10-13 15:50:16,529 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.20 vs. limit=12.0 2023-10-13 15:50:20,075 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1429124.6666666667, ans=0.125 2023-10-13 15:50:31,812 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.29 vs. 
limit=15.0 2023-10-13 15:50:37,977 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1429218.0, ans=0.125 2023-10-13 15:50:54,066 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1429264.6666666667, ans=0.1 2023-10-13 15:51:21,194 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1429358.0, ans=0.125 2023-10-13 15:51:21,219 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1429358.0, ans=0.125 2023-10-13 15:51:23,013 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1429358.0, ans=0.125 2023-10-13 15:51:46,835 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.495e+02 1.857e+02 1.999e+02 2.203e+02 3.065e+02, threshold=3.999e+02, percent-clipped=0.0 2023-10-13 15:51:54,369 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1429498.0, ans=0.125 2023-10-13 15:51:54,518 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1429498.0, ans=0.125 2023-10-13 15:52:10,149 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1429544.6666666667, ans=0.1 2023-10-13 15:52:10,372 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.65 vs. limit=15.0 2023-10-13 15:52:15,328 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1429544.6666666667, ans=0.1 2023-10-13 15:52:16,696 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1429591.3333333333, ans=0.125 2023-10-13 15:52:19,776 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1429591.3333333333, ans=0.125 2023-10-13 15:52:24,411 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1429591.3333333333, ans=0.0 2023-10-13 15:52:43,285 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1429684.6666666667, ans=0.125 2023-10-13 15:53:06,178 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1429778.0, ans=0.0 2023-10-13 15:53:15,127 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1429824.6666666667, ans=0.125 2023-10-13 15:53:19,041 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1429824.6666666667, ans=0.125 2023-10-13 15:53:26,923 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1429871.3333333333, ans=0.125 2023-10-13 15:53:41,379 INFO [train.py:1031] (3/4) Epoch 23, batch 6000, loss[loss=0.1984, simple_loss=0.286, pruned_loss=0.05542, over 17019.00 frames. ], tot_loss[loss=0.1885, simple_loss=0.2799, pruned_loss=0.04852, over 31180340.39 frames. 
], batch size: 117, lr: 1.50e-03, grad_scale: 32.0 2023-10-13 15:53:48,761 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.511e+02 1.748e+02 1.983e+02 2.154e+02 2.929e+02, threshold=3.966e+02, percent-clipped=0.0 2023-10-13 15:53:50,840 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1429918.0, ans=0.125 2023-10-13 15:54:17,208 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1430011.3333333333, ans=0.125 2023-10-13 15:54:24,632 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1430058.0, ans=0.0 2023-10-13 15:54:25,525 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=1430058.0, ans=10.0 2023-10-13 15:54:25,654 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.05 vs. limit=15.0 2023-10-13 15:54:26,806 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.77 vs. limit=22.5 2023-10-13 15:54:50,028 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1430151.3333333333, ans=0.0 2023-10-13 15:54:59,226 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1430198.0, ans=0.125 2023-10-13 15:55:17,362 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1430244.6666666667, ans=0.125 2023-10-13 15:55:50,966 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.49 vs. 
limit=12.0 2023-10-13 15:55:52,146 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.467e+02 1.777e+02 1.921e+02 2.075e+02 2.854e+02, threshold=3.841e+02, percent-clipped=0.0 2023-10-13 15:55:58,032 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1430431.3333333333, ans=0.2 2023-10-13 15:56:00,708 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1430431.3333333333, ans=0.125 2023-10-13 15:56:00,791 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1430431.3333333333, ans=0.2 2023-10-13 15:56:08,533 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 15:56:26,188 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1430524.6666666667, ans=0.0 2023-10-13 15:56:27,626 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1430524.6666666667, ans=0.125 2023-10-13 15:56:39,447 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1430571.3333333333, ans=0.125 2023-10-13 15:56:41,194 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1430571.3333333333, ans=0.125 2023-10-13 15:56:42,103 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1430618.0, ans=0.125 2023-10-13 15:56:49,011 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1430618.0, ans=0.125 2023-10-13 15:57:37,784 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.39 vs. limit=15.0 2023-10-13 15:57:51,105 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1430851.3333333333, ans=0.125 2023-10-13 15:57:52,856 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.523e+02 1.822e+02 1.956e+02 2.154e+02 2.813e+02, threshold=3.911e+02, percent-clipped=0.0 2023-10-13 15:57:53,219 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1430851.3333333333, ans=0.1 2023-10-13 15:58:18,070 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1430944.6666666667, ans=0.125 2023-10-13 15:58:19,330 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.28 vs. limit=15.0 2023-10-13 15:58:20,529 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.95 vs. limit=6.0 2023-10-13 15:58:20,560 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.33 vs. 
limit=15.0 2023-10-13 15:58:35,872 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1431038.0, ans=0.0 2023-10-13 15:58:39,859 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.28 vs. limit=22.5 2023-10-13 15:58:45,000 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1431038.0, ans=0.125 2023-10-13 15:59:10,300 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1431131.3333333333, ans=0.125 2023-10-13 15:59:19,885 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1431178.0, ans=0.0 2023-10-13 15:59:25,710 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1431178.0, ans=0.0 2023-10-13 15:59:29,315 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1431224.6666666667, ans=0.125 2023-10-13 15:59:49,205 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.28 vs. limit=15.0 2023-10-13 16:00:07,544 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.23 vs. limit=15.0 2023-10-13 16:00:07,830 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.521e+02 1.843e+02 2.036e+02 2.211e+02 3.149e+02, threshold=4.072e+02, percent-clipped=0.0 2023-10-13 16:00:29,573 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1431411.3333333333, ans=0.0 2023-10-13 16:00:50,891 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1431458.0, ans=0.0 2023-10-13 16:00:58,681 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1431504.6666666667, ans=0.125 2023-10-13 16:01:18,362 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1431598.0, ans=0.025 2023-10-13 16:01:22,983 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1431598.0, ans=0.0 2023-10-13 16:02:20,126 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.482e+02 1.802e+02 1.953e+02 2.191e+02 3.655e+02, threshold=3.906e+02, percent-clipped=0.0 2023-10-13 16:02:21,266 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1431784.6666666667, ans=0.125 2023-10-13 16:02:33,937 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.06 vs. 
limit=22.5 2023-10-13 16:02:40,162 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1431878.0, ans=0.125 2023-10-13 16:02:46,914 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1431878.0, ans=0.125 2023-10-13 16:02:52,997 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1431924.6666666667, ans=0.1 2023-10-13 16:02:59,593 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.63 vs. limit=15.0 2023-10-13 16:03:38,301 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1432111.3333333333, ans=0.0 2023-10-13 16:03:43,313 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1432111.3333333333, ans=0.0 2023-10-13 16:04:15,305 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1432204.6666666667, ans=0.125 2023-10-13 16:04:17,659 INFO [train.py:1031] (3/4) Epoch 23, batch 6500, loss[loss=0.1846, simple_loss=0.2732, pruned_loss=0.04798, over 16630.00 frames. ], tot_loss[loss=0.1889, simple_loss=0.2804, pruned_loss=0.04876, over 31521835.21 frames. ], batch size: 61, lr: 1.50e-03, grad_scale: 16.0 2023-10-13 16:04:28,922 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.537e+02 1.901e+02 2.138e+02 2.399e+02 2.972e+02, threshold=4.275e+02, percent-clipped=0.0 2023-10-13 16:04:30,808 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 16:04:41,686 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1432298.0, ans=0.125 2023-10-13 16:04:49,347 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1432344.6666666667, ans=0.2 2023-10-13 16:05:00,285 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1432391.3333333333, ans=0.025 2023-10-13 16:05:11,310 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1432391.3333333333, ans=0.0 2023-10-13 16:05:19,383 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1432438.0, ans=0.2 2023-10-13 16:05:28,248 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1432484.6666666667, ans=0.125 2023-10-13 16:05:28,442 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1432484.6666666667, ans=0.125 2023-10-13 16:05:35,462 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=4.15 vs. limit=12.0 2023-10-13 16:05:38,811 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=5.15 vs. 
limit=12.0 2023-10-13 16:05:40,515 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 16:05:48,437 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1432531.3333333333, ans=0.0 2023-10-13 16:05:55,509 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1432578.0, ans=0.125 2023-10-13 16:05:56,670 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1432578.0, ans=0.2 2023-10-13 16:06:23,683 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.24 vs. limit=12.0 2023-10-13 16:06:42,695 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.369e+02 1.787e+02 1.902e+02 2.050e+02 2.910e+02, threshold=3.805e+02, percent-clipped=0.0 2023-10-13 16:06:44,068 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.06 vs. limit=15.0 2023-10-13 16:07:03,522 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1432811.3333333333, ans=0.0 2023-10-13 16:07:31,695 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1432904.6666666667, ans=0.2 2023-10-13 16:07:47,293 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.81 vs. limit=15.0 2023-10-13 16:08:12,770 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1433091.3333333333, ans=0.0 2023-10-13 16:08:26,118 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.02 vs. limit=15.0 2023-10-13 16:08:41,158 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.517e+02 1.786e+02 1.996e+02 2.181e+02 3.146e+02, threshold=3.992e+02, percent-clipped=0.0 2023-10-13 16:08:49,955 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1433231.3333333333, ans=0.2 2023-10-13 16:08:53,718 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1433231.3333333333, ans=0.1 2023-10-13 16:09:06,690 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff2.min_abs, batch_count=1433278.0, ans=0.1 2023-10-13 16:09:08,484 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1433324.6666666667, ans=0.0 2023-10-13 16:09:12,615 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1433324.6666666667, ans=0.125 2023-10-13 16:09:14,557 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1433324.6666666667, ans=0.125 2023-10-13 16:09:28,663 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.11 vs. 
limit=15.0 2023-10-13 16:09:34,503 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1433418.0, ans=0.0 2023-10-13 16:09:41,279 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1433418.0, ans=0.0 2023-10-13 16:09:41,686 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.38 vs. limit=6.0 2023-10-13 16:10:45,717 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1433651.3333333333, ans=0.125 2023-10-13 16:10:57,453 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.339e+02 1.687e+02 1.878e+02 2.141e+02 3.146e+02, threshold=3.757e+02, percent-clipped=0.0 2023-10-13 16:11:12,842 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.13 vs. limit=22.5 2023-10-13 16:11:20,597 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1433744.6666666667, ans=0.125 2023-10-13 16:11:44,451 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.01 vs. limit=15.0 2023-10-13 16:11:54,991 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1433884.6666666667, ans=0.0 2023-10-13 16:12:17,792 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.01 vs. limit=15.0 2023-10-13 16:12:18,508 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1433978.0, ans=0.0 2023-10-13 16:12:20,081 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1433978.0, ans=0.125 2023-10-13 16:13:03,697 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.393e+02 1.694e+02 1.874e+02 2.041e+02 2.904e+02, threshold=3.749e+02, percent-clipped=0.0 2023-10-13 16:13:35,817 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1434258.0, ans=0.125 2023-10-13 16:13:39,111 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.59 vs. limit=15.0 2023-10-13 16:13:50,700 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1434304.6666666667, ans=0.125 2023-10-13 16:14:11,090 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.31 vs. limit=12.0 2023-10-13 16:14:56,235 INFO [train.py:1031] (3/4) Epoch 23, batch 7000, loss[loss=0.1974, simple_loss=0.2869, pruned_loss=0.05399, over 16998.00 frames. ], tot_loss[loss=0.1892, simple_loss=0.2809, pruned_loss=0.04877, over 31815777.12 frames. 
], batch size: 123, lr: 1.50e-03, grad_scale: 16.0 2023-10-13 16:14:56,527 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1434584.6666666667, ans=0.5 2023-10-13 16:15:03,059 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1434584.6666666667, ans=0.1 2023-10-13 16:15:04,631 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.530e+02 1.834e+02 1.957e+02 2.282e+02 3.219e+02, threshold=3.914e+02, percent-clipped=0.0 2023-10-13 16:15:27,152 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1434678.0, ans=0.125 2023-10-13 16:15:44,062 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1434724.6666666667, ans=0.125 2023-10-13 16:16:06,016 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1434818.0, ans=0.0 2023-10-13 16:16:06,902 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1434818.0, ans=0.125 2023-10-13 16:16:07,389 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=6.67 vs. limit=15.0 2023-10-13 16:16:14,282 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1434864.6666666667, ans=0.125 2023-10-13 16:16:49,990 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.71 vs. limit=15.0 2023-10-13 16:17:01,306 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1435051.3333333333, ans=0.1 2023-10-13 16:17:03,511 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.555e+02 1.828e+02 1.958e+02 2.290e+02 2.902e+02, threshold=3.916e+02, percent-clipped=0.0 2023-10-13 16:17:14,663 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1435144.6666666667, ans=0.0 2023-10-13 16:17:19,179 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=1435144.6666666667, ans=0.025 2023-10-13 16:17:40,959 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1435238.0, ans=0.125 2023-10-13 16:17:47,925 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.10 vs. limit=22.5 2023-10-13 16:17:48,036 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.58 vs. 
limit=10.0 2023-10-13 16:18:05,817 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1435331.3333333333, ans=0.125 2023-10-13 16:18:05,914 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1435331.3333333333, ans=0.95 2023-10-13 16:18:12,179 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1435331.3333333333, ans=0.125 2023-10-13 16:18:13,036 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1435331.3333333333, ans=0.1 2023-10-13 16:18:14,130 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1435378.0, ans=0.0 2023-10-13 16:18:19,926 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1435378.0, ans=0.125 2023-10-13 16:18:22,148 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.54 vs. limit=22.5 2023-10-13 16:18:26,140 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1435424.6666666667, ans=0.0 2023-10-13 16:18:33,984 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.12 vs. limit=22.5 2023-10-13 16:18:40,938 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.08 vs. limit=12.0 2023-10-13 16:18:56,113 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1435518.0, ans=0.2 2023-10-13 16:19:01,252 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 16:19:01,297 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer_ff2.min_abs, batch_count=1435518.0, ans=0.1 2023-10-13 16:19:01,917 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.494e+02 1.802e+02 1.986e+02 2.152e+02 2.844e+02, threshold=3.973e+02, percent-clipped=0.0 2023-10-13 16:19:07,596 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1435564.6666666667, ans=0.09899494936611666 2023-10-13 16:19:16,910 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.37 vs. 
limit=22.5 2023-10-13 16:19:52,703 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 16:20:19,728 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1435798.0, ans=0.125 2023-10-13 16:20:23,343 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1435798.0, ans=0.1 2023-10-13 16:20:29,998 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1435844.6666666667, ans=0.1 2023-10-13 16:20:54,121 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1435938.0, ans=0.125 2023-10-13 16:21:01,946 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1435938.0, ans=0.2 2023-10-13 16:21:17,555 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.564e+02 1.797e+02 2.032e+02 2.285e+02 3.283e+02, threshold=4.064e+02, percent-clipped=0.0 2023-10-13 16:22:01,792 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.27 vs. limit=15.0 2023-10-13 16:22:15,170 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.82 vs. limit=22.5 2023-10-13 16:22:17,610 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1436218.0, ans=0.125 2023-10-13 16:22:21,938 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=1436264.6666666667, ans=15.0 2023-10-13 16:22:37,691 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=7.22 vs. limit=12.0 2023-10-13 16:22:44,605 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1436358.0, ans=0.125 2023-10-13 16:22:56,748 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1436404.6666666667, ans=0.0 2023-10-13 16:23:06,736 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.30 vs. 
limit=15.0 2023-10-13 16:23:10,960 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1436451.3333333333, ans=0.025 2023-10-13 16:23:14,045 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1436451.3333333333, ans=0.125 2023-10-13 16:23:14,636 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.462e+02 1.751e+02 1.904e+02 2.128e+02 2.870e+02, threshold=3.808e+02, percent-clipped=0.0 2023-10-13 16:23:19,204 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1436498.0, ans=0.125 2023-10-13 16:23:26,573 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1436498.0, ans=0.125 2023-10-13 16:23:39,662 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1436544.6666666667, ans=0.125 2023-10-13 16:23:41,884 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.70 vs. limit=15.0 2023-10-13 16:23:44,324 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.70 vs. limit=15.0 2023-10-13 16:23:59,424 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_abs, batch_count=1436638.0, ans=0.5 2023-10-13 16:24:02,989 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1436684.6666666667, ans=0.125 2023-10-13 16:24:04,854 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1436684.6666666667, ans=0.0 2023-10-13 16:24:05,761 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1436684.6666666667, ans=0.0 2023-10-13 16:24:05,790 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1436684.6666666667, ans=0.125 2023-10-13 16:24:06,545 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1436684.6666666667, ans=0.1 2023-10-13 16:24:21,782 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1436731.3333333333, ans=0.0 2023-10-13 16:24:25,545 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1436778.0, ans=0.1 2023-10-13 16:24:28,570 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1436778.0, ans=0.125 2023-10-13 16:24:41,414 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1436824.6666666667, ans=0.0 2023-10-13 16:25:02,009 INFO [train.py:1031] (3/4) Epoch 23, batch 7500, loss[loss=0.1973, simple_loss=0.2753, pruned_loss=0.05964, over 16560.00 frames. ], tot_loss[loss=0.189, simple_loss=0.2806, pruned_loss=0.04872, over 32022503.22 frames. 
], batch size: 56, lr: 1.50e-03, grad_scale: 32.0 2023-10-13 16:25:09,208 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1436918.0, ans=0.0 2023-10-13 16:25:11,626 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.500e+02 1.763e+02 1.970e+02 2.127e+02 2.890e+02, threshold=3.941e+02, percent-clipped=0.0 2023-10-13 16:25:15,290 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1436964.6666666667, ans=0.125 2023-10-13 16:25:29,171 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1437011.3333333333, ans=0.125 2023-10-13 16:25:56,657 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.09 vs. limit=12.0 2023-10-13 16:26:10,516 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1437151.3333333333, ans=0.2 2023-10-13 16:26:16,472 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1437198.0, ans=0.125 2023-10-13 16:26:20,605 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1437198.0, ans=0.125 2023-10-13 16:26:44,077 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1437291.3333333333, ans=0.2 2023-10-13 16:26:46,536 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1437291.3333333333, ans=0.0 2023-10-13 16:27:01,793 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1437338.0, ans=0.0 2023-10-13 16:27:06,426 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1437384.6666666667, ans=0.125 2023-10-13 16:27:15,323 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.572e+02 1.770e+02 1.941e+02 2.174e+02 3.115e+02, threshold=3.882e+02, percent-clipped=0.0 2023-10-13 16:27:43,467 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1437524.6666666667, ans=0.0 2023-10-13 16:28:00,914 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1437571.3333333333, ans=0.0 2023-10-13 16:28:04,136 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1437571.3333333333, ans=0.2 2023-10-13 16:28:04,252 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1437571.3333333333, ans=0.125 2023-10-13 16:28:08,817 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.23 vs. 
limit=10.0 2023-10-13 16:28:16,764 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1437618.0, ans=0.125 2023-10-13 16:28:31,135 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1437664.6666666667, ans=0.0 2023-10-13 16:28:38,277 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1437664.6666666667, ans=0.0 2023-10-13 16:29:01,050 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1437758.0, ans=0.0 2023-10-13 16:29:14,400 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1437804.6666666667, ans=0.125 2023-10-13 16:29:19,604 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.26 vs. limit=15.0 2023-10-13 16:29:33,953 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1437851.3333333333, ans=0.125 2023-10-13 16:29:39,587 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1437898.0, ans=0.0 2023-10-13 16:29:39,807 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.17 vs. limit=15.0 2023-10-13 16:29:40,078 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.519e+02 1.769e+02 1.911e+02 2.100e+02 3.111e+02, threshold=3.822e+02, percent-clipped=0.0 2023-10-13 16:29:40,404 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1437898.0, ans=0.0 2023-10-13 16:29:41,447 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1437898.0, ans=0.0 2023-10-13 16:29:57,687 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1437944.6666666667, ans=0.0 2023-10-13 16:30:18,816 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1438038.0, ans=0.09899494936611666 2023-10-13 16:30:24,607 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.87 vs. limit=15.0 2023-10-13 16:30:34,545 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1438084.6666666667, ans=0.125 2023-10-13 16:30:45,747 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1438131.3333333333, ans=0.0 2023-10-13 16:31:03,383 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1438178.0, ans=0.0 2023-10-13 16:31:06,008 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1438224.6666666667, ans=0.0 2023-10-13 16:31:13,052 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.74 vs. 
limit=12.0 2023-10-13 16:31:33,337 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1438318.0, ans=0.015 2023-10-13 16:31:35,449 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1438318.0, ans=0.125 2023-10-13 16:31:38,345 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1438318.0, ans=0.125 2023-10-13 16:31:43,093 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.551e+02 1.851e+02 2.001e+02 2.200e+02 4.052e+02, threshold=4.001e+02, percent-clipped=1.0 2023-10-13 16:31:55,392 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1438364.6666666667, ans=0.125 2023-10-13 16:31:57,018 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.89 vs. limit=15.0 2023-10-13 16:32:25,947 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1438504.6666666667, ans=0.0 2023-10-13 16:32:36,290 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1438551.3333333333, ans=0.125 2023-10-13 16:32:47,809 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1438551.3333333333, ans=0.125 2023-10-13 16:32:54,356 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=9.77 vs. limit=15.0 2023-10-13 16:32:55,533 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.86 vs. 
limit=12.0 2023-10-13 16:33:03,274 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1438644.6666666667, ans=0.1 2023-10-13 16:33:21,094 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1438691.3333333333, ans=0.0 2023-10-13 16:33:35,262 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1438738.0, ans=0.2 2023-10-13 16:33:58,462 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1438831.3333333333, ans=0.0 2023-10-13 16:34:00,375 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.531e+02 1.777e+02 1.947e+02 2.096e+02 2.525e+02, threshold=3.893e+02, percent-clipped=0.0 2023-10-13 16:34:07,321 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1438831.3333333333, ans=0.125 2023-10-13 16:34:43,264 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1438971.3333333333, ans=0.0 2023-10-13 16:34:45,697 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1438971.3333333333, ans=0.0 2023-10-13 16:35:04,652 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1439064.6666666667, ans=0.0 2023-10-13 16:35:06,083 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1439064.6666666667, ans=0.1 2023-10-13 16:35:35,232 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1439158.0, ans=0.0 2023-10-13 16:35:38,000 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1439158.0, ans=0.125 2023-10-13 16:35:55,693 INFO [train.py:1031] (3/4) Epoch 23, batch 8000, loss[loss=0.1894, simple_loss=0.281, pruned_loss=0.04889, over 16926.00 frames. ], tot_loss[loss=0.1882, simple_loss=0.2801, pruned_loss=0.04821, over 32173336.71 frames. 
], batch size: 110, lr: 1.50e-03, grad_scale: 16.0 2023-10-13 16:35:58,015 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1439251.3333333333, ans=0.1 2023-10-13 16:35:58,848 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1439251.3333333333, ans=0.125 2023-10-13 16:36:04,174 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1439251.3333333333, ans=0.1 2023-10-13 16:36:08,376 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.441e+02 1.666e+02 1.834e+02 2.012e+02 3.002e+02, threshold=3.668e+02, percent-clipped=0.0 2023-10-13 16:36:33,903 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1439391.3333333333, ans=0.025 2023-10-13 16:36:36,433 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1439391.3333333333, ans=0.125 2023-10-13 16:36:37,453 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1439391.3333333333, ans=0.125 2023-10-13 16:36:40,283 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1439438.0, ans=0.2 2023-10-13 16:36:42,236 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1439438.0, ans=0.0 2023-10-13 16:36:46,269 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1439438.0, ans=0.0 2023-10-13 16:36:50,679 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.43 vs. limit=15.0 2023-10-13 16:37:08,832 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1439531.3333333333, ans=0.125 2023-10-13 16:37:15,682 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1439531.3333333333, ans=0.125 2023-10-13 16:37:27,870 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1439624.6666666667, ans=0.0 2023-10-13 16:37:31,960 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.38 vs. 
limit=15.0 2023-10-13 16:37:42,778 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1439671.3333333333, ans=0.0 2023-10-13 16:37:51,427 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1439718.0, ans=0.0 2023-10-13 16:37:59,948 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1439718.0, ans=0.1 2023-10-13 16:38:02,937 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.421e+02 1.750e+02 1.900e+02 2.051e+02 2.511e+02, threshold=3.801e+02, percent-clipped=0.0 2023-10-13 16:38:05,519 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1439764.6666666667, ans=0.125 2023-10-13 16:38:22,785 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.17 vs. limit=22.5 2023-10-13 16:38:30,338 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1439858.0, ans=0.125 2023-10-13 16:38:34,023 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.24 vs. limit=22.5 2023-10-13 16:38:43,796 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1439904.6666666667, ans=0.125 2023-10-13 16:39:05,193 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.62 vs. limit=15.0 2023-10-13 16:39:23,819 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1440044.6666666667, ans=0.125 2023-10-13 16:39:31,278 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1440044.6666666667, ans=0.2 2023-10-13 16:39:43,108 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1440091.3333333333, ans=0.1 2023-10-13 16:40:02,855 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1440184.6666666667, ans=0.0 2023-10-13 16:40:03,072 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.14 vs. 
limit=15.0 2023-10-13 16:40:09,405 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1440184.6666666667, ans=0.125 2023-10-13 16:40:14,608 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.437e+02 1.825e+02 1.970e+02 2.326e+02 3.328e+02, threshold=3.941e+02, percent-clipped=0.0 2023-10-13 16:41:08,445 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1440371.3333333333, ans=0.125 2023-10-13 16:41:36,120 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1440464.6666666667, ans=0.125 2023-10-13 16:41:51,705 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1440558.0, ans=0.05 2023-10-13 16:41:53,517 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1440558.0, ans=0.09899494936611666 2023-10-13 16:41:56,912 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1440558.0, ans=0.125 2023-10-13 16:42:26,084 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.35 vs. limit=22.5 2023-10-13 16:42:32,969 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.23 vs. limit=15.0 2023-10-13 16:42:35,952 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.510e+02 1.792e+02 1.949e+02 2.201e+02 2.893e+02, threshold=3.898e+02, percent-clipped=0.0 2023-10-13 16:42:39,088 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=10.58 vs. limit=22.5 2023-10-13 16:43:07,138 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.75 vs. limit=15.0 2023-10-13 16:43:27,707 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1440884.6666666667, ans=0.0 2023-10-13 16:43:32,693 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1440884.6666666667, ans=0.125 2023-10-13 16:43:52,387 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.49 vs. limit=15.0 2023-10-13 16:44:01,248 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1441024.6666666667, ans=0.0 2023-10-13 16:44:05,251 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.92 vs. limit=15.0 2023-10-13 16:44:06,093 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.95 vs. 
limit=15.0 2023-10-13 16:44:07,835 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1441024.6666666667, ans=0.2 2023-10-13 16:44:12,378 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.57 vs. limit=15.0 2023-10-13 16:44:38,914 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.803e+02 1.996e+02 2.210e+02 3.797e+02, threshold=3.992e+02, percent-clipped=0.0 2023-10-13 16:44:40,708 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1441164.6666666667, ans=0.2 2023-10-13 16:45:44,154 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1441398.0, ans=0.1 2023-10-13 16:46:05,721 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1441491.3333333333, ans=0.125 2023-10-13 16:46:34,616 INFO [train.py:1031] (3/4) Epoch 23, batch 8500, loss[loss=0.1763, simple_loss=0.2777, pruned_loss=0.0374, over 16919.00 frames. ], tot_loss[loss=0.1885, simple_loss=0.2804, pruned_loss=0.04824, over 32317869.15 frames. ], batch size: 138, lr: 1.49e-03, grad_scale: 32.0 2023-10-13 16:46:37,574 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.18 vs. limit=6.0 2023-10-13 16:46:48,757 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.483e+02 1.800e+02 1.973e+02 2.175e+02 2.720e+02, threshold=3.946e+02, percent-clipped=0.0 2023-10-13 16:47:11,367 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.26 vs. limit=15.0 2023-10-13 16:47:16,690 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.53 vs. limit=22.5 2023-10-13 16:47:23,124 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-13 16:47:43,873 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1441818.0, ans=0.125 2023-10-13 16:47:44,982 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1441818.0, ans=0.0 2023-10-13 16:48:06,030 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1441911.3333333333, ans=0.07 2023-10-13 16:48:10,757 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1441911.3333333333, ans=0.0 2023-10-13 16:48:27,048 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=13.56 vs. 
limit=15.0 2023-10-13 16:48:52,261 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1442051.3333333333, ans=0.0 2023-10-13 16:49:00,960 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1442098.0, ans=0.0 2023-10-13 16:49:01,885 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1442098.0, ans=0.0 2023-10-13 16:49:02,585 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.568e+02 1.845e+02 2.071e+02 2.367e+02 3.411e+02, threshold=4.142e+02, percent-clipped=0.0 2023-10-13 16:49:15,745 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1442144.6666666667, ans=0.0 2023-10-13 16:49:19,093 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1442144.6666666667, ans=0.0 2023-10-13 16:49:24,359 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=2.074e-02 2023-10-13 16:49:25,386 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1442191.3333333333, ans=0.0 2023-10-13 16:49:28,042 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.54 vs. limit=15.0 2023-10-13 16:49:42,164 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1442238.0, ans=0.125 2023-10-13 16:49:46,904 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=1442238.0, ans=0.025 2023-10-13 16:49:50,458 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1442238.0, ans=0.125 2023-10-13 16:49:53,280 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1442284.6666666667, ans=0.125 2023-10-13 16:49:56,847 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1442284.6666666667, ans=0.0 2023-10-13 16:50:18,897 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1442331.3333333333, ans=0.125 2023-10-13 16:50:44,668 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.93 vs. 
limit=15.0 2023-10-13 16:51:09,987 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1442518.0, ans=0.125 2023-10-13 16:51:11,002 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1442518.0, ans=0.1 2023-10-13 16:51:16,228 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1442518.0, ans=0.2 2023-10-13 16:51:21,004 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1442564.6666666667, ans=0.0 2023-10-13 16:51:24,496 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.359e+02 1.672e+02 1.824e+02 1.947e+02 2.708e+02, threshold=3.648e+02, percent-clipped=0.0 2023-10-13 16:51:35,313 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1442611.3333333333, ans=0.0 2023-10-13 16:51:36,837 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1442611.3333333333, ans=0.0 2023-10-13 16:51:48,536 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1442658.0, ans=0.0 2023-10-13 16:51:56,721 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1442658.0, ans=0.125 2023-10-13 16:51:58,439 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 16:52:11,058 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.74 vs. limit=22.5 2023-10-13 16:52:28,546 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.30 vs. limit=22.5 2023-10-13 16:52:31,780 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1442798.0, ans=0.125 2023-10-13 16:52:48,857 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1442844.6666666667, ans=0.07 2023-10-13 16:53:16,961 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1442938.0, ans=0.125 2023-10-13 16:53:22,800 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1442984.6666666667, ans=0.125 2023-10-13 16:53:35,393 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1443031.3333333333, ans=0.125 2023-10-13 16:53:36,002 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.492e+02 1.715e+02 1.886e+02 2.042e+02 3.010e+02, threshold=3.772e+02, percent-clipped=0.0 2023-10-13 16:53:36,613 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.74 vs. 
limit=15.0 2023-10-13 16:54:31,791 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1443218.0, ans=0.0 2023-10-13 16:54:45,555 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1443264.6666666667, ans=0.125 2023-10-13 16:54:48,766 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1443311.3333333333, ans=0.125 2023-10-13 16:54:55,102 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1443311.3333333333, ans=0.125 2023-10-13 16:54:55,104 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1443311.3333333333, ans=0.1 2023-10-13 16:55:07,558 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1443358.0, ans=0.125 2023-10-13 16:55:22,252 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1443404.6666666667, ans=0.2 2023-10-13 16:55:38,172 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.496e+02 1.797e+02 1.951e+02 2.101e+02 2.636e+02, threshold=3.903e+02, percent-clipped=0.0 2023-10-13 16:56:04,507 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1443591.3333333333, ans=0.0 2023-10-13 16:56:15,665 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1443638.0, ans=0.125 2023-10-13 16:56:16,541 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1443638.0, ans=0.2 2023-10-13 16:57:17,359 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1443871.3333333333, ans=10.0 2023-10-13 16:57:32,129 INFO [train.py:1031] (3/4) Epoch 23, batch 9000, loss[loss=0.1809, simple_loss=0.2797, pruned_loss=0.04111, over 16853.00 frames. ], tot_loss[loss=0.188, simple_loss=0.28, pruned_loss=0.04798, over 32449382.08 frames. ], batch size: 98, lr: 1.49e-03, grad_scale: 16.0 2023-10-13 16:57:47,069 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.559e+02 1.798e+02 1.968e+02 2.304e+02 3.237e+02, threshold=3.935e+02, percent-clipped=0.0 2023-10-13 16:58:00,312 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1444011.3333333333, ans=0.125 2023-10-13 16:58:07,642 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.93 vs. limit=22.5 2023-10-13 16:58:08,162 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1444058.0, ans=0.125 2023-10-13 16:58:13,112 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1444058.0, ans=0.125 2023-10-13 16:58:16,186 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.34 vs. 
limit=15.0 2023-10-13 16:58:17,241 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1444058.0, ans=0.125 2023-10-13 16:58:21,524 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1444104.6666666667, ans=0.1 2023-10-13 16:58:47,521 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1444198.0, ans=0.125 2023-10-13 16:58:54,782 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.30 vs. limit=22.5 2023-10-13 16:58:58,978 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1444244.6666666667, ans=0.0 2023-10-13 16:59:43,582 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1444384.6666666667, ans=0.0 2023-10-13 16:59:48,779 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.415e+02 1.732e+02 1.938e+02 2.179e+02 3.070e+02, threshold=3.875e+02, percent-clipped=0.0 2023-10-13 17:00:15,376 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.69 vs. limit=15.0 2023-10-13 17:00:16,759 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1444524.6666666667, ans=0.125 2023-10-13 17:00:28,633 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1444571.3333333333, ans=0.0 2023-10-13 17:00:35,579 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1444618.0, ans=0.0 2023-10-13 17:00:58,612 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1444711.3333333333, ans=0.07 2023-10-13 17:01:00,801 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1444711.3333333333, ans=0.04949747468305833 2023-10-13 17:01:21,607 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1444804.6666666667, ans=0.0 2023-10-13 17:01:31,276 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.85 vs. limit=15.0 2023-10-13 17:01:37,310 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1444851.3333333333, ans=0.125 2023-10-13 17:01:55,330 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.805e+02 1.966e+02 2.223e+02 3.098e+02, threshold=3.931e+02, percent-clipped=0.0 2023-10-13 17:02:02,054 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1444944.6666666667, ans=0.125 2023-10-13 17:02:05,123 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=10.71 vs. 
limit=22.5 2023-10-13 17:02:37,153 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1445038.0, ans=0.0 2023-10-13 17:02:41,492 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1445084.6666666667, ans=0.025 2023-10-13 17:02:46,096 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1445084.6666666667, ans=0.125 2023-10-13 17:03:04,989 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.26 vs. limit=15.0 2023-10-13 17:03:06,654 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1445178.0, ans=0.125 2023-10-13 17:03:19,111 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1445224.6666666667, ans=0.0 2023-10-13 17:03:38,380 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1445318.0, ans=0.0 2023-10-13 17:03:48,447 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1445364.6666666667, ans=0.125 2023-10-13 17:03:50,558 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1445364.6666666667, ans=0.0 2023-10-13 17:03:54,010 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.500e+02 1.797e+02 1.968e+02 2.169e+02 2.957e+02, threshold=3.935e+02, percent-clipped=0.0 2023-10-13 17:03:58,574 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 17:04:15,209 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1445458.0, ans=0.2 2023-10-13 17:04:17,393 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1445458.0, ans=0.125 2023-10-13 17:04:25,510 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.31 vs. limit=15.0 2023-10-13 17:04:31,002 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=1445504.6666666667, ans=0.05 2023-10-13 17:04:38,586 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1445504.6666666667, ans=0.2 2023-10-13 17:04:55,465 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.71 vs. 
limit=15.0 2023-10-13 17:05:01,496 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1445598.0, ans=0.07 2023-10-13 17:05:09,032 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1445644.6666666667, ans=0.125 2023-10-13 17:05:32,501 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1445691.3333333333, ans=0.125 2023-10-13 17:05:35,082 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff3.min_abs, batch_count=1445738.0, ans=0.2 2023-10-13 17:05:43,109 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1445738.0, ans=0.1 2023-10-13 17:06:12,405 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.579e+02 1.807e+02 1.980e+02 2.218e+02 3.235e+02, threshold=3.959e+02, percent-clipped=0.0 2023-10-13 17:06:59,931 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1445971.3333333333, ans=0.0 2023-10-13 17:07:26,163 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1446064.6666666667, ans=0.125 2023-10-13 17:07:50,271 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.26 vs. limit=15.0 2023-10-13 17:07:54,218 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_positive, batch_count=1446158.0, ans=0.05 2023-10-13 17:08:17,152 INFO [train.py:1031] (3/4) Epoch 23, batch 9500, loss[loss=0.1767, simple_loss=0.2797, pruned_loss=0.0368, over 16832.00 frames. ], tot_loss[loss=0.1885, simple_loss=0.2806, pruned_loss=0.04818, over 32499229.01 frames. ], batch size: 98, lr: 1.49e-03, grad_scale: 16.0 2023-10-13 17:08:17,486 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1446251.3333333333, ans=0.0 2023-10-13 17:08:37,445 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.569e+02 1.889e+02 2.062e+02 2.205e+02 3.305e+02, threshold=4.124e+02, percent-clipped=0.0 2023-10-13 17:08:59,559 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1446391.3333333333, ans=0.125 2023-10-13 17:09:22,234 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.44 vs. 
limit=22.5 2023-10-13 17:09:32,981 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1446531.3333333333, ans=0.125 2023-10-13 17:09:38,677 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1446531.3333333333, ans=0.0 2023-10-13 17:09:40,088 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1446531.3333333333, ans=0.09899494936611666 2023-10-13 17:09:48,673 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1446578.0, ans=0.125 2023-10-13 17:10:20,451 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=1.214e-01 2023-10-13 17:10:21,811 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.51 vs. limit=15.0 2023-10-13 17:10:35,542 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1446764.6666666667, ans=0.09899494936611666 2023-10-13 17:10:37,801 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1446764.6666666667, ans=0.1 2023-10-13 17:10:38,424 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.568e+02 1.832e+02 1.959e+02 2.197e+02 3.091e+02, threshold=3.917e+02, percent-clipped=0.0 2023-10-13 17:10:44,821 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1446811.3333333333, ans=0.2 2023-10-13 17:11:00,916 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1446858.0, ans=0.5 2023-10-13 17:11:06,382 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.55 vs. limit=22.5 2023-10-13 17:11:13,214 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1446904.6666666667, ans=0.125 2023-10-13 17:11:14,470 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.09 vs. limit=15.0 2023-10-13 17:11:28,931 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1446951.3333333333, ans=0.0 2023-10-13 17:11:30,050 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1446951.3333333333, ans=0.125 2023-10-13 17:11:32,866 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.35 vs. 
limit=15.0 2023-10-13 17:11:35,482 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1446951.3333333333, ans=0.125 2023-10-13 17:11:45,430 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1446998.0, ans=0.1 2023-10-13 17:12:07,712 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1447091.3333333333, ans=0.1 2023-10-13 17:12:41,822 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1447184.6666666667, ans=0.0 2023-10-13 17:12:49,390 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1447231.3333333333, ans=0.125 2023-10-13 17:12:50,757 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.467e+02 1.833e+02 1.997e+02 2.164e+02 3.370e+02, threshold=3.994e+02, percent-clipped=0.0 2023-10-13 17:12:52,968 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1447231.3333333333, ans=0.125 2023-10-13 17:13:33,582 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1447418.0, ans=0.0 2023-10-13 17:13:49,174 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.91 vs. limit=15.0 2023-10-13 17:13:50,549 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 17:14:36,638 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1447651.3333333333, ans=0.0 2023-10-13 17:14:46,008 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.78 vs. 
limit=15.0 2023-10-13 17:14:48,976 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1447698.0, ans=0.0 2023-10-13 17:14:53,608 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.508e+02 1.786e+02 1.944e+02 2.128e+02 3.330e+02, threshold=3.889e+02, percent-clipped=0.0 2023-10-13 17:14:55,190 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1447698.0, ans=0.125 2023-10-13 17:15:16,401 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1447791.3333333333, ans=0.125 2023-10-13 17:15:17,169 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1447791.3333333333, ans=0.125 2023-10-13 17:15:31,715 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1447838.0, ans=0.125 2023-10-13 17:15:39,124 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1447884.6666666667, ans=0.125 2023-10-13 17:15:39,217 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=1447884.6666666667, ans=10.0 2023-10-13 17:15:46,745 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1447884.6666666667, ans=0.0 2023-10-13 17:16:14,224 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1447978.0, ans=0.125 2023-10-13 17:16:38,589 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1448071.3333333333, ans=0.125 2023-10-13 17:16:40,161 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1448071.3333333333, ans=0.1 2023-10-13 17:16:40,446 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.23 vs. limit=6.0 2023-10-13 17:16:57,482 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=13.35 vs. limit=22.5 2023-10-13 17:16:58,541 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.579e+02 1.753e+02 1.894e+02 2.074e+02 2.901e+02, threshold=3.788e+02, percent-clipped=0.0 2023-10-13 17:17:05,230 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1448211.3333333333, ans=0.125 2023-10-13 17:17:05,506 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.51 vs. limit=22.5 2023-10-13 17:17:19,796 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1448258.0, ans=0.125 2023-10-13 17:17:26,743 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=12.09 vs. 
limit=22.5 2023-10-13 17:17:27,865 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1448304.6666666667, ans=0.125 2023-10-13 17:17:34,648 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1448304.6666666667, ans=0.0 2023-10-13 17:17:52,222 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=14.75 vs. limit=22.5 2023-10-13 17:18:37,147 INFO [train.py:1031] (3/4) Epoch 23, batch 10000, loss[loss=0.2279, simple_loss=0.3011, pruned_loss=0.07739, over 16096.00 frames. ], tot_loss[loss=0.1881, simple_loss=0.28, pruned_loss=0.04807, over 32568579.69 frames. ], batch size: 296, lr: 1.49e-03, grad_scale: 32.0 2023-10-13 17:18:48,263 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1448631.3333333333, ans=0.0 2023-10-13 17:18:49,238 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 17:18:54,666 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.429e+02 1.758e+02 1.928e+02 2.143e+02 3.649e+02, threshold=3.856e+02, percent-clipped=0.0 2023-10-13 17:19:40,821 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.06 vs. limit=12.0 2023-10-13 17:19:46,490 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1448864.6666666667, ans=0.2 2023-10-13 17:19:47,514 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 17:19:49,515 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1448864.6666666667, ans=0.0 2023-10-13 17:20:07,072 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.12 vs. limit=15.0 2023-10-13 17:20:08,386 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1448911.3333333333, ans=0.125 2023-10-13 17:20:28,605 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1449004.6666666667, ans=0.125 2023-10-13 17:20:56,842 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1449098.0, ans=0.07 2023-10-13 17:20:56,909 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1449098.0, ans=0.125 2023-10-13 17:20:59,190 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.559e+02 1.838e+02 1.989e+02 2.270e+02 3.130e+02, threshold=3.977e+02, percent-clipped=0.0 2023-10-13 17:21:23,339 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1449191.3333333333, ans=0.0 2023-10-13 17:21:29,901 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1449238.0, ans=0.0 2023-10-13 17:21:32,921 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.66 vs. 
limit=15.0 2023-10-13 17:21:48,477 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1449284.6666666667, ans=0.1 2023-10-13 17:21:49,654 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.61 vs. limit=15.0 2023-10-13 17:22:17,488 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1449424.6666666667, ans=0.125 2023-10-13 17:22:45,875 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1449518.0, ans=0.0 2023-10-13 17:23:02,550 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.514e+02 1.829e+02 1.983e+02 2.185e+02 2.990e+02, threshold=3.966e+02, percent-clipped=0.0 2023-10-13 17:23:24,715 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1449658.0, ans=0.125 2023-10-13 17:23:30,843 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1449658.0, ans=0.125 2023-10-13 17:23:32,571 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1449658.0, ans=0.125 2023-10-13 17:23:51,069 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1449751.3333333333, ans=0.2 2023-10-13 17:23:53,871 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.15 vs. limit=12.0 2023-10-13 17:24:00,718 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1449751.3333333333, ans=0.125 2023-10-13 17:24:19,889 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1449844.6666666667, ans=0.1 2023-10-13 17:24:19,926 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1449844.6666666667, ans=0.0 2023-10-13 17:24:30,714 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.23 vs. limit=15.0 2023-10-13 17:24:58,959 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1449938.0, ans=0.125 2023-10-13 17:25:16,182 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.90 vs. limit=15.0 2023-10-13 17:25:25,243 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.464e+02 1.855e+02 2.046e+02 2.260e+02 4.458e+02, threshold=4.092e+02, percent-clipped=1.0 2023-10-13 17:25:44,299 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1450124.6666666667, ans=0.125 2023-10-13 17:25:46,609 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.16 vs. 
limit=12.0 2023-10-13 17:25:47,190 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1450124.6666666667, ans=0.125 2023-10-13 17:25:49,157 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1450124.6666666667, ans=0.1 2023-10-13 17:25:54,507 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 17:25:56,238 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1450171.3333333333, ans=0.125 2023-10-13 17:26:05,439 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1450171.3333333333, ans=0.0 2023-10-13 17:26:11,482 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1450218.0, ans=0.125 2023-10-13 17:26:32,579 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1450311.3333333333, ans=0.0 2023-10-13 17:26:36,870 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1450311.3333333333, ans=0.125 2023-10-13 17:26:39,633 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1450311.3333333333, ans=0.09899494936611666 2023-10-13 17:26:51,561 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=1450358.0, ans=0.05 2023-10-13 17:27:08,245 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1450404.6666666667, ans=0.125 2023-10-13 17:27:30,393 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1450498.0, ans=0.2 2023-10-13 17:27:30,945 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.489e+02 1.766e+02 1.929e+02 2.042e+02 2.650e+02, threshold=3.858e+02, percent-clipped=0.0 2023-10-13 17:27:39,201 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.72 vs. limit=12.0 2023-10-13 17:27:39,276 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.90 vs. limit=15.0 2023-10-13 17:27:58,060 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1450591.3333333333, ans=0.0 2023-10-13 17:28:03,238 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.86 vs. 
limit=22.5 2023-10-13 17:28:10,013 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1450638.0, ans=0.0 2023-10-13 17:28:21,330 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1450684.6666666667, ans=0.125 2023-10-13 17:28:28,080 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1450731.3333333333, ans=0.125 2023-10-13 17:28:28,936 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1450731.3333333333, ans=0.125 2023-10-13 17:28:44,442 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1450778.0, ans=0.125 2023-10-13 17:29:18,190 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1450918.0, ans=0.125 2023-10-13 17:29:18,788 INFO [train.py:1031] (3/4) Epoch 23, batch 10500, loss[loss=0.2262, simple_loss=0.2931, pruned_loss=0.07962, over 15583.00 frames. ], tot_loss[loss=0.1884, simple_loss=0.2804, pruned_loss=0.04816, over 32628137.67 frames. ], batch size: 350, lr: 1.49e-03, grad_scale: 16.0 2023-10-13 17:29:21,596 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1450918.0, ans=0.125 2023-10-13 17:29:21,619 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1450918.0, ans=0.0 2023-10-13 17:29:26,567 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1450918.0, ans=0.125 2023-10-13 17:29:27,998 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.34 vs. limit=22.5 2023-10-13 17:29:31,513 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1450964.6666666667, ans=0.1 2023-10-13 17:29:37,034 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.537e+02 1.822e+02 2.028e+02 2.260e+02 3.465e+02, threshold=4.055e+02, percent-clipped=0.0 2023-10-13 17:29:45,568 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1451011.3333333333, ans=0.125 2023-10-13 17:30:01,795 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1451058.0, ans=0.0 2023-10-13 17:30:01,815 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1451058.0, ans=0.125 2023-10-13 17:30:07,057 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.99 vs. limit=5.0 2023-10-13 17:30:36,068 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.78 vs. 
limit=15.0 2023-10-13 17:30:52,253 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1451244.6666666667, ans=0.0 2023-10-13 17:31:29,099 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1451338.0, ans=0.125 2023-10-13 17:31:52,054 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1451431.3333333333, ans=0.125 2023-10-13 17:31:53,163 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1451431.3333333333, ans=0.2 2023-10-13 17:31:53,663 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.586e+02 1.808e+02 1.990e+02 2.117e+02 2.887e+02, threshold=3.980e+02, percent-clipped=0.0 2023-10-13 17:32:33,013 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.43 vs. limit=6.0 2023-10-13 17:32:42,446 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1451618.0, ans=0.0 2023-10-13 17:32:47,238 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1451618.0, ans=0.0 2023-10-13 17:32:54,241 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1451664.6666666667, ans=0.125 2023-10-13 17:33:05,439 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1451711.3333333333, ans=0.0 2023-10-13 17:33:06,259 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1451711.3333333333, ans=0.0 2023-10-13 17:33:07,654 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 17:33:15,041 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1451711.3333333333, ans=0.07 2023-10-13 17:33:18,833 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1451711.3333333333, ans=0.1 2023-10-13 17:33:33,545 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.43 vs. limit=15.0 2023-10-13 17:33:36,115 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1451804.6666666667, ans=0.0 2023-10-13 17:34:02,301 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.24 vs. 
limit=22.5 2023-10-13 17:34:06,360 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1451898.0, ans=0.1 2023-10-13 17:34:13,661 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.447e+02 1.796e+02 1.902e+02 2.096e+02 2.558e+02, threshold=3.804e+02, percent-clipped=0.0 2023-10-13 17:34:20,263 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1451944.6666666667, ans=0.125 2023-10-13 17:34:45,720 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.16 vs. limit=12.0 2023-10-13 17:35:01,746 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1452084.6666666667, ans=0.2 2023-10-13 17:35:22,915 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=1452131.3333333333, ans=0.07 2023-10-13 17:35:45,788 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1452224.6666666667, ans=0.125 2023-10-13 17:35:46,774 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1452224.6666666667, ans=0.125 2023-10-13 17:35:52,590 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.99 vs. limit=15.0 2023-10-13 17:36:27,221 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1452364.6666666667, ans=0.125 2023-10-13 17:36:27,687 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.583e+02 1.870e+02 2.074e+02 2.380e+02 3.317e+02, threshold=4.149e+02, percent-clipped=0.0 2023-10-13 17:36:32,005 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.46 vs. limit=15.0 2023-10-13 17:36:35,272 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1452411.3333333333, ans=0.125 2023-10-13 17:36:43,632 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.21 vs. limit=15.0 2023-10-13 17:36:45,043 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1452458.0, ans=0.0 2023-10-13 17:36:48,377 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1452458.0, ans=0.125 2023-10-13 17:36:55,530 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1452504.6666666667, ans=0.2 2023-10-13 17:37:03,064 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.12 vs. 
limit=15.0 2023-10-13 17:37:11,295 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1452551.3333333333, ans=0.1 2023-10-13 17:37:29,084 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1452598.0, ans=0.125 2023-10-13 17:37:45,152 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1452691.3333333333, ans=0.1 2023-10-13 17:38:03,502 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.06 vs. limit=15.0 2023-10-13 17:38:06,172 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1452738.0, ans=0.0 2023-10-13 17:38:08,247 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1452784.6666666667, ans=0.125 2023-10-13 17:38:08,407 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1452784.6666666667, ans=0.2 2023-10-13 17:38:34,608 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.408e+02 1.713e+02 1.903e+02 2.158e+02 3.027e+02, threshold=3.805e+02, percent-clipped=0.0 2023-10-13 17:38:54,656 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.45 vs. limit=15.0 2023-10-13 17:39:02,658 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1452971.3333333333, ans=0.2 2023-10-13 17:39:08,047 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1452971.3333333333, ans=0.0 2023-10-13 17:39:42,166 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1453111.3333333333, ans=0.1 2023-10-13 17:39:45,867 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1453111.3333333333, ans=0.1 2023-10-13 17:40:19,919 INFO [train.py:1031] (3/4) Epoch 23, batch 11000, loss[loss=0.1911, simple_loss=0.291, pruned_loss=0.04563, over 16929.00 frames. ], tot_loss[loss=0.1883, simple_loss=0.2803, pruned_loss=0.04814, over 32671024.21 frames. 
], batch size: 82, lr: 1.49e-03, grad_scale: 16.0 2023-10-13 17:40:34,683 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 17:40:40,748 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.542e+02 1.737e+02 1.956e+02 2.114e+02 2.751e+02, threshold=3.913e+02, percent-clipped=0.0 2023-10-13 17:40:45,144 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 17:40:58,446 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1453391.3333333333, ans=0.2 2023-10-13 17:41:07,169 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1453391.3333333333, ans=0.05 2023-10-13 17:41:31,652 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.78 vs. limit=6.0 2023-10-13 17:41:41,656 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1453531.3333333333, ans=0.125 2023-10-13 17:41:45,519 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1453531.3333333333, ans=0.0 2023-10-13 17:42:01,659 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1453624.6666666667, ans=0.1 2023-10-13 17:42:17,731 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=1453671.3333333333, ans=0.5 2023-10-13 17:42:54,981 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.468e+02 1.794e+02 1.906e+02 2.126e+02 2.639e+02, threshold=3.811e+02, percent-clipped=0.0 2023-10-13 17:42:57,749 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1453811.3333333333, ans=0.0 2023-10-13 17:43:05,037 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1453811.3333333333, ans=0.125 2023-10-13 17:43:18,860 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1453858.0, ans=0.0 2023-10-13 17:43:31,343 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 17:43:52,328 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1453998.0, ans=0.0 2023-10-13 17:43:57,178 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1453998.0, ans=0.0 2023-10-13 17:44:13,803 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1454044.6666666667, ans=0.0 2023-10-13 17:45:02,640 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1454231.3333333333, ans=0.125 2023-10-13 17:45:06,271 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.405e+02 1.722e+02 1.876e+02 2.078e+02 2.779e+02, threshold=3.751e+02, percent-clipped=0.0 2023-10-13 17:45:21,112 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, 
batch_count=1454324.6666666667, ans=0.125 2023-10-13 17:45:29,504 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 17:45:29,932 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.53 vs. limit=8.0 2023-10-13 17:45:32,510 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1454371.3333333333, ans=0.125 2023-10-13 17:45:45,345 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.05 vs. limit=15.0 2023-10-13 17:46:05,512 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1454464.6666666667, ans=0.125 2023-10-13 17:46:17,909 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1454511.3333333333, ans=0.035 2023-10-13 17:46:31,555 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1454558.0, ans=0.5 2023-10-13 17:46:36,190 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1454558.0, ans=0.2 2023-10-13 17:47:24,935 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.440e+02 1.783e+02 1.972e+02 2.199e+02 2.949e+02, threshold=3.944e+02, percent-clipped=0.0 2023-10-13 17:47:41,809 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1454791.3333333333, ans=0.125 2023-10-13 17:47:44,166 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1454791.3333333333, ans=0.2 2023-10-13 17:47:50,354 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.77 vs. limit=15.0 2023-10-13 17:47:52,602 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1454838.0, ans=0.1 2023-10-13 17:47:54,456 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1454838.0, ans=0.125 2023-10-13 17:49:02,161 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1455071.3333333333, ans=0.0 2023-10-13 17:49:03,318 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.80 vs. 
limit=15.0 2023-10-13 17:49:31,661 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.480e+02 1.771e+02 1.956e+02 2.311e+02 3.174e+02, threshold=3.912e+02, percent-clipped=0.0 2023-10-13 17:49:36,605 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1455211.3333333333, ans=0.125 2023-10-13 17:49:43,693 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1455211.3333333333, ans=10.0 2023-10-13 17:49:47,963 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1455211.3333333333, ans=0.125 2023-10-13 17:50:28,814 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1455351.3333333333, ans=0.2 2023-10-13 17:50:51,809 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1455444.6666666667, ans=0.125 2023-10-13 17:50:53,826 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1455491.3333333333, ans=0.1 2023-10-13 17:50:55,089 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1455491.3333333333, ans=0.125 2023-10-13 17:51:19,144 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1455584.6666666667, ans=0.1 2023-10-13 17:51:19,849 INFO [train.py:1031] (3/4) Epoch 23, batch 11500, loss[loss=0.196, simple_loss=0.298, pruned_loss=0.04702, over 16912.00 frames. ], tot_loss[loss=0.1883, simple_loss=0.2803, pruned_loss=0.04811, over 32738038.12 frames. ], batch size: 165, lr: 1.49e-03, grad_scale: 32.0 2023-10-13 17:51:41,753 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.588e+02 1.944e+02 2.132e+02 2.376e+02 3.898e+02, threshold=4.264e+02, percent-clipped=0.0 2023-10-13 17:51:48,054 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1455678.0, ans=0.025 2023-10-13 17:51:49,378 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.32 vs. 
limit=22.5 2023-10-13 17:52:16,616 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1455771.3333333333, ans=0.2 2023-10-13 17:52:40,509 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1455864.6666666667, ans=0.0 2023-10-13 17:52:48,542 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1455911.3333333333, ans=0.2 2023-10-13 17:53:04,724 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 17:53:07,808 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1455958.0, ans=0.125 2023-10-13 17:53:07,860 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1455958.0, ans=0.125 2023-10-13 17:53:18,287 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.56 vs. limit=10.0 2023-10-13 17:53:25,617 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1456004.6666666667, ans=0.2 2023-10-13 17:53:37,423 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1456051.3333333333, ans=0.125 2023-10-13 17:53:53,035 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.556e+02 1.812e+02 1.990e+02 2.231e+02 3.222e+02, threshold=3.980e+02, percent-clipped=0.0 2023-10-13 17:53:55,182 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1456144.6666666667, ans=0.0 2023-10-13 17:53:56,058 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1456144.6666666667, ans=0.2 2023-10-13 17:54:26,268 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1456238.0, ans=0.125 2023-10-13 17:55:23,096 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1456471.3333333333, ans=0.0 2023-10-13 17:55:47,498 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1456564.6666666667, ans=0.1 2023-10-13 17:55:52,295 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.515e+02 1.817e+02 1.994e+02 2.287e+02 3.245e+02, threshold=3.988e+02, percent-clipped=0.0 2023-10-13 17:56:06,897 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.90 vs. 
limit=6.0 2023-10-13 17:56:11,957 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1456658.0, ans=0.0 2023-10-13 17:56:44,144 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1456751.3333333333, ans=0.125 2023-10-13 17:56:44,151 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1456751.3333333333, ans=0.125 2023-10-13 17:56:54,683 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1456798.0, ans=0.1 2023-10-13 17:57:11,748 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1456844.6666666667, ans=0.1 2023-10-13 17:57:30,770 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1456891.3333333333, ans=0.0 2023-10-13 17:57:30,818 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1456891.3333333333, ans=0.2 2023-10-13 17:57:52,606 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1456984.6666666667, ans=0.2 2023-10-13 17:57:54,799 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.36 vs. limit=15.0 2023-10-13 17:57:58,674 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1456984.6666666667, ans=0.125 2023-10-13 17:58:11,244 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.492e+02 1.726e+02 1.960e+02 2.165e+02 3.007e+02, threshold=3.919e+02, percent-clipped=0.0 2023-10-13 17:58:50,533 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1457171.3333333333, ans=0.2 2023-10-13 17:58:52,026 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1457218.0, ans=0.125 2023-10-13 17:58:58,808 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1457218.0, ans=0.04949747468305833 2023-10-13 17:59:09,811 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1457264.6666666667, ans=0.125 2023-10-13 17:59:25,927 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1457311.3333333333, ans=0.0 2023-10-13 17:59:32,082 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1457358.0, ans=0.0 2023-10-13 17:59:44,806 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1457404.6666666667, ans=0.0 2023-10-13 17:59:46,929 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1457404.6666666667, ans=0.125 2023-10-13 17:59:47,025 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1457404.6666666667, ans=0.125 2023-10-13 17:59:52,294 INFO [scaling.py:979] (3/4) Whitening: 
name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.74 vs. limit=6.0 2023-10-13 18:00:05,586 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1457451.3333333333, ans=0.125 2023-10-13 18:00:20,835 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.399e+02 1.776e+02 1.885e+02 2.090e+02 2.844e+02, threshold=3.770e+02, percent-clipped=0.0 2023-10-13 18:00:22,228 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1457498.0, ans=0.1 2023-10-13 18:00:41,930 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1457591.3333333333, ans=0.05 2023-10-13 18:00:49,592 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1457638.0, ans=0.125 2023-10-13 18:01:08,150 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1457684.6666666667, ans=0.0 2023-10-13 18:01:31,543 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1457731.3333333333, ans=0.125 2023-10-13 18:02:09,422 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1457871.3333333333, ans=0.0 2023-10-13 18:02:16,805 INFO [train.py:1031] (3/4) Epoch 23, batch 12000, loss[loss=0.1845, simple_loss=0.2805, pruned_loss=0.0442, over 16921.00 frames. ], tot_loss[loss=0.1881, simple_loss=0.2803, pruned_loss=0.04792, over 32771474.25 frames. 
], batch size: 110, lr: 1.49e-03, grad_scale: 32.0 2023-10-13 18:02:37,916 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=1457964.6666666667, ans=0.07 2023-10-13 18:02:42,542 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1457964.6666666667, ans=0.0 2023-10-13 18:02:43,397 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.440e+02 1.821e+02 1.990e+02 2.270e+02 2.834e+02, threshold=3.980e+02, percent-clipped=0.0 2023-10-13 18:02:55,724 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1458011.3333333333, ans=0.0 2023-10-13 18:03:04,857 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1458058.0, ans=0.1 2023-10-13 18:03:25,341 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1458151.3333333333, ans=0.0 2023-10-13 18:03:56,978 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1458244.6666666667, ans=0.2 2023-10-13 18:04:06,261 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1458291.3333333333, ans=0.09899494936611666 2023-10-13 18:04:13,348 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1458291.3333333333, ans=0.07 2023-10-13 18:04:17,388 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1458338.0, ans=0.2 2023-10-13 18:04:44,558 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.61 vs. limit=6.0 2023-10-13 18:04:51,379 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.472e+02 1.690e+02 1.823e+02 2.028e+02 3.202e+02, threshold=3.646e+02, percent-clipped=0.0 2023-10-13 18:05:23,539 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1458571.3333333333, ans=0.125 2023-10-13 18:05:24,677 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.83 vs. 
limit=15.0 2023-10-13 18:05:32,026 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1458618.0, ans=0.2 2023-10-13 18:05:39,123 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1458618.0, ans=0.1 2023-10-13 18:05:48,442 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1458664.6666666667, ans=0.1 2023-10-13 18:06:28,750 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1458804.6666666667, ans=0.0 2023-10-13 18:06:38,882 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1458851.3333333333, ans=0.0 2023-10-13 18:06:48,162 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1458898.0, ans=0.2 2023-10-13 18:06:54,011 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1458898.0, ans=0.125 2023-10-13 18:06:54,564 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.459e+02 1.777e+02 1.943e+02 2.086e+02 2.933e+02, threshold=3.886e+02, percent-clipped=0.0 2023-10-13 18:07:06,108 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1458944.6666666667, ans=0.125 2023-10-13 18:07:12,274 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1458991.3333333333, ans=0.125 2023-10-13 18:07:39,518 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1459084.6666666667, ans=0.0 2023-10-13 18:08:00,128 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1459131.3333333333, ans=0.125 2023-10-13 18:08:01,248 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1459178.0, ans=0.125 2023-10-13 18:08:08,181 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1459178.0, ans=0.0 2023-10-13 18:08:10,308 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1459178.0, ans=0.125 2023-10-13 18:08:13,470 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.75 vs. 
limit=10.0 2023-10-13 18:08:18,288 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1459224.6666666667, ans=0.0 2023-10-13 18:08:48,361 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1459318.0, ans=0.125 2023-10-13 18:08:49,495 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1459364.6666666667, ans=0.0 2023-10-13 18:08:59,580 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1459364.6666666667, ans=0.0 2023-10-13 18:08:59,774 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=10.73 vs. limit=22.5 2023-10-13 18:09:00,126 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.488e+02 1.840e+02 2.088e+02 2.354e+02 3.662e+02, threshold=4.176e+02, percent-clipped=0.0 2023-10-13 18:09:11,394 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.50 vs. limit=10.0 2023-10-13 18:09:23,415 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1459458.0, ans=0.125 2023-10-13 18:09:24,871 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.85 vs. limit=22.5 2023-10-13 18:09:39,920 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_positive, batch_count=1459551.3333333333, ans=0.05 2023-10-13 18:09:42,263 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1459551.3333333333, ans=0.0 2023-10-13 18:09:42,370 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1459551.3333333333, ans=0.125 2023-10-13 18:09:44,084 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1459551.3333333333, ans=0.0 2023-10-13 18:10:01,107 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1459598.0, ans=0.07 2023-10-13 18:10:03,230 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1459598.0, ans=0.0 2023-10-13 18:10:13,682 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.26 vs. 
limit=10.0 2023-10-13 18:10:22,810 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1459691.3333333333, ans=0.125 2023-10-13 18:10:23,894 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1459691.3333333333, ans=0.5 2023-10-13 18:10:52,249 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1459784.6666666667, ans=0.125 2023-10-13 18:10:58,765 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1459831.3333333333, ans=0.125 2023-10-13 18:11:07,061 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.565e+02 1.812e+02 1.961e+02 2.176e+02 3.430e+02, threshold=3.922e+02, percent-clipped=0.0 2023-10-13 18:11:14,739 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1459878.0, ans=0.015 2023-10-13 18:11:51,886 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1460018.0, ans=0.125 2023-10-13 18:12:05,131 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.90 vs. limit=15.0 2023-10-13 18:12:06,830 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1460064.6666666667, ans=0.0 2023-10-13 18:12:09,753 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1460064.6666666667, ans=0.125 2023-10-13 18:12:21,966 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=1460111.3333333333, ans=0.05 2023-10-13 18:12:34,545 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1460158.0, ans=10.0 2023-10-13 18:12:46,342 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1460204.6666666667, ans=0.125 2023-10-13 18:12:49,667 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.45 vs. limit=22.5 2023-10-13 18:12:52,044 INFO [train.py:1031] (3/4) Epoch 23, batch 12500, loss[loss=0.1913, simple_loss=0.2851, pruned_loss=0.04881, over 16929.00 frames. ], tot_loss[loss=0.1879, simple_loss=0.2801, pruned_loss=0.04787, over 32777924.13 frames. 
], batch size: 123, lr: 1.48e-03, grad_scale: 32.0 2023-10-13 18:12:52,585 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1460251.3333333333, ans=0.1 2023-10-13 18:12:58,870 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1460251.3333333333, ans=0.125 2023-10-13 18:13:04,946 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1460298.0, ans=0.0 2023-10-13 18:13:15,340 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.462e+02 1.739e+02 1.875e+02 2.029e+02 2.568e+02, threshold=3.749e+02, percent-clipped=0.0 2023-10-13 18:13:19,405 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1460344.6666666667, ans=0.2 2023-10-13 18:14:18,866 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1460578.0, ans=0.125 2023-10-13 18:14:47,623 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1460671.3333333333, ans=0.0 2023-10-13 18:15:06,104 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1460764.6666666667, ans=0.1 2023-10-13 18:15:07,116 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1460764.6666666667, ans=0.125 2023-10-13 18:15:14,723 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.539e+02 1.810e+02 1.919e+02 2.200e+02 3.662e+02, threshold=3.838e+02, percent-clipped=0.0 2023-10-13 18:15:19,528 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=1460811.3333333333, ans=0.95 2023-10-13 18:15:26,048 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.26 vs. limit=10.0 2023-10-13 18:15:39,951 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1460904.6666666667, ans=0.0 2023-10-13 18:16:13,170 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1460998.0, ans=0.125 2023-10-13 18:16:13,255 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=4.148e-02 2023-10-13 18:16:21,643 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1461044.6666666667, ans=0.125 2023-10-13 18:16:40,025 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.11 vs. 
limit=15.0 2023-10-13 18:16:52,655 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1461138.0, ans=0.125 2023-10-13 18:16:59,535 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1461184.6666666667, ans=0.125 2023-10-13 18:17:14,663 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.73 vs. limit=10.0 2023-10-13 18:17:17,187 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.33 vs. limit=10.0 2023-10-13 18:17:19,866 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.476e+02 1.782e+02 1.997e+02 2.272e+02 3.404e+02, threshold=3.993e+02, percent-clipped=0.0 2023-10-13 18:17:20,209 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1461278.0, ans=0.0 2023-10-13 18:17:24,456 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1461278.0, ans=0.1 2023-10-13 18:17:54,326 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1461371.3333333333, ans=0.0 2023-10-13 18:17:59,938 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.61 vs. limit=10.0 2023-10-13 18:18:04,939 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1461418.0, ans=0.1 2023-10-13 18:18:32,234 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1461558.0, ans=0.125 2023-10-13 18:18:41,606 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1461558.0, ans=0.125 2023-10-13 18:18:41,678 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1461558.0, ans=0.125 2023-10-13 18:18:44,878 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1461604.6666666667, ans=0.0 2023-10-13 18:18:48,601 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1461604.6666666667, ans=0.1 2023-10-13 18:19:10,005 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1461698.0, ans=0.2 2023-10-13 18:19:18,930 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.492e+02 1.780e+02 2.022e+02 2.250e+02 3.099e+02, threshold=4.044e+02, percent-clipped=0.0 2023-10-13 18:19:32,402 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.87 vs. 
limit=15.0 2023-10-13 18:19:50,100 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1461838.0, ans=0.125 2023-10-13 18:19:52,760 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1461838.0, ans=0.125 2023-10-13 18:19:56,762 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1461838.0, ans=0.2 2023-10-13 18:20:17,619 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.99 vs. limit=15.0 2023-10-13 18:20:25,660 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1461978.0, ans=0.125 2023-10-13 18:20:34,452 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1461978.0, ans=0.125 2023-10-13 18:20:34,718 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.28 vs. limit=15.0 2023-10-13 18:21:01,634 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.09 vs. limit=22.5 2023-10-13 18:21:06,636 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.31 vs. limit=15.0 2023-10-13 18:21:08,037 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1462118.0, ans=0.0 2023-10-13 18:21:21,912 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.509e+02 1.714e+02 1.898e+02 2.134e+02 3.154e+02, threshold=3.797e+02, percent-clipped=0.0 2023-10-13 18:21:24,099 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.23 vs. limit=15.0 2023-10-13 18:21:25,728 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1462211.3333333333, ans=0.1 2023-10-13 18:21:45,761 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.13 vs. 
limit=6.0 2023-10-13 18:21:47,160 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1462304.6666666667, ans=0.125 2023-10-13 18:22:12,365 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1462398.0, ans=0.125 2023-10-13 18:22:14,232 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1462398.0, ans=0.125 2023-10-13 18:22:20,559 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1462444.6666666667, ans=0.125 2023-10-13 18:22:21,741 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=1462444.6666666667, ans=0.025 2023-10-13 18:22:27,727 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 18:22:41,944 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1462538.0, ans=0.0 2023-10-13 18:22:50,110 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.79 vs. limit=15.0 2023-10-13 18:22:54,107 INFO [train.py:1031] (3/4) Epoch 23, batch 13000, loss[loss=0.2001, simple_loss=0.2873, pruned_loss=0.05646, over 16552.00 frames. ], tot_loss[loss=0.188, simple_loss=0.2804, pruned_loss=0.04779, over 32790100.48 frames. ], batch size: 56, lr: 1.48e-03, grad_scale: 32.0 2023-10-13 18:22:54,394 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1462584.6666666667, ans=0.2 2023-10-13 18:23:01,229 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.57 vs. limit=5.0 2023-10-13 18:23:06,475 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1462631.3333333333, ans=0.0 2023-10-13 18:23:08,806 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1462631.3333333333, ans=0.0 2023-10-13 18:23:09,937 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.55 vs. limit=15.0 2023-10-13 18:23:12,096 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1462631.3333333333, ans=0.0 2023-10-13 18:23:16,977 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.436e+02 1.745e+02 1.904e+02 2.101e+02 2.810e+02, threshold=3.808e+02, percent-clipped=0.0 2023-10-13 18:23:26,330 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1462678.0, ans=0.0 2023-10-13 18:23:37,653 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1462724.6666666667, ans=0.125 2023-10-13 18:23:43,140 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1462724.6666666667, ans=0.125 2023-10-13 18:23:52,382 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.09 vs. 
limit=15.0 2023-10-13 18:24:04,429 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1462818.0, ans=0.0 2023-10-13 18:24:23,073 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1462864.6666666667, ans=0.125 2023-10-13 18:24:29,518 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=1462911.3333333333, ans=15.0 2023-10-13 18:24:44,467 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.96 vs. limit=15.0 2023-10-13 18:25:21,894 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1463098.0, ans=0.0 2023-10-13 18:25:28,751 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.425e+02 1.826e+02 2.078e+02 2.393e+02 3.197e+02, threshold=4.156e+02, percent-clipped=0.0 2023-10-13 18:25:50,948 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1463191.3333333333, ans=10.0 2023-10-13 18:26:24,019 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1463331.3333333333, ans=0.125 2023-10-13 18:26:24,951 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1463331.3333333333, ans=0.1 2023-10-13 18:26:42,947 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 18:26:47,767 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1463424.6666666667, ans=0.125 2023-10-13 18:27:05,256 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.45 vs. 
limit=15.0 2023-10-13 18:27:21,326 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1463518.0, ans=0.0 2023-10-13 18:27:31,591 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1463564.6666666667, ans=0.0 2023-10-13 18:27:34,833 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1463564.6666666667, ans=0.2 2023-10-13 18:27:43,761 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.447e+02 1.774e+02 1.939e+02 2.173e+02 2.766e+02, threshold=3.879e+02, percent-clipped=0.0 2023-10-13 18:27:47,534 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1463611.3333333333, ans=0.0 2023-10-13 18:28:06,918 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1463704.6666666667, ans=0.2 2023-10-13 18:28:17,311 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1463704.6666666667, ans=0.0 2023-10-13 18:28:26,715 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1463751.3333333333, ans=0.0 2023-10-13 18:28:37,073 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1463798.0, ans=0.125 2023-10-13 18:28:51,893 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1463844.6666666667, ans=0.125 2023-10-13 18:28:59,363 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=1463891.3333333333, ans=0.07 2023-10-13 18:29:02,010 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1463891.3333333333, ans=0.025 2023-10-13 18:29:03,202 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.15 vs. limit=15.0 2023-10-13 18:29:03,318 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.16 vs. limit=10.0 2023-10-13 18:29:04,523 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.54 vs. limit=6.0 2023-10-13 18:29:25,762 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.85 vs. 
limit=10.0 2023-10-13 18:29:41,614 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=1464031.3333333333, ans=0.95 2023-10-13 18:29:48,480 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1464031.3333333333, ans=0.125 2023-10-13 18:29:50,714 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.422e+02 1.736e+02 1.970e+02 2.207e+02 3.243e+02, threshold=3.940e+02, percent-clipped=0.0 2023-10-13 18:30:05,563 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1464124.6666666667, ans=0.0 2023-10-13 18:30:13,397 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1464171.3333333333, ans=0.125 2023-10-13 18:30:29,875 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1464218.0, ans=0.1 2023-10-13 18:30:35,971 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.36 vs. limit=6.0 2023-10-13 18:30:51,182 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1464311.3333333333, ans=0.2 2023-10-13 18:31:06,284 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1464358.0, ans=0.125 2023-10-13 18:31:07,542 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1464358.0, ans=0.125 2023-10-13 18:31:08,583 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 18:31:17,220 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1464404.6666666667, ans=0.0 2023-10-13 18:31:18,251 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1464404.6666666667, ans=0.2 2023-10-13 18:31:21,954 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1464404.6666666667, ans=0.125 2023-10-13 18:31:27,376 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1464451.3333333333, ans=0.125 2023-10-13 18:31:29,160 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1464451.3333333333, ans=0.0 2023-10-13 18:31:36,360 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=14.71 vs. 
limit=22.5 2023-10-13 18:31:37,398 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 18:31:42,647 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1464498.0, ans=0.125 2023-10-13 18:31:49,076 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.520e+02 1.784e+02 1.992e+02 2.226e+02 2.997e+02, threshold=3.984e+02, percent-clipped=0.0 2023-10-13 18:32:50,839 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1464778.0, ans=0.125 2023-10-13 18:32:52,670 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1464778.0, ans=0.0 2023-10-13 18:33:04,589 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1464824.6666666667, ans=0.0 2023-10-13 18:33:07,535 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1464824.6666666667, ans=0.04949747468305833 2023-10-13 18:33:20,777 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1464871.3333333333, ans=0.125 2023-10-13 18:33:22,227 INFO [train.py:1031] (3/4) Epoch 23, batch 13500, loss[loss=0.2073, simple_loss=0.2987, pruned_loss=0.05795, over 16894.00 frames. ], tot_loss[loss=0.1875, simple_loss=0.2798, pruned_loss=0.04755, over 32846922.37 frames. ], batch size: 116, lr: 1.48e-03, grad_scale: 16.0 2023-10-13 18:33:27,743 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1464918.0, ans=0.2 2023-10-13 18:33:31,317 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.52 vs. limit=15.0 2023-10-13 18:33:43,922 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.428e+02 1.793e+02 1.924e+02 2.202e+02 3.691e+02, threshold=3.848e+02, percent-clipped=0.0 2023-10-13 18:33:47,676 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1465011.3333333333, ans=0.125 2023-10-13 18:33:54,929 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 18:34:02,760 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1465058.0, ans=0.125 2023-10-13 18:34:17,960 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1465104.6666666667, ans=0.0 2023-10-13 18:34:18,758 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1465104.6666666667, ans=0.125 2023-10-13 18:34:19,745 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1465151.3333333333, ans=0.125 2023-10-13 18:34:24,802 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1465151.3333333333, ans=0.125 2023-10-13 18:34:38,341 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.96 vs. 
limit=15.0 2023-10-13 18:34:59,709 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1465291.3333333333, ans=0.0 2023-10-13 18:35:10,343 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1465338.0, ans=0.0 2023-10-13 18:35:21,820 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1465384.6666666667, ans=0.2 2023-10-13 18:35:43,967 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.22 vs. limit=15.0 2023-10-13 18:35:45,162 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.507e+02 1.799e+02 1.943e+02 2.187e+02 3.695e+02, threshold=3.886e+02, percent-clipped=0.0 2023-10-13 18:36:50,110 INFO [train.py:1031] (3/4) Epoch 24, batch 0, loss[loss=0.174, simple_loss=0.2702, pruned_loss=0.03891, over 16187.00 frames. ], tot_loss[loss=0.174, simple_loss=0.2702, pruned_loss=0.03891, over 16187.00 frames. ], batch size: 44, lr: 1.45e-03, grad_scale: 32.0 2023-10-13 18:36:50,112 INFO [train.py:1054] (3/4) Computing validation loss 2023-10-13 18:36:58,758 INFO [train.py:1063] (3/4) Epoch 24, validation: loss=0.2142, simple_loss=0.3011, pruned_loss=0.06363, over 1020973.00 frames. 2023-10-13 18:36:58,759 INFO [train.py:1064] (3/4) Maximum memory allocated so far is 16953MB 2023-10-13 18:36:59,878 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.81 vs. limit=22.5 2023-10-13 18:37:06,106 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1465641.3333333333, ans=0.0 2023-10-13 18:37:17,294 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1465688.0, ans=0.0 2023-10-13 18:37:19,247 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1465688.0, ans=0.125 2023-10-13 18:37:21,311 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1465688.0, ans=0.2 2023-10-13 18:37:45,607 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1465781.3333333333, ans=0.0 2023-10-13 18:37:47,466 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1465828.0, ans=0.1 2023-10-13 18:37:50,360 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1465828.0, ans=0.1 2023-10-13 18:38:03,762 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1465874.6666666667, ans=0.125 2023-10-13 18:38:04,660 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-13 18:38:09,542 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1465874.6666666667, ans=0.125 2023-10-13 18:38:19,219 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1465921.3333333333, ans=0.125 2023-10-13 18:38:20,855 INFO [optim.py:471] (3/4) Clipping_scale=2.0, 
grad-norm quartiles 1.465e+02 1.806e+02 1.945e+02 2.264e+02 4.686e+02, threshold=3.890e+02, percent-clipped=3.0 2023-10-13 18:38:42,121 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.24 vs. limit=12.0 2023-10-13 18:38:43,896 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1466014.6666666667, ans=0.125 2023-10-13 18:38:51,217 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.26 vs. limit=15.0 2023-10-13 18:39:09,231 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1466108.0, ans=0.125 2023-10-13 18:39:10,213 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=4.53 vs. limit=15.0 2023-10-13 18:39:16,281 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1466154.6666666667, ans=0.1 2023-10-13 18:39:41,651 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1466248.0, ans=0.2 2023-10-13 18:39:44,101 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1466248.0, ans=0.125 2023-10-13 18:40:02,094 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1466294.6666666667, ans=0.07 2023-10-13 18:40:25,614 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.481e+02 1.727e+02 1.913e+02 2.109e+02 3.118e+02, threshold=3.826e+02, percent-clipped=0.0 2023-10-13 18:40:49,605 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1466481.3333333333, ans=0.125 2023-10-13 18:40:55,505 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=1466481.3333333333, ans=0.05 2023-10-13 18:41:03,349 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1466528.0, ans=0.2 2023-10-13 18:41:11,677 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1466574.6666666667, ans=0.0 2023-10-13 18:41:13,867 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1466574.6666666667, ans=0.125 2023-10-13 18:41:17,696 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1466574.6666666667, ans=0.07 2023-10-13 18:41:19,252 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1466574.6666666667, ans=0.125 2023-10-13 18:41:29,576 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.75 vs. limit=22.5 2023-10-13 18:41:48,271 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.54 vs. 
limit=22.5 2023-10-13 18:41:50,510 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1466714.6666666667, ans=0.125 2023-10-13 18:41:56,923 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1466714.6666666667, ans=0.125 2023-10-13 18:42:34,191 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1466854.6666666667, ans=0.0 2023-10-13 18:42:36,347 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.527e+02 1.868e+02 2.065e+02 2.377e+02 3.183e+02, threshold=4.129e+02, percent-clipped=0.0 2023-10-13 18:42:44,014 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1466901.3333333333, ans=0.1 2023-10-13 18:43:09,096 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1466994.6666666667, ans=0.0 2023-10-13 18:43:31,868 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1467041.3333333333, ans=0.0 2023-10-13 18:43:49,194 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.11 vs. limit=6.0 2023-10-13 18:43:52,669 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1467134.6666666667, ans=0.025 2023-10-13 18:43:57,026 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1467134.6666666667, ans=0.2 2023-10-13 18:44:04,662 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 18:44:10,311 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1467181.3333333333, ans=0.0 2023-10-13 18:44:13,803 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1467181.3333333333, ans=0.125 2023-10-13 18:44:16,553 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1467228.0, ans=0.0 2023-10-13 18:44:25,805 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_ff3.min_abs, batch_count=1467228.0, ans=0.2 2023-10-13 18:44:39,687 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.50 vs. limit=15.0 2023-10-13 18:44:52,771 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1467321.3333333333, ans=0.2 2023-10-13 18:44:54,752 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.350e+02 1.781e+02 1.937e+02 2.175e+02 2.924e+02, threshold=3.875e+02, percent-clipped=0.0 2023-10-13 18:44:56,802 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1467321.3333333333, ans=0.2 2023-10-13 18:45:00,903 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.60 vs. 
limit=15.0 2023-10-13 18:45:05,408 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.18 vs. limit=15.0 2023-10-13 18:45:06,197 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1467368.0, ans=0.1 2023-10-13 18:45:37,504 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1467508.0, ans=0.125 2023-10-13 18:45:38,501 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1467508.0, ans=0.125 2023-10-13 18:45:50,183 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.41 vs. limit=22.5 2023-10-13 18:45:50,238 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.46 vs. limit=15.0 2023-10-13 18:46:13,200 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.59 vs. limit=15.0 2023-10-13 18:46:21,046 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1467694.6666666667, ans=0.1 2023-10-13 18:46:39,936 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.00 vs. limit=15.0 2023-10-13 18:46:46,412 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1467741.3333333333, ans=0.0 2023-10-13 18:46:57,799 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.470e+02 1.787e+02 1.967e+02 2.229e+02 3.511e+02, threshold=3.935e+02, percent-clipped=0.0 2023-10-13 18:47:19,602 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1467881.3333333333, ans=0.025 2023-10-13 18:47:35,479 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1467928.0, ans=0.0 2023-10-13 18:47:38,661 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.89 vs. limit=15.0 2023-10-13 18:47:40,046 INFO [train.py:1031] (3/4) Epoch 24, batch 500, loss[loss=0.1627, simple_loss=0.2561, pruned_loss=0.03463, over 16834.00 frames. ], tot_loss[loss=0.1874, simple_loss=0.2797, pruned_loss=0.04752, over 7303860.20 frames. 
], batch size: 116, lr: 1.45e-03, grad_scale: 16.0 2023-10-13 18:47:42,291 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1467974.6666666667, ans=0.125 2023-10-13 18:47:52,559 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1468021.3333333333, ans=0.0 2023-10-13 18:48:01,290 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1468021.3333333333, ans=0.125 2023-10-13 18:48:11,399 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1468068.0, ans=0.125 2023-10-13 18:48:21,850 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1468114.6666666667, ans=0.125 2023-10-13 18:48:54,382 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=10.62 vs. limit=12.0 2023-10-13 18:49:02,511 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.467e+02 1.802e+02 2.023e+02 2.284e+02 3.778e+02, threshold=4.047e+02, percent-clipped=0.0 2023-10-13 18:49:17,583 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1468348.0, ans=0.1 2023-10-13 18:49:40,360 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1468441.3333333333, ans=0.015 2023-10-13 18:49:51,717 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1468441.3333333333, ans=0.0 2023-10-13 18:49:58,035 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1468488.0, ans=0.125 2023-10-13 18:50:09,111 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1468534.6666666667, ans=0.1 2023-10-13 18:50:21,731 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.71 vs. limit=22.5 2023-10-13 18:50:25,281 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1468581.3333333333, ans=0.1 2023-10-13 18:50:51,682 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1468721.3333333333, ans=0.0 2023-10-13 18:50:57,915 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.63 vs. limit=15.0 2023-10-13 18:51:01,894 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.614e+02 1.816e+02 1.990e+02 2.224e+02 3.389e+02, threshold=3.980e+02, percent-clipped=0.0 2023-10-13 18:51:26,085 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1468814.6666666667, ans=0.0 2023-10-13 18:51:45,831 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.38 vs. 
limit=6.0 2023-10-13 18:51:54,516 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=1468954.6666666667, ans=0.05 2023-10-13 18:52:08,641 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=3.79 vs. limit=15.0 2023-10-13 18:52:19,525 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1469048.0, ans=0.125 2023-10-13 18:52:24,489 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1469048.0, ans=0.125 2023-10-13 18:52:44,834 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.14 vs. limit=12.0 2023-10-13 18:53:02,230 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.79 vs. limit=6.0 2023-10-13 18:53:06,449 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1469188.0, ans=0.125 2023-10-13 18:53:07,028 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.514e+02 1.816e+02 1.990e+02 2.236e+02 2.946e+02, threshold=3.980e+02, percent-clipped=0.0 2023-10-13 18:53:18,641 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1469234.6666666667, ans=0.1 2023-10-13 18:53:23,215 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1469281.3333333333, ans=0.0 2023-10-13 18:53:29,088 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1469281.3333333333, ans=0.0 2023-10-13 18:53:35,363 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1469328.0, ans=0.125 2023-10-13 18:53:37,522 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1469328.0, ans=0.05 2023-10-13 18:53:55,284 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.46 vs. 
limit=10.0 2023-10-13 18:53:56,688 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1469374.6666666667, ans=0.125 2023-10-13 18:54:06,038 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1469421.3333333333, ans=0.2 2023-10-13 18:54:33,802 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1469514.6666666667, ans=0.125 2023-10-13 18:54:33,882 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=1469514.6666666667, ans=0.5 2023-10-13 18:54:41,909 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1469514.6666666667, ans=0.0 2023-10-13 18:54:52,610 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1469561.3333333333, ans=0.05 2023-10-13 18:54:55,282 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1469561.3333333333, ans=0.0 2023-10-13 18:54:55,407 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1469561.3333333333, ans=0.2 2023-10-13 18:54:55,429 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1469561.3333333333, ans=0.125 2023-10-13 18:55:00,063 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1469608.0, ans=0.2 2023-10-13 18:55:04,673 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1469608.0, ans=0.125 2023-10-13 18:55:12,371 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1469608.0, ans=0.1 2023-10-13 18:55:16,686 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1469654.6666666667, ans=0.0 2023-10-13 18:55:24,558 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.539e+02 1.782e+02 1.952e+02 2.079e+02 2.989e+02, threshold=3.904e+02, percent-clipped=0.0 2023-10-13 18:55:24,948 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1469654.6666666667, ans=0.0 2023-10-13 18:56:04,994 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1469841.3333333333, ans=0.0 2023-10-13 18:56:05,180 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.24 vs. limit=22.5 2023-10-13 18:56:06,041 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.74 vs. limit=15.0 2023-10-13 18:56:15,946 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.71 vs. limit=15.0 2023-10-13 18:56:41,212 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.84 vs. 
limit=15.0 2023-10-13 18:57:24,416 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=5.77 vs. limit=15.0 2023-10-13 18:57:26,358 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.443e+02 1.783e+02 1.926e+02 2.116e+02 2.800e+02, threshold=3.851e+02, percent-clipped=0.0 2023-10-13 18:57:43,424 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1470214.6666666667, ans=0.1 2023-10-13 18:58:02,698 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.10 vs. limit=15.0 2023-10-13 18:58:06,552 INFO [train.py:1031] (3/4) Epoch 24, batch 1000, loss[loss=0.179, simple_loss=0.2835, pruned_loss=0.0372, over 16895.00 frames. ], tot_loss[loss=0.1889, simple_loss=0.2808, pruned_loss=0.04846, over 12931694.50 frames. ], batch size: 87, lr: 1.45e-03, grad_scale: 16.0 2023-10-13 18:59:12,888 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1470588.0, ans=0.125 2023-10-13 18:59:14,428 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1470588.0, ans=0.125 2023-10-13 18:59:17,463 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1470588.0, ans=0.125 2023-10-13 18:59:20,484 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.496e+02 1.824e+02 2.058e+02 2.355e+02 3.143e+02, threshold=4.115e+02, percent-clipped=0.0 2023-10-13 18:59:27,567 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff2.min_abs, batch_count=1470634.6666666667, ans=0.1 2023-10-13 18:59:58,055 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1470774.6666666667, ans=0.125 2023-10-13 19:00:00,901 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1470774.6666666667, ans=0.1 2023-10-13 19:00:16,229 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1470821.3333333333, ans=0.125 2023-10-13 19:00:16,444 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=6.18 vs. limit=15.0 2023-10-13 19:00:27,020 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.99 vs. limit=15.0 2023-10-13 19:00:30,411 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1470868.0, ans=0.0 2023-10-13 19:00:33,859 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1470914.6666666667, ans=0.125 2023-10-13 19:00:58,269 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.01 vs. 
limit=22.5 2023-10-13 19:01:00,605 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1471008.0, ans=0.0 2023-10-13 19:01:01,056 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.71 vs. limit=15.0 2023-10-13 19:01:02,156 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1471008.0, ans=0.0 2023-10-13 19:01:21,043 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.441e+02 1.814e+02 2.002e+02 2.279e+02 3.571e+02, threshold=4.005e+02, percent-clipped=0.0 2023-10-13 19:01:21,823 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1471054.6666666667, ans=0.0 2023-10-13 19:01:42,042 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1471148.0, ans=0.025 2023-10-13 19:01:45,052 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1471148.0, ans=0.125 2023-10-13 19:01:54,452 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.13 vs. limit=15.0 2023-10-13 19:01:57,679 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.13 vs. limit=22.5 2023-10-13 19:02:09,205 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1471241.3333333333, ans=0.125 2023-10-13 19:02:16,251 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1471288.0, ans=0.09899494936611666 2023-10-13 19:02:16,261 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1471288.0, ans=0.125 2023-10-13 19:02:22,553 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.87 vs. 
limit=15.0 2023-10-13 19:02:34,319 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1471334.6666666667, ans=0.0 2023-10-13 19:03:21,797 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1471521.3333333333, ans=0.2 2023-10-13 19:03:22,455 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.440e+02 1.750e+02 1.969e+02 2.177e+02 2.925e+02, threshold=3.938e+02, percent-clipped=0.0 2023-10-13 19:03:30,293 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 19:04:00,445 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1471708.0, ans=0.125 2023-10-13 19:04:31,149 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1471848.0, ans=0.125 2023-10-13 19:04:54,076 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1471941.3333333333, ans=0.125 2023-10-13 19:04:57,760 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1471941.3333333333, ans=0.1 2023-10-13 19:05:04,704 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1471988.0, ans=0.0 2023-10-13 19:05:08,919 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.532e+02 1.803e+02 2.027e+02 2.199e+02 3.657e+02, threshold=4.055e+02, percent-clipped=0.0 2023-10-13 19:05:10,857 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.72 vs. limit=15.0 2023-10-13 19:05:15,828 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1472034.6666666667, ans=0.125 2023-10-13 19:05:37,389 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1472128.0, ans=0.2 2023-10-13 19:05:46,238 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1472128.0, ans=0.125 2023-10-13 19:05:51,568 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.42 vs. limit=15.0 2023-10-13 19:05:52,606 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1472174.6666666667, ans=0.2 2023-10-13 19:05:57,739 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.86 vs. 
limit=15.0 2023-10-13 19:06:20,008 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 19:06:42,925 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1472361.3333333333, ans=0.0 2023-10-13 19:06:57,838 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1472408.0, ans=0.0 2023-10-13 19:07:07,518 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 19:07:10,108 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1472454.6666666667, ans=0.2 2023-10-13 19:07:14,591 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.433e+02 1.785e+02 1.963e+02 2.244e+02 3.189e+02, threshold=3.927e+02, percent-clipped=0.0 2023-10-13 19:07:16,903 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1472501.3333333333, ans=0.0 2023-10-13 19:07:36,451 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 19:07:51,915 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1472594.6666666667, ans=0.2 2023-10-13 19:07:51,967 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1472594.6666666667, ans=0.1 2023-10-13 19:07:55,291 INFO [train.py:1031] (3/4) Epoch 24, batch 1500, loss[loss=0.1654, simple_loss=0.253, pruned_loss=0.03884, over 16280.00 frames. ], tot_loss[loss=0.1872, simple_loss=0.2792, pruned_loss=0.04764, over 17336520.05 frames. ], batch size: 44, lr: 1.45e-03, grad_scale: 32.0 2023-10-13 19:07:59,388 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.89 vs. limit=15.0 2023-10-13 19:08:16,172 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1472688.0, ans=0.125 2023-10-13 19:08:45,764 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1472828.0, ans=0.125 2023-10-13 19:08:47,576 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 19:09:06,047 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1472921.3333333333, ans=0.0 2023-10-13 19:09:15,018 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.470e+02 1.860e+02 1.973e+02 2.294e+02 2.782e+02, threshold=3.947e+02, percent-clipped=0.0 2023-10-13 19:09:17,472 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.74 vs. limit=15.0 2023-10-13 19:09:19,130 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.02 vs. limit=15.0 2023-10-13 19:09:34,031 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.32 vs. 
limit=12.0
2023-10-13 19:09:36,548 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1473014.6666666667, ans=0.0
2023-10-13 19:09:52,963 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.89 vs. limit=22.5
2023-10-13 19:10:27,942 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1473248.0, ans=0.125
2023-10-13 19:10:46,416 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1473294.6666666667, ans=0.2
2023-10-13 19:10:56,893 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff2.min_abs, batch_count=1473341.3333333333, ans=0.1
2023-10-13 19:11:20,410 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.463e+02 1.772e+02 1.963e+02 2.196e+02 3.044e+02, threshold=3.926e+02, percent-clipped=0.0
2023-10-13 19:11:46,280 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1473528.0, ans=0.125
2023-10-13 19:11:52,738 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1473574.6666666667, ans=0.0
2023-10-13 19:11:59,368 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.45 vs. limit=22.5
2023-10-13 19:12:03,938 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1473621.3333333333, ans=0.125
2023-10-13 19:12:15,575 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1473668.0, ans=0.125
2023-10-13 19:12:20,342 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1473668.0, ans=10.0
2023-10-13 19:12:28,891 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1473714.6666666667, ans=0.125
2023-10-13 19:12:34,067 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1473714.6666666667, ans=0.09899494936611666
2023-10-13 19:12:45,021 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1473761.3333333333, ans=0.125
2023-10-13 19:13:09,368 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.511e+02 1.838e+02 1.969e+02 2.181e+02 3.183e+02, threshold=3.939e+02, percent-clipped=0.0
2023-10-13 19:13:18,941 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1473901.3333333333, ans=0.2
2023-10-13 19:13:33,871 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1473948.0, ans=0.125
2023-10-13 19:13:50,433 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1474041.3333333333, ans=0.125
2023-10-13 19:13:53,392 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_na.min_abs, batch_count=1474041.3333333333, ans=0.02
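The [scaling.py:199] ScheduledFloat entries that dominate this log each report the current value (ans) of a hyperparameter scheduled as a function of batch_count: skip rates that have annealed to 0.0, balancer probabilities holding at 0.125, bypass scale_min at 0.2, dropout_p at 0.1, and so on; even whitening limits are scheduled (nonlin_attention.whiten1.whitening_limit, ans=10.0 above). A toy sketch of such a schedule, assuming piecewise-linear interpolation between (batch_count, value) breakpoints; the class name and the breakpoints below are illustrative, not taken from scaling.py:

    from bisect import bisect_right

    class PiecewiseLinearFloat:
        # A float-valued hyperparameter scheduled over training batch count.
        def __init__(self, name: str, points: list[tuple[float, float]]):
            self.name = name
            self.points = sorted(points)  # (batch_count, value) breakpoints

        def value(self, batch_count: float) -> float:
            xs = [x for x, _ in self.points]
            i = bisect_right(xs, batch_count)
            if i == 0:
                return self.points[0][1]      # before the first breakpoint
            if i == len(self.points):
                return self.points[-1][1]     # past the last breakpoint
            (x0, y0), (x1, y1) = self.points[i - 1], self.points[i]
            return y0 + (batch_count - x0) / (x1 - x0) * (y1 - y0)

    # A skip rate annealed to zero early in training would report ans=0.0 at the
    # batch counts (~1.47M) seen in this section, as the ff2/ff3_skip_rate entries do:
    ff3_skip = PiecewiseLinearFloat("ff3_skip_rate", [(0.0, 0.5), (20000.0, 0.0)])
    assert ff3_skip.value(1473014.67) == 0.0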
2023-10-13 19:13:56,591 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1474041.3333333333, ans=0.95
2023-10-13 19:14:18,761 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1474134.6666666667, ans=0.125
2023-10-13 19:14:28,583 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1474181.3333333333, ans=0.2
2023-10-13 19:14:53,693 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1474228.0, ans=0.1
2023-10-13 19:15:07,555 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.58 vs. limit=15.0
2023-10-13 19:15:15,943 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=6.94 vs. limit=15.0
2023-10-13 19:15:18,100 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.507e+02 1.785e+02 2.024e+02 2.351e+02 3.117e+02, threshold=4.048e+02, percent-clipped=0.0
2023-10-13 19:15:40,871 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1474461.3333333333, ans=0.1
2023-10-13 19:16:01,860 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1474508.0, ans=0.0
2023-10-13 19:16:01,901 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1474508.0, ans=0.125
2023-10-13 19:16:07,594 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.32 vs. limit=15.0
2023-10-13 19:16:12,494 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1474554.6666666667, ans=0.125
2023-10-13 19:16:19,882 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=1474601.3333333333, ans=10.0
2023-10-13 19:16:29,207 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1474648.0, ans=0.1
2023-10-13 19:17:10,045 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=1474788.0, ans=10.0
2023-10-13 19:17:23,639 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1474788.0, ans=0.0
2023-10-13 19:17:25,113 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.530e+02 1.794e+02 2.001e+02 2.229e+02 3.169e+02, threshold=4.002e+02, percent-clipped=0.0
2023-10-13 19:17:25,323 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1474834.6666666667, ans=0.0
2023-10-13 19:17:31,611 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1474834.6666666667, ans=0.125
2023-10-13 19:17:46,338 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.37 vs. limit=10.0
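The [scaling.py:979] Whitening entries, like the whiten1 line just above (metric=9.37 vs. limit=10.0, that limit itself appearing as a scheduled whitening_limit earlier), record a measured whiteness statistic of a module's activations against its current limit. One plausible such statistic is the eigenvalue dispersion of the per-group feature covariance, which equals 1.0 when the covariance is perfectly white (isotropic) and grows as channels become correlated; a hedged sketch, illustrative rather than the actual scaling.py implementation:

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int) -> float:
        # x: (num_frames, num_channels); channels split into num_groups groups.
        n, c = x.shape
        x = x.reshape(n, num_groups, c // num_groups).transpose(0, 1)
        x = x - x.mean(dim=1, keepdim=True)
        cov = x.transpose(1, 2) @ x / n                             # per-group covariance
        mean_eig = torch.diagonal(cov, dim1=1, dim2=2).mean(dim=1)  # trace/dim = E[lambda]
        mean_eig_sq = (cov ** 2).sum(dim=(1, 2)) / cov.shape[-1]    # trace(C^2)/dim = E[lambda^2]
        return (mean_eig_sq / mean_eig ** 2).mean().item()          # 1.0 when perfectly white

On this reading, a metric approaching or exceeding its limit (e.g. 18.45 vs. 22.5 earlier in this section) is presumably the signal the module uses to push activations back toward a whiter covariance.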
2023-10-13 19:17:49,763 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.13 vs. limit=12.0
2023-10-13 19:18:03,034 INFO [train.py:1031] (3/4) Epoch 24, batch 2000, loss[loss=0.1837, simple_loss=0.2827, pruned_loss=0.04233, over 16893.00 frames. ], tot_loss[loss=0.1877, simple_loss=0.2798, pruned_loss=0.04781, over 20739312.59 frames. ], batch size: 130, lr: 1.44e-03, grad_scale: 32.0
2023-10-13 19:18:53,974 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1475114.6666666667, ans=0.125
2023-10-13 19:19:01,431 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1475161.3333333333, ans=0.0
2023-10-13 19:19:12,858 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1475208.0, ans=0.1
2023-10-13 19:19:19,630 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1475208.0, ans=0.1
2023-10-13 19:19:23,829 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.90 vs. limit=10.0
2023-10-13 19:19:33,194 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.12 vs. limit=15.0
2023-10-13 19:19:36,990 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1475254.6666666667, ans=0.2
2023-10-13 19:19:40,038 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.427e+02 1.703e+02 1.840e+02 2.148e+02 2.927e+02, threshold=3.680e+02, percent-clipped=0.0
2023-10-13 19:19:40,349 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1475301.3333333333, ans=0.125
2023-10-13 19:19:43,285 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1475301.3333333333, ans=0.125
2023-10-13 19:19:50,957 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1475348.0, ans=0.0
2023-10-13 19:20:46,452 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1475488.0, ans=0.5
2023-10-13 19:21:01,728 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1475534.6666666667, ans=0.0
2023-10-13 19:21:31,139 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1475628.0, ans=0.125
2023-10-13 19:21:37,589 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1475628.0, ans=0.125
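The [train.py:1031] entry for epoch 24, batch 2000 above is internally consistent with a total objective of 0.5 * simple_loss + pruned_loss: 0.5 * 0.2827 + 0.04233 = 0.1837 reproduces the per-batch loss, and 0.5 * 0.2798 + 0.04781 = 0.1877 reproduces tot_loss, whose growing 'over N frames' count identifies it as a frame-weighted running average over the epoch so far. A small sketch under those assumptions (illustrative; the 0.5 scale is inferred from the logged numbers, not read out of train.py):

    def total_loss(simple_loss: float, pruned_loss: float,
                   simple_loss_scale: float = 0.5) -> float:
        # e.g. 0.5 * 0.2827 + 0.04233 = 0.1837, matching the batch-2000 entry above.
        return simple_loss_scale * simple_loss + pruned_loss

    class FrameWeightedAverage:
        # Running average weighted by frame count ('over N frames' in the log).
        def __init__(self) -> None:
            self.weighted_sum = 0.0
            self.num_frames = 0.0

        def update(self, loss: float, frames: float) -> None:
            self.weighted_sum += loss * frames
            self.num_frames += frames

        def average(self) -> float:
            return self.weighted_sum / max(self.num_frames, 1.0)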
2023-10-13 19:21:55,961 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.91 vs. limit=15.0
2023-10-13 19:22:07,940 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1475721.3333333333, ans=0.1
2023-10-13 19:22:17,245 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.470e+02 1.866e+02 2.031e+02 2.269e+02 3.413e+02, threshold=4.062e+02, percent-clipped=0.0
2023-10-13 19:22:19,340 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1475768.0, ans=0.0
2023-10-13 19:22:23,197 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.44 vs. limit=15.0
2023-10-13 19:22:35,669 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1475814.6666666667, ans=0.0
2023-10-13 19:22:36,653 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.09 vs. limit=6.0
2023-10-13 19:23:05,030 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1475954.6666666667, ans=0.09899494936611666
2023-10-13 19:23:07,294 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1475954.6666666667, ans=0.2
2023-10-13 19:23:33,831 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.72 vs. limit=10.0
2023-10-13 19:23:34,552 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1476048.0, ans=0.125
2023-10-13 19:23:51,134 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1476141.3333333333, ans=0.125
2023-10-13 19:23:55,957 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1476141.3333333333, ans=0.0
2023-10-13 19:23:56,284 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.47 vs. limit=22.5
2023-10-13 19:24:06,790 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1476188.0, ans=0.1
2023-10-13 19:24:16,200 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.536e+02 1.848e+02 1.982e+02 2.300e+02 3.427e+02, threshold=3.964e+02, percent-clipped=0.0
2023-10-13 19:24:22,983 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1476234.6666666667, ans=0.1
2023-10-13 19:25:07,827 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1476421.3333333333, ans=0.125
2023-10-13 19:25:09,065 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.74 vs.
limit=15.0 2023-10-13 19:25:49,132 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1476608.0, ans=0.1 2023-10-13 19:26:05,985 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.55 vs. limit=15.0 2023-10-13 19:26:15,116 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.604e+02 1.818e+02 1.942e+02 2.194e+02 3.470e+02, threshold=3.884e+02, percent-clipped=0.0 2023-10-13 19:26:34,131 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.39 vs. limit=5.0 2023-10-13 19:26:37,139 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1476748.0, ans=0.125 2023-10-13 19:26:56,802 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1476841.3333333333, ans=0.0 2023-10-13 19:27:02,734 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.47 vs. limit=22.5 2023-10-13 19:27:26,349 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 19:27:30,896 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.32 vs. limit=22.5 2023-10-13 19:27:35,955 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.56 vs. limit=22.5 2023-10-13 19:27:36,606 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1477028.0, ans=0.125 2023-10-13 19:27:52,467 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.60 vs. limit=15.0 2023-10-13 19:28:10,822 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.498e+02 1.905e+02 2.042e+02 2.240e+02 3.212e+02, threshold=4.084e+02, percent-clipped=0.0 2023-10-13 19:28:42,212 INFO [train.py:1031] (3/4) Epoch 24, batch 2500, loss[loss=0.1911, simple_loss=0.282, pruned_loss=0.05009, over 16663.00 frames. ], tot_loss[loss=0.1879, simple_loss=0.2799, pruned_loss=0.04796, over 23407573.31 frames. ], batch size: 56, lr: 1.44e-03, grad_scale: 32.0 2023-10-13 19:28:42,839 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.04 vs. limit=15.0 2023-10-13 19:28:52,906 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.93 vs. limit=15.0 2023-10-13 19:28:55,671 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1477354.6666666667, ans=0.025 2023-10-13 19:28:56,757 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.41 vs. 
limit=10.0 2023-10-13 19:29:02,177 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten.whitening_limit, batch_count=1477354.6666666667, ans=22.5 2023-10-13 19:29:02,962 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1477354.6666666667, ans=0.0 2023-10-13 19:29:05,216 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1477401.3333333333, ans=0.125 2023-10-13 19:29:05,920 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1477401.3333333333, ans=0.0 2023-10-13 19:29:35,372 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.03 vs. limit=15.0 2023-10-13 19:29:39,372 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1477494.6666666667, ans=0.125 2023-10-13 19:29:50,057 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1477541.3333333333, ans=0.1 2023-10-13 19:30:00,659 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1477588.0, ans=0.125 2023-10-13 19:30:08,535 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.568e+02 1.861e+02 1.964e+02 2.131e+02 3.068e+02, threshold=3.929e+02, percent-clipped=0.0 2023-10-13 19:30:09,440 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.27 vs. limit=15.0 2023-10-13 19:30:20,553 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.65 vs. limit=22.5 2023-10-13 19:30:28,020 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1477681.3333333333, ans=0.125 2023-10-13 19:30:30,014 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1477728.0, ans=0.125 2023-10-13 19:30:42,723 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1477774.6666666667, ans=0.125 2023-10-13 19:31:30,642 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1477961.3333333333, ans=0.1 2023-10-13 19:31:31,053 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.62 vs. 
limit=6.0 2023-10-13 19:31:35,042 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1477961.3333333333, ans=0.05 2023-10-13 19:31:41,317 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_abs, batch_count=1478008.0, ans=0.5 2023-10-13 19:31:41,354 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1478008.0, ans=0.125 2023-10-13 19:31:52,501 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1478054.6666666667, ans=0.2 2023-10-13 19:32:06,687 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.564e+02 1.824e+02 1.999e+02 2.180e+02 2.959e+02, threshold=3.997e+02, percent-clipped=0.0 2023-10-13 19:32:22,149 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1478148.0, ans=0.125 2023-10-13 19:32:39,935 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1478194.6666666667, ans=0.0 2023-10-13 19:32:51,608 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1478241.3333333333, ans=0.0 2023-10-13 19:33:05,458 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1478334.6666666667, ans=0.125 2023-10-13 19:34:08,053 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1478474.6666666667, ans=0.125 2023-10-13 19:34:44,906 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.426e+02 1.785e+02 1.940e+02 2.134e+02 2.848e+02, threshold=3.881e+02, percent-clipped=0.0 2023-10-13 19:34:50,983 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1478568.0, ans=0.0 2023-10-13 19:35:00,093 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 19:35:36,644 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 19:36:30,352 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1478894.6666666667, ans=0.5 2023-10-13 19:36:36,178 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=1478894.6666666667, ans=0.05 2023-10-13 19:36:39,374 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.94 vs. limit=12.0 2023-10-13 19:36:44,068 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1478941.3333333333, ans=0.0 2023-10-13 19:36:50,993 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.29 vs. 
limit=15.0 2023-10-13 19:36:52,574 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1478941.3333333333, ans=0.125 2023-10-13 19:37:10,644 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1479034.6666666667, ans=0.125 2023-10-13 19:37:12,831 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.471e+02 1.775e+02 1.921e+02 2.192e+02 2.868e+02, threshold=3.842e+02, percent-clipped=0.0 2023-10-13 19:37:19,708 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=1479034.6666666667, ans=10.0 2023-10-13 19:37:24,877 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.79 vs. limit=15.0 2023-10-13 19:37:39,322 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1479128.0, ans=0.125 2023-10-13 19:37:42,888 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.32 vs. limit=15.0 2023-10-13 19:37:54,416 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1479174.6666666667, ans=0.125 2023-10-13 19:38:01,553 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1479221.3333333333, ans=0.1 2023-10-13 19:38:38,705 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1479314.6666666667, ans=0.0 2023-10-13 19:38:54,481 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1479408.0, ans=0.05 2023-10-13 19:38:56,691 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1479408.0, ans=0.125 2023-10-13 19:39:06,809 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1479454.6666666667, ans=0.2 2023-10-13 19:39:16,366 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=1479501.3333333333, ans=0.05 2023-10-13 19:39:19,399 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.325e+02 1.765e+02 1.985e+02 2.204e+02 3.013e+02, threshold=3.970e+02, percent-clipped=0.0 2023-10-13 19:39:51,649 INFO [train.py:1031] (3/4) Epoch 24, batch 3000, loss[loss=0.1858, simple_loss=0.2767, pruned_loss=0.04751, over 16772.00 frames. ], tot_loss[loss=0.1875, simple_loss=0.2791, pruned_loss=0.04792, over 25483415.99 frames. 
], batch size: 202, lr: 1.44e-03, grad_scale: 16.0 2023-10-13 19:39:56,759 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1479641.3333333333, ans=0.1 2023-10-13 19:40:43,545 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1479828.0, ans=0.125 2023-10-13 19:40:55,000 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1479874.6666666667, ans=0.125 2023-10-13 19:40:55,064 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1479874.6666666667, ans=0.125 2023-10-13 19:41:00,031 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.36 vs. limit=6.0 2023-10-13 19:41:07,460 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1479921.3333333333, ans=0.125 2023-10-13 19:41:11,863 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1479921.3333333333, ans=0.2 2023-10-13 19:41:13,847 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1479921.3333333333, ans=0.0 2023-10-13 19:41:19,012 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.592e+02 1.879e+02 2.110e+02 2.472e+02 3.194e+02, threshold=4.220e+02, percent-clipped=0.0 2023-10-13 19:41:22,676 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1479968.0, ans=0.0 2023-10-13 19:41:45,820 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1480061.3333333333, ans=0.2 2023-10-13 19:41:59,774 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1480108.0, ans=0.125 2023-10-13 19:42:32,017 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1480248.0, ans=0.1 2023-10-13 19:42:41,943 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1480294.6666666667, ans=0.125 2023-10-13 19:42:59,921 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1480341.3333333333, ans=0.125 2023-10-13 19:43:04,643 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1480388.0, ans=0.125 2023-10-13 19:43:06,657 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1480388.0, ans=0.125 2023-10-13 19:43:17,249 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.424e+02 1.837e+02 2.003e+02 2.305e+02 3.554e+02, threshold=4.006e+02, percent-clipped=0.0 2023-10-13 19:43:21,514 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1480434.6666666667, ans=0.0 2023-10-13 19:43:38,091 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.70 vs. 
limit=15.0 2023-10-13 19:44:11,099 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1480668.0, ans=0.125 2023-10-13 19:44:45,284 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1480761.3333333333, ans=0.2 2023-10-13 19:44:54,429 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1480808.0, ans=0.125 2023-10-13 19:45:28,343 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.575e+02 1.886e+02 2.019e+02 2.408e+02 3.585e+02, threshold=4.037e+02, percent-clipped=0.0 2023-10-13 19:45:35,510 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1480901.3333333333, ans=0.125 2023-10-13 19:45:57,742 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1480994.6666666667, ans=0.2 2023-10-13 19:45:57,921 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1480994.6666666667, ans=0.0 2023-10-13 19:46:04,498 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1480994.6666666667, ans=0.0 2023-10-13 19:46:16,486 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.82 vs. limit=10.0 2023-10-13 19:46:53,276 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1481181.3333333333, ans=0.09899494936611666 2023-10-13 19:47:07,102 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=10.26 vs. limit=22.5 2023-10-13 19:47:22,129 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1481321.3333333333, ans=0.2 2023-10-13 19:47:25,724 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1481321.3333333333, ans=0.2 2023-10-13 19:47:38,354 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.416e+02 1.824e+02 1.979e+02 2.191e+02 3.024e+02, threshold=3.959e+02, percent-clipped=0.0 2023-10-13 19:47:46,239 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 19:47:54,922 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.85 vs. limit=15.0 2023-10-13 19:48:06,545 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1481461.3333333333, ans=0.0 2023-10-13 19:48:18,096 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.43 vs. 
limit=15.0 2023-10-13 19:48:25,546 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1481554.6666666667, ans=0.0 2023-10-13 19:48:28,015 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1481554.6666666667, ans=0.1 2023-10-13 19:48:35,272 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1481554.6666666667, ans=0.1 2023-10-13 19:49:37,457 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.84 vs. limit=15.0 2023-10-13 19:49:46,779 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.390e+02 1.802e+02 1.967e+02 2.219e+02 3.023e+02, threshold=3.935e+02, percent-clipped=0.0 2023-10-13 19:49:51,812 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1481834.6666666667, ans=0.125 2023-10-13 19:49:56,142 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.48 vs. limit=15.0 2023-10-13 19:50:06,703 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1481928.0, ans=0.09899494936611666 2023-10-13 19:50:17,795 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1481928.0, ans=0.125 2023-10-13 19:50:19,469 INFO [train.py:1031] (3/4) Epoch 24, batch 3500, loss[loss=0.1928, simple_loss=0.2911, pruned_loss=0.04729, over 16859.00 frames. ], tot_loss[loss=0.1875, simple_loss=0.279, pruned_loss=0.048, over 27098177.04 frames. ], batch size: 175, lr: 1.44e-03, grad_scale: 32.0 2023-10-13 19:50:50,037 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1482068.0, ans=10.0 2023-10-13 19:50:50,085 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1482068.0, ans=0.1 2023-10-13 19:50:50,325 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.01 vs. limit=15.0 2023-10-13 19:50:54,437 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1482068.0, ans=0.2 2023-10-13 19:52:00,106 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.512e+02 1.908e+02 2.055e+02 2.316e+02 3.688e+02, threshold=4.109e+02, percent-clipped=0.0 2023-10-13 19:52:16,115 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.60 vs. limit=15.0 2023-10-13 19:52:37,198 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1482394.6666666667, ans=0.125 2023-10-13 19:53:14,402 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.69 vs. 
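[editor's note] Each scaling.py:199 entry reports the current value (ans) of a ScheduledFloat at the given batch_count. A plausible sketch, assuming these are piecewise-linear schedules over batch_count that have reached their final breakpoint this late in training; the breakpoints below are made up for illustration:

```python
# Minimal sketch of a batch-count-keyed schedule like the ScheduledFloat
# values being logged (assumption: piecewise-linear interpolation between
# (batch_count, value) breakpoints, clamped at both ends).
def scheduled_float(schedule, batch_count):
    # schedule: sorted list of (batch_count, value) pairs
    if batch_count <= schedule[0][0]:
        return schedule[0][1]
    for (x0, y0), (x1, y1) in zip(schedule, schedule[1:]):
        if batch_count <= x1:
            t = (batch_count - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)
    return schedule[-1][1]

# e.g. a dropout-like prob that decayed to its final 0.125 long before now:
assert scheduled_float([(0.0, 0.3), (20000.0, 0.125)], 1479128.0) == 0.125
```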
limit=10.0 2023-10-13 19:53:16,542 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1482534.6666666667, ans=0.0 2023-10-13 19:53:28,300 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.28 vs. limit=15.0 2023-10-13 19:53:53,022 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1482674.6666666667, ans=0.2 2023-10-13 19:54:12,303 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=12.88 vs. limit=15.0 2023-10-13 19:54:15,152 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.453e+02 1.734e+02 1.847e+02 2.041e+02 2.770e+02, threshold=3.694e+02, percent-clipped=0.0 2023-10-13 19:54:17,934 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.84 vs. limit=15.0 2023-10-13 19:54:42,707 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1482861.3333333333, ans=0.1 2023-10-13 19:54:43,842 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1482861.3333333333, ans=0.125 2023-10-13 19:55:33,975 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.20 vs. limit=15.0 2023-10-13 19:55:59,527 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1483094.6666666667, ans=0.2 2023-10-13 19:56:31,415 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.434e+02 1.747e+02 1.864e+02 2.092e+02 2.829e+02, threshold=3.729e+02, percent-clipped=0.0 2023-10-13 19:56:36,727 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_abs, batch_count=1483234.6666666667, ans=0.5 2023-10-13 19:57:08,574 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1483374.6666666667, ans=10.0 2023-10-13 19:57:08,609 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1483374.6666666667, ans=0.1 2023-10-13 19:57:11,611 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1483374.6666666667, ans=0.125 2023-10-13 19:57:21,000 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1483421.3333333333, ans=0.1 2023-10-13 19:57:30,489 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1483421.3333333333, ans=0.125 2023-10-13 19:57:44,734 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=18.14 vs. 
limit=22.5 2023-10-13 19:58:12,413 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1483608.0, ans=0.125 2023-10-13 19:58:16,501 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1483608.0, ans=0.125 2023-10-13 19:58:27,352 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1483654.6666666667, ans=0.125 2023-10-13 19:58:37,281 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.17 vs. limit=15.0 2023-10-13 19:58:38,008 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 19:58:43,066 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.480e+02 1.728e+02 1.851e+02 2.050e+02 3.109e+02, threshold=3.702e+02, percent-clipped=0.0 2023-10-13 19:59:00,247 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.74 vs. limit=15.0 2023-10-13 19:59:03,863 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1483794.6666666667, ans=10.0 2023-10-13 19:59:18,354 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1483841.3333333333, ans=0.0 2023-10-13 19:59:21,387 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1483841.3333333333, ans=0.0 2023-10-13 19:59:27,081 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.39 vs. limit=6.0 2023-10-13 19:59:34,361 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1483888.0, ans=0.125 2023-10-13 20:00:00,148 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.02 vs. limit=22.5 2023-10-13 20:00:12,637 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1484028.0, ans=0.0 2023-10-13 20:00:43,039 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.400e+02 1.735e+02 1.880e+02 2.082e+02 3.056e+02, threshold=3.761e+02, percent-clipped=0.0 2023-10-13 20:01:14,123 INFO [train.py:1031] (3/4) Epoch 24, batch 4000, loss[loss=0.2037, simple_loss=0.2934, pruned_loss=0.05702, over 16592.00 frames. ], tot_loss[loss=0.1876, simple_loss=0.2788, pruned_loss=0.04819, over 28325157.32 frames. ], batch size: 219, lr: 1.44e-03, grad_scale: 32.0 2023-10-13 20:01:20,202 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1484308.0, ans=0.2 2023-10-13 20:01:25,666 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1484308.0, ans=0.125 2023-10-13 20:01:29,360 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.56 vs. 
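[editor's note] The scaling.py:979 entries compare a per-module whitening metric against a limit. One way such a metric could be defined, assuming it measures how far each group's feature covariance is from a scaled identity (1.0 would be perfectly "white"; the actual scaling.py formula may differ, and values above the limit presumably trigger a corrective penalty):

```python
import torch

def whitening_metric(x: torch.Tensor, num_groups: int) -> float:
    # x: (num_frames, num_channels); channels are split into groups
    n, c = x.shape
    x = x.reshape(n, num_groups, c // num_groups).transpose(0, 1)
    cov = torch.matmul(x.transpose(1, 2), x) / n       # (groups, d, d), PSD
    eigs = torch.linalg.eigvalsh(cov)                  # real, ascending
    # ratio of mean squared eigenvalue to squared mean eigenvalue: >= 1,
    # equal to 1 exactly when the covariance is a multiple of the identity
    return ((eigs ** 2).mean() / eigs.mean() ** 2).item()
```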
limit=15.0 2023-10-13 20:01:30,799 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1484354.6666666667, ans=0.0 2023-10-13 20:02:13,825 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1484494.6666666667, ans=0.125 2023-10-13 20:02:19,538 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.57 vs. limit=10.0 2023-10-13 20:02:26,121 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1484541.3333333333, ans=0.0 2023-10-13 20:02:30,364 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1484541.3333333333, ans=0.2 2023-10-13 20:02:45,015 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1484588.0, ans=0.125 2023-10-13 20:02:50,015 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1484634.6666666667, ans=0.125 2023-10-13 20:02:51,718 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.578e+02 1.845e+02 2.070e+02 2.232e+02 2.882e+02, threshold=4.140e+02, percent-clipped=0.0 2023-10-13 20:03:00,504 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1484681.3333333333, ans=0.125 2023-10-13 20:03:19,384 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1484728.0, ans=0.1 2023-10-13 20:03:36,261 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1484821.3333333333, ans=0.0 2023-10-13 20:03:38,887 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1484821.3333333333, ans=0.125 2023-10-13 20:03:54,858 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1484868.0, ans=0.1 2023-10-13 20:04:22,559 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.44 vs. 
limit=15.0 2023-10-13 20:04:33,837 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1485008.0, ans=0.0 2023-10-13 20:04:54,177 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1485101.3333333333, ans=0.95 2023-10-13 20:05:00,758 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.421e+02 1.829e+02 1.989e+02 2.181e+02 3.460e+02, threshold=3.977e+02, percent-clipped=0.0 2023-10-13 20:05:39,070 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1485194.6666666667, ans=0.09899494936611666 2023-10-13 20:05:51,109 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1485241.3333333333, ans=0.125 2023-10-13 20:05:57,248 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1485288.0, ans=0.125 2023-10-13 20:06:18,639 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.42 vs. limit=6.0 2023-10-13 20:06:18,639 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=1485334.6666666667, ans=6.0 2023-10-13 20:06:18,764 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=1485334.6666666667, ans=15.0 2023-10-13 20:06:29,029 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1485334.6666666667, ans=0.125 2023-10-13 20:07:08,982 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 20:07:26,386 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.536e+02 1.799e+02 1.940e+02 2.127e+02 2.910e+02, threshold=3.880e+02, percent-clipped=0.0 2023-10-13 20:07:32,047 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.05 vs. limit=15.0 2023-10-13 20:07:38,470 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1485614.6666666667, ans=0.0 2023-10-13 20:07:51,900 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1485661.3333333333, ans=0.125 2023-10-13 20:07:54,940 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1485708.0, ans=0.125 2023-10-13 20:08:19,154 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1485801.3333333333, ans=0.125 2023-10-13 20:08:27,257 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.40 vs. 
limit=15.0 2023-10-13 20:08:40,114 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1485848.0, ans=0.04949747468305833 2023-10-13 20:08:44,197 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1485894.6666666667, ans=0.125 2023-10-13 20:08:47,399 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 20:09:18,793 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 20:09:22,256 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1486034.6666666667, ans=0.2 2023-10-13 20:09:25,661 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.528e+02 1.838e+02 1.984e+02 2.202e+02 2.845e+02, threshold=3.969e+02, percent-clipped=0.0 2023-10-13 20:10:19,132 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1486268.0, ans=0.1 2023-10-13 20:10:26,757 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1486268.0, ans=0.09899494936611666 2023-10-13 20:10:57,484 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.73 vs. limit=15.0 2023-10-13 20:11:00,989 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1486361.3333333333, ans=0.125 2023-10-13 20:11:02,208 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1486408.0, ans=0.2 2023-10-13 20:11:26,643 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1486454.6666666667, ans=0.125 2023-10-13 20:11:28,756 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 20:11:34,228 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.506e+02 1.797e+02 1.973e+02 2.149e+02 3.326e+02, threshold=3.946e+02, percent-clipped=0.0 2023-10-13 20:11:49,782 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1486548.0, ans=0.04949747468305833 2023-10-13 20:12:07,152 INFO [train.py:1031] (3/4) Epoch 24, batch 4500, loss[loss=0.2189, simple_loss=0.3, pruned_loss=0.06889, over 16029.00 frames. ], tot_loss[loss=0.1877, simple_loss=0.2792, pruned_loss=0.04813, over 29306925.60 frames. 
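[editor's note] Several of the *_skip_rate schedules above (attention_skip_rate, conv_skip_rate, ff2/ff3_skip_rate) report ans=0.0 at this stage of training. A sketch of the stochastic-skip behaviour such a rate could control; the function is hypothetical, not icefall's implementation:

```python
import torch

def maybe_skip(module_out: torch.Tensor, skip_rate: float, training: bool):
    # With probability skip_rate during training, drop this submodule's
    # contribution entirely; skip_rate == 0.0 disables skipping, which is
    # what the logged schedules have decayed to by epoch 24.
    if training and skip_rate > 0.0 and torch.rand(()) < skip_rate:
        return torch.zeros_like(module_out)
    return module_out
```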
], batch size: 296, lr: 1.44e-03, grad_scale: 16.0 2023-10-13 20:12:18,446 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1486641.3333333333, ans=0.2 2023-10-13 20:12:25,416 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1486688.0, ans=0.0 2023-10-13 20:12:35,752 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1486734.6666666667, ans=0.1 2023-10-13 20:13:11,676 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 20:13:19,090 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1486921.3333333333, ans=0.125 2023-10-13 20:13:29,281 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.34 vs. limit=15.0 2023-10-13 20:13:36,517 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.484e+02 1.748e+02 1.897e+02 2.072e+02 3.009e+02, threshold=3.793e+02, percent-clipped=0.0 2023-10-13 20:13:37,118 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.39 vs. limit=15.0 2023-10-13 20:13:50,185 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1487014.6666666667, ans=0.125 2023-10-13 20:14:09,205 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 20:14:13,007 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=10.69 vs. limit=12.0 2023-10-13 20:14:16,629 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1487154.6666666667, ans=0.07 2023-10-13 20:14:27,658 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1487201.3333333333, ans=0.125 2023-10-13 20:14:27,701 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1487201.3333333333, ans=0.0 2023-10-13 20:14:29,618 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1487201.3333333333, ans=0.1 2023-10-13 20:14:40,852 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1487248.0, ans=0.1 2023-10-13 20:14:42,680 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=1487248.0, ans=0.025 2023-10-13 20:14:49,151 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=10.41 vs. 
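[editor's note] grad_scale in the train.py:1031 lines moves among 8.0, 16.0 and 32.0, the power-of-two pattern typical of dynamic fp16 loss scaling (use_fp16 is on for this run). A minimal sketch using PyTorch's stock GradScaler, assuming the run's scaler behaves similarly (halve on overflow, grow after a run of clean steps); the step function and its arguments are illustrative:

```python
import torch

scaler = torch.cuda.amp.GradScaler(init_scale=16.0, growth_factor=2.0,
                                   backoff_factor=0.5, growth_interval=2000)

def training_step(model, optimizer, loss_fn, batch):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = loss_fn(model, batch)
    scaler.scale(loss).backward()
    scaler.step(optimizer)   # skipped if inf/nan gradients were found
    scaler.update()          # adjusts grad_scale for the next step
```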
limit=22.5 2023-10-13 20:14:51,594 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1487294.6666666667, ans=0.0 2023-10-13 20:14:51,709 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1487294.6666666667, ans=0.125 2023-10-13 20:15:05,913 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1487341.3333333333, ans=0.1 2023-10-13 20:15:11,769 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1487388.0, ans=0.125 2023-10-13 20:15:22,161 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.84 vs. limit=15.0 2023-10-13 20:15:24,762 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1487434.6666666667, ans=0.125 2023-10-13 20:15:32,609 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.515e+02 1.814e+02 1.983e+02 2.147e+02 2.843e+02, threshold=3.966e+02, percent-clipped=0.0 2023-10-13 20:15:39,065 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1487481.3333333333, ans=0.0 2023-10-13 20:15:52,886 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=1487528.0, ans=10.0 2023-10-13 20:15:58,157 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1487528.0, ans=0.2 2023-10-13 20:16:04,428 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.72 vs. limit=22.5 2023-10-13 20:16:42,827 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1487668.0, ans=0.0 2023-10-13 20:16:54,109 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1487714.6666666667, ans=0.0 2023-10-13 20:17:06,833 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1487761.3333333333, ans=0.0 2023-10-13 20:17:10,641 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1487808.0, ans=0.125 2023-10-13 20:17:11,619 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1487808.0, ans=0.125 2023-10-13 20:17:13,008 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.44 vs. 
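[editor's note] The balancer entries carry constraint parameters such as min_positive, max_positive, max_abs and a prob. A diagnostic sketch of what those constraints could measure, assuming per-channel activation statistics; the real Balancer presumably applies gradient corrections rather than merely counting violations, and `prob` would be the chance a correction is applied on a given batch:

```python
import torch

def balancer_violations(x: torch.Tensor, min_positive=0.05,
                        max_positive=0.95, max_abs=10.0):
    # x: (num_frames, num_channels) activations
    pos_frac = (x > 0).float().mean(dim=0)     # fraction positive, per channel
    mean_abs = x.abs().mean(dim=0)             # mean magnitude, per channel
    return {
        "too_few_positive": (pos_frac < min_positive).sum().item(),
        "too_many_positive": (pos_frac > max_positive).sum().item(),
        "too_large": (mean_abs > max_abs).sum().item(),
    }
```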
limit=22.5 2023-10-13 20:17:15,374 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1487808.0, ans=0.125 2023-10-13 20:17:39,577 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.609e+02 1.817e+02 1.989e+02 2.174e+02 3.562e+02, threshold=3.977e+02, percent-clipped=0.0 2023-10-13 20:17:57,628 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1487994.6666666667, ans=0.125 2023-10-13 20:18:15,120 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.75 vs. limit=15.0 2023-10-13 20:18:15,769 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1488041.3333333333, ans=0.125 2023-10-13 20:18:16,586 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1488041.3333333333, ans=0.125 2023-10-13 20:18:19,615 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1488088.0, ans=0.2 2023-10-13 20:18:30,282 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=4.73 vs. limit=15.0 2023-10-13 20:18:38,017 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=4.10 vs. limit=12.0 2023-10-13 20:18:46,087 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.40 vs. limit=15.0 2023-10-13 20:18:53,046 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.55 vs. limit=12.0 2023-10-13 20:19:30,336 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1488321.3333333333, ans=0.2 2023-10-13 20:19:38,357 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=7.58 vs. limit=15.0 2023-10-13 20:19:41,895 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.489e+02 1.802e+02 1.954e+02 2.159e+02 2.924e+02, threshold=3.908e+02, percent-clipped=0.0 2023-10-13 20:19:52,133 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1488414.6666666667, ans=0.0 2023-10-13 20:19:54,257 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.64 vs. limit=22.5 2023-10-13 20:20:03,193 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1488461.3333333333, ans=0.1 2023-10-13 20:20:25,966 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.68 vs. 
limit=15.0 2023-10-13 20:20:27,855 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1488554.6666666667, ans=0.0 2023-10-13 20:20:30,951 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.41 vs. limit=15.0 2023-10-13 20:20:36,057 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1488554.6666666667, ans=0.125 2023-10-13 20:21:11,916 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.min_positive, batch_count=1488694.6666666667, ans=0.025 2023-10-13 20:21:17,517 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1488741.3333333333, ans=0.0 2023-10-13 20:21:21,508 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1488741.3333333333, ans=0.125 2023-10-13 20:21:32,518 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1488788.0, ans=0.1 2023-10-13 20:21:32,552 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1488788.0, ans=0.2 2023-10-13 20:21:50,273 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.535e+02 1.829e+02 1.969e+02 2.151e+02 3.037e+02, threshold=3.937e+02, percent-clipped=0.0 2023-10-13 20:21:56,051 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1488834.6666666667, ans=0.1 2023-10-13 20:22:14,712 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1488928.0, ans=0.0 2023-10-13 20:22:18,049 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.34 vs. limit=22.5 2023-10-13 20:22:22,490 INFO [train.py:1031] (3/4) Epoch 24, batch 5000, loss[loss=0.1833, simple_loss=0.2488, pruned_loss=0.05896, over 12560.00 frames. ], tot_loss[loss=0.1878, simple_loss=0.279, pruned_loss=0.04832, over 30057505.10 frames. ], batch size: 440, lr: 1.44e-03, grad_scale: 16.0 2023-10-13 20:23:20,164 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1489208.0, ans=0.125 2023-10-13 20:23:53,433 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.511e+02 1.819e+02 1.959e+02 2.203e+02 3.725e+02, threshold=3.918e+02, percent-clipped=0.0 2023-10-13 20:23:59,681 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.02 vs. 
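[editor's note] The batch 5000 line reports batch size 440 over only 12560 frames, against batch size 202 over 16772 frames at batch 3000: large batches of short cuts, exactly what a duration-budgeted bucketing sampler produces. A toy sketch of greedy packing against a fixed duration budget; the 700 s value is the run's max_duration, and pack_batches is illustrative, not lhotse's sampler:

```python
def pack_batches(cut_durations, max_duration=700.0):
    # Greedily pack cuts until adding one more would exceed the budget,
    # so shorter cuts yield larger batch sizes.
    batch, total, batches = [], 0.0, []
    for dur in cut_durations:
        if batch and total + dur > max_duration:
            batches.append(batch)
            batch, total = [], 0.0
        batch.append(dur)
        total += dur
    if batch:
        batches.append(batch)
    return batches

# 2-second cuts pack 350 to a batch under a 700 s budget:
assert len(pack_batches([2.0] * 700)[0]) == 350
```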
limit=6.0 2023-10-13 20:24:04,249 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1489348.0, ans=0.125 2023-10-13 20:24:18,801 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1489394.6666666667, ans=0.125 2023-10-13 20:24:31,365 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1489441.3333333333, ans=0.125 2023-10-13 20:24:35,189 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1489441.3333333333, ans=0.125 2023-10-13 20:24:37,817 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1489441.3333333333, ans=0.2 2023-10-13 20:25:02,643 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1489534.6666666667, ans=0.0 2023-10-13 20:25:37,698 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1489674.6666666667, ans=0.0 2023-10-13 20:25:44,785 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1489721.3333333333, ans=0.0 2023-10-13 20:25:47,866 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.06 vs. limit=10.0 2023-10-13 20:25:50,550 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1489721.3333333333, ans=0.125 2023-10-13 20:25:58,411 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1489768.0, ans=0.2 2023-10-13 20:25:58,439 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 20:26:00,968 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.498e+02 1.835e+02 1.945e+02 2.232e+02 2.832e+02, threshold=3.889e+02, percent-clipped=0.0 2023-10-13 20:26:06,331 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1489814.6666666667, ans=0.1 2023-10-13 20:26:08,176 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_na.min_abs, batch_count=1489814.6666666667, ans=0.02 2023-10-13 20:26:14,464 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.32 vs. limit=22.5 2023-10-13 20:26:15,026 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1489814.6666666667, ans=0.125 2023-10-13 20:26:15,209 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1489814.6666666667, ans=0.125 2023-10-13 20:26:20,277 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.99 vs. 
limit=22.5 2023-10-13 20:26:22,685 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1489861.3333333333, ans=0.125 2023-10-13 20:26:36,286 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1489908.0, ans=0.125 2023-10-13 20:26:56,327 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.92 vs. limit=10.0 2023-10-13 20:27:03,532 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1490001.3333333333, ans=0.2 2023-10-13 20:27:14,129 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1490048.0, ans=0.1 2023-10-13 20:27:15,440 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1490048.0, ans=0.125 2023-10-13 20:27:28,561 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1490094.6666666667, ans=0.2 2023-10-13 20:27:35,339 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1490094.6666666667, ans=0.025 2023-10-13 20:28:09,000 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1490234.6666666667, ans=0.125 2023-10-13 20:28:20,279 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.536e+02 1.900e+02 2.127e+02 2.360e+02 3.191e+02, threshold=4.254e+02, percent-clipped=0.0 2023-10-13 20:28:23,444 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1490281.3333333333, ans=0.0 2023-10-13 20:28:28,014 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1490281.3333333333, ans=0.2 2023-10-13 20:29:10,191 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1490421.3333333333, ans=0.125 2023-10-13 20:29:14,086 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1490421.3333333333, ans=0.125 2023-10-13 20:29:26,638 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1490468.0, ans=0.0 2023-10-13 20:29:41,084 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1490514.6666666667, ans=0.0 2023-10-13 20:30:29,046 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1490701.3333333333, ans=0.09899494936611666 2023-10-13 20:30:35,148 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1490701.3333333333, ans=0.95 2023-10-13 20:30:35,637 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.458e+02 1.745e+02 1.939e+02 2.233e+02 2.996e+02, threshold=3.878e+02, percent-clipped=0.0 2023-10-13 20:30:58,434 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1490794.6666666667, ans=0.0 2023-10-13 20:31:38,916 INFO [scaling.py:199] (3/4) 
ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=1490934.6666666667, ans=10.0 2023-10-13 20:31:45,293 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1490981.3333333333, ans=0.0 2023-10-13 20:32:35,527 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1491168.0, ans=0.125 2023-10-13 20:32:44,062 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.572e+02 1.763e+02 1.916e+02 2.100e+02 3.166e+02, threshold=3.832e+02, percent-clipped=0.0 2023-10-13 20:33:02,101 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1491261.3333333333, ans=0.1 2023-10-13 20:33:05,949 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1491261.3333333333, ans=0.0 2023-10-13 20:33:08,919 INFO [train.py:1031] (3/4) Epoch 24, batch 5500, loss[loss=0.184, simple_loss=0.28, pruned_loss=0.04401, over 16938.00 frames. ], tot_loss[loss=0.1875, simple_loss=0.2787, pruned_loss=0.04809, over 30647709.63 frames. ], batch size: 82, lr: 1.44e-03, grad_scale: 8.0 2023-10-13 20:33:15,171 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1491308.0, ans=0.0 2023-10-13 20:33:22,960 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1491354.6666666667, ans=0.125 2023-10-13 20:33:40,943 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1491401.3333333333, ans=0.2 2023-10-13 20:33:46,070 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.35 vs. limit=15.0 2023-10-13 20:33:50,068 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.50 vs. 
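[editor's note] The logged lr of 1.44e-03 at epoch 24 is consistent with an inverse-power (Eden-style) schedule in both batch and epoch. A hedged reconstruction: the exact optim.py formula and the assumed global step count are guesses, while base_lr=0.045, lr_batches=7500 and lr_epochs=1 come from the run config:

```python
def eden_lr(base_lr, batch, epoch, lr_batches=7500.0, lr_epochs=1.0):
    # inverse-quarter-power decay in batches, same in epochs (assumed form)
    batch_factor = ((batch / lr_batches) ** 2 + 1) ** -0.25
    epoch_factor = ((epoch / lr_epochs) ** 2 + 1) ** -0.25
    return base_lr * batch_factor * epoch_factor

# ~305k optimizer steps by epoch 24 (plausible for GigaSpeech XL at
# max_duration=700 on 4 GPUs) reproduces the logged value:
print(eden_lr(0.045, 305_000, 24))   # ~1.44e-03
```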
limit=15.0 2023-10-13 20:33:52,267 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1491448.0, ans=0.1 2023-10-13 20:33:54,790 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1491448.0, ans=6.0 2023-10-13 20:34:01,578 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1491494.6666666667, ans=0.0 2023-10-13 20:34:11,161 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1491541.3333333333, ans=0.0 2023-10-13 20:34:14,915 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1491541.3333333333, ans=0.1 2023-10-13 20:34:24,546 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1491588.0, ans=0.125 2023-10-13 20:34:33,885 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1491634.6666666667, ans=0.1 2023-10-13 20:34:37,577 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1491634.6666666667, ans=0.2 2023-10-13 20:34:41,153 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.491e+02 1.786e+02 1.976e+02 2.269e+02 3.044e+02, threshold=3.951e+02, percent-clipped=0.0 2023-10-13 20:34:44,435 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1491681.3333333333, ans=0.125 2023-10-13 20:34:47,090 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1491681.3333333333, ans=0.125 2023-10-13 20:34:47,204 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1491681.3333333333, ans=0.2 2023-10-13 20:35:06,318 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1491774.6666666667, ans=0.125 2023-10-13 20:35:17,464 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1491821.3333333333, ans=0.1 2023-10-13 20:35:58,876 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1491961.3333333333, ans=0.0 2023-10-13 20:36:06,619 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1492008.0, ans=0.125 2023-10-13 20:36:07,641 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.05 vs. limit=10.0 2023-10-13 20:36:18,176 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1492054.6666666667, ans=0.125 2023-10-13 20:36:25,640 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.15 vs. 
limit=15.0 2023-10-13 20:36:35,910 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.488e+02 1.765e+02 1.937e+02 2.133e+02 2.789e+02, threshold=3.874e+02, percent-clipped=0.0 2023-10-13 20:36:39,272 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1492148.0, ans=0.125 2023-10-13 20:37:00,590 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1492194.6666666667, ans=0.125 2023-10-13 20:37:05,456 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1492241.3333333333, ans=0.125 2023-10-13 20:37:20,364 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1492288.0, ans=0.95 2023-10-13 20:37:21,548 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.58 vs. limit=6.0 2023-10-13 20:37:28,343 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1492288.0, ans=0.09899494936611666 2023-10-13 20:37:44,266 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=4.62 vs. limit=12.0 2023-10-13 20:38:00,239 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1492428.0, ans=0.2 2023-10-13 20:38:10,499 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1492474.6666666667, ans=0.125 2023-10-13 20:38:29,310 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.51 vs. limit=22.5 2023-10-13 20:38:41,896 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.530e+02 1.815e+02 1.960e+02 2.161e+02 2.790e+02, threshold=3.920e+02, percent-clipped=0.0 2023-10-13 20:39:10,106 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1492708.0, ans=0.125 2023-10-13 20:39:27,042 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.39 vs. limit=10.0 2023-10-13 20:39:42,736 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1492801.3333333333, ans=0.0 2023-10-13 20:39:55,480 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 20:39:58,417 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.70 vs. 
limit=10.0 2023-10-13 20:40:31,913 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=1492988.0, ans=0.5 2023-10-13 20:40:39,225 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1493034.6666666667, ans=0.125 2023-10-13 20:40:39,257 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1493034.6666666667, ans=0.1 2023-10-13 20:40:50,181 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.54 vs. limit=6.0 2023-10-13 20:40:51,689 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.433e+02 1.843e+02 1.993e+02 2.201e+02 2.970e+02, threshold=3.986e+02, percent-clipped=0.0 2023-10-13 20:41:03,903 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1493128.0, ans=0.0 2023-10-13 20:41:18,384 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1493174.6666666667, ans=0.07 2023-10-13 20:41:40,664 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1493221.3333333333, ans=0.125 2023-10-13 20:41:46,507 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=6.10 vs. limit=15.0 2023-10-13 20:41:48,342 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1493221.3333333333, ans=0.125 2023-10-13 20:41:52,390 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.60 vs. limit=15.0 2023-10-13 20:42:00,779 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.24 vs. limit=10.0 2023-10-13 20:42:16,530 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1493314.6666666667, ans=0.1 2023-10-13 20:42:19,988 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1493314.6666666667, ans=0.0 2023-10-13 20:42:23,745 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.81 vs. 
limit=15.0 2023-10-13 20:42:24,985 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1493361.3333333333, ans=0.0 2023-10-13 20:42:27,459 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1493361.3333333333, ans=0.1 2023-10-13 20:42:34,703 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1493361.3333333333, ans=0.07 2023-10-13 20:42:38,742 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1493408.0, ans=0.04949747468305833 2023-10-13 20:42:41,976 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff3.min_abs, batch_count=1493408.0, ans=0.2 2023-10-13 20:43:12,244 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1493501.3333333333, ans=0.1 2023-10-13 20:43:15,797 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.363e+02 1.859e+02 2.041e+02 2.194e+02 3.007e+02, threshold=4.081e+02, percent-clipped=0.0 2023-10-13 20:43:31,876 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1493594.6666666667, ans=0.0 2023-10-13 20:43:42,328 INFO [train.py:1031] (3/4) Epoch 24, batch 6000, loss[loss=0.1815, simple_loss=0.2688, pruned_loss=0.04703, over 16924.00 frames. ], tot_loss[loss=0.188, simple_loss=0.2793, pruned_loss=0.04837, over 31130496.59 frames. ], batch size: 82, lr: 1.44e-03, grad_scale: 16.0 2023-10-13 20:43:43,776 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.81 vs. limit=22.5 2023-10-13 20:43:47,082 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1493641.3333333333, ans=0.0 2023-10-13 20:43:59,081 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1493688.0, ans=0.0 2023-10-13 20:44:00,023 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1493688.0, ans=0.125 2023-10-13 20:44:15,186 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1493734.6666666667, ans=0.125 2023-10-13 20:44:27,766 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer_ff3.min_abs, batch_count=1493781.3333333333, ans=0.2 2023-10-13 20:44:54,501 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1493874.6666666667, ans=0.07 2023-10-13 20:45:12,497 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1493968.0, ans=0.0 2023-10-13 20:45:17,398 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1493968.0, ans=0.125 2023-10-13 20:45:17,770 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=14.32 vs. 
limit=22.5 2023-10-13 20:45:23,619 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.551e+02 1.806e+02 2.001e+02 2.189e+02 3.130e+02, threshold=4.001e+02, percent-clipped=0.0 2023-10-13 20:45:32,473 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1494014.6666666667, ans=0.0 2023-10-13 20:45:32,752 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.01 vs. limit=15.0 2023-10-13 20:45:59,329 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=3.93 vs. limit=10.0 2023-10-13 20:45:59,972 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=1494108.0, ans=0.5 2023-10-13 20:46:04,597 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1494154.6666666667, ans=0.125 2023-10-13 20:46:11,015 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1494154.6666666667, ans=0.0 2023-10-13 20:46:14,577 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1494201.3333333333, ans=0.0 2023-10-13 20:46:25,999 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1494248.0, ans=0.1 2023-10-13 20:46:35,032 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=9.33 vs. limit=22.5 2023-10-13 20:46:40,740 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1494294.6666666667, ans=0.2 2023-10-13 20:46:43,203 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1494294.6666666667, ans=0.0 2023-10-13 20:46:49,572 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.22 vs. limit=15.0 2023-10-13 20:46:56,022 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1494341.3333333333, ans=0.0 2023-10-13 20:47:05,166 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.max_positive, batch_count=1494388.0, ans=0.95 2023-10-13 20:47:28,969 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1494434.6666666667, ans=0.125 2023-10-13 20:47:29,695 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.571e+02 1.863e+02 2.002e+02 2.323e+02 3.238e+02, threshold=4.004e+02, percent-clipped=0.0 2023-10-13 20:47:38,367 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1494481.3333333333, ans=0.0 2023-10-13 20:47:56,582 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1494574.6666666667, ans=0.5 2023-10-13 20:48:00,703 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.67 vs. 
limit=22.5 2023-10-13 20:48:11,909 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.34 vs. limit=10.0 2023-10-13 20:48:47,351 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.64 vs. limit=15.0 2023-10-13 20:48:51,855 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 20:49:20,503 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1494854.6666666667, ans=0.125 2023-10-13 20:49:26,595 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 20:49:31,301 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.40 vs. limit=22.5 2023-10-13 20:49:39,338 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.440e+02 1.830e+02 2.035e+02 2.330e+02 3.349e+02, threshold=4.070e+02, percent-clipped=0.0 2023-10-13 20:49:58,367 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1494994.6666666667, ans=0.0 2023-10-13 20:50:09,571 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1495041.3333333333, ans=0.125 2023-10-13 20:50:14,287 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1495041.3333333333, ans=0.2 2023-10-13 20:50:18,031 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1495088.0, ans=0.2 2023-10-13 20:50:29,941 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1495134.6666666667, ans=0.2 2023-10-13 20:50:57,395 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1495228.0, ans=0.1 2023-10-13 20:51:00,857 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1495228.0, ans=0.125 2023-10-13 20:51:12,542 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1495274.6666666667, ans=0.0 2023-10-13 20:51:33,821 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.39 vs. 
limit=22.5 2023-10-13 20:51:45,633 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 20:51:47,764 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1495368.0, ans=0.0 2023-10-13 20:51:52,772 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.535e+02 1.810e+02 2.047e+02 2.268e+02 5.508e+02, threshold=4.094e+02, percent-clipped=1.0 2023-10-13 20:52:19,028 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1495508.0, ans=0.125 2023-10-13 20:52:26,791 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=1495554.6666666667, ans=15.0 2023-10-13 20:52:31,787 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.08 vs. limit=10.0 2023-10-13 20:52:37,419 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.64 vs. limit=15.0 2023-10-13 20:52:37,459 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.42 vs. limit=15.0 2023-10-13 20:52:38,276 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1495601.3333333333, ans=0.125 2023-10-13 20:52:38,281 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1495601.3333333333, ans=0.2 2023-10-13 20:53:06,687 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1495694.6666666667, ans=0.125 2023-10-13 20:53:16,945 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1495741.3333333333, ans=0.125 2023-10-13 20:53:21,071 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1495741.3333333333, ans=0.1 2023-10-13 20:53:47,608 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.22 vs. limit=15.0 2023-10-13 20:53:50,721 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1495834.6666666667, ans=0.125 2023-10-13 20:53:55,841 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.499e+02 1.804e+02 1.971e+02 2.270e+02 3.189e+02, threshold=3.942e+02, percent-clipped=0.0 2023-10-13 20:54:19,811 INFO [train.py:1031] (3/4) Epoch 24, batch 6500, loss[loss=0.1903, simple_loss=0.2828, pruned_loss=0.04887, over 16867.00 frames. ], tot_loss[loss=0.188, simple_loss=0.2795, pruned_loss=0.04825, over 31494903.59 frames. 
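
In the tot_loss[...] records, "loss" is not an independent quantity but the weighted sum of the two pruned-transducer terms. Assuming the usual icefall weighting after warm-up, in which the simple (trivial-joiner) loss enters with scale 0.5 (an assumption, but one these numbers are consistent with), the batch-6500 summary just above reproduces exactly; "batch size" counts utterances, "lr" is the current learning rate, and "grad_scale" is the fp16 loss-scale factor.

    # Recomputing the logged tot_loss at epoch 24, batch 6500 (assumed formula):
    simple_loss_scale = 0.5              # assumed post-warm-up weighting
    simple_loss, pruned_loss = 0.2795, 0.04825
    loss = simple_loss_scale * simple_loss + pruned_loss
    print(round(loss, 3))                # 0.188, matching tot_loss[loss=0.188, ...]
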
], batch size: 87, lr: 1.43e-03, grad_scale: 16.0 2023-10-13 20:54:23,875 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1495974.6666666667, ans=0.125 2023-10-13 20:55:04,596 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1496114.6666666667, ans=0.0 2023-10-13 20:55:12,548 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1496114.6666666667, ans=0.1 2023-10-13 20:55:13,625 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.37 vs. limit=15.0 2023-10-13 20:55:31,008 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.68 vs. limit=15.0 2023-10-13 20:55:34,482 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1496208.0, ans=0.1 2023-10-13 20:56:01,208 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1496301.3333333333, ans=0.0 2023-10-13 20:56:07,879 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.88 vs. limit=15.0 2023-10-13 20:56:13,913 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.563e+02 1.824e+02 2.010e+02 2.224e+02 3.840e+02, threshold=4.020e+02, percent-clipped=0.0 2023-10-13 20:56:18,991 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1496348.0, ans=0.1 2023-10-13 20:56:21,909 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1496348.0, ans=0.0 2023-10-13 20:56:41,669 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.27 vs. 
limit=22.5 2023-10-13 20:57:04,598 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1496534.6666666667, ans=0.125 2023-10-13 20:57:12,833 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1496581.3333333333, ans=0.125 2023-10-13 20:57:13,806 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1496581.3333333333, ans=0.5 2023-10-13 20:57:25,510 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1496628.0, ans=0.2 2023-10-13 20:57:28,637 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1496628.0, ans=0.125 2023-10-13 20:57:44,235 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1496674.6666666667, ans=0.125 2023-10-13 20:57:45,217 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1496674.6666666667, ans=0.125 2023-10-13 20:57:46,726 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1496674.6666666667, ans=0.125 2023-10-13 20:57:50,630 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1496674.6666666667, ans=0.2 2023-10-13 20:58:05,327 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1496768.0, ans=0.0 2023-10-13 20:58:16,451 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.567e+02 1.827e+02 1.985e+02 2.156e+02 2.860e+02, threshold=3.969e+02, percent-clipped=0.0 2023-10-13 20:58:53,474 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.95 vs. limit=10.0 2023-10-13 20:59:53,641 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1497188.0, ans=0.125 2023-10-13 20:59:54,005 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.37 vs. 
limit=6.0 2023-10-13 21:00:10,030 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1497234.6666666667, ans=0.0 2023-10-13 21:00:19,975 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.534e+02 1.772e+02 1.893e+02 2.084e+02 2.902e+02, threshold=3.785e+02, percent-clipped=0.0 2023-10-13 21:00:21,854 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1497281.3333333333, ans=0.1 2023-10-13 21:00:42,758 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1497328.0, ans=0.0 2023-10-13 21:01:02,883 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1497374.6666666667, ans=0.0 2023-10-13 21:01:02,890 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1497374.6666666667, ans=0.2 2023-10-13 21:01:22,522 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1497468.0, ans=0.125 2023-10-13 21:01:29,366 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=5.03 vs. limit=15.0 2023-10-13 21:01:49,067 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1497561.3333333333, ans=0.125 2023-10-13 21:02:09,378 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=6.66 vs. limit=15.0 2023-10-13 21:02:14,877 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1497608.0, ans=0.0 2023-10-13 21:02:23,387 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1497654.6666666667, ans=0.125 2023-10-13 21:02:49,022 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.469e+02 1.686e+02 1.833e+02 2.073e+02 2.986e+02, threshold=3.666e+02, percent-clipped=0.0 2023-10-13 21:03:10,166 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1497841.3333333333, ans=0.0 2023-10-13 21:03:41,304 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1497934.6666666667, ans=0.05 2023-10-13 21:03:46,426 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.00 vs. 
limit=12.0 2023-10-13 21:04:00,296 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1497981.3333333333, ans=0.2 2023-10-13 21:04:03,796 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1497981.3333333333, ans=0.0 2023-10-13 21:04:18,654 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1498074.6666666667, ans=0.125 2023-10-13 21:04:44,457 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1498168.0, ans=0.1 2023-10-13 21:04:52,701 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1498168.0, ans=0.0 2023-10-13 21:04:53,566 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1498168.0, ans=0.0 2023-10-13 21:05:00,896 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.392e+02 1.851e+02 2.104e+02 2.321e+02 3.870e+02, threshold=4.208e+02, percent-clipped=1.0 2023-10-13 21:05:22,817 INFO [train.py:1031] (3/4) Epoch 24, batch 7000, loss[loss=0.2099, simple_loss=0.2949, pruned_loss=0.0625, over 16607.00 frames. ], tot_loss[loss=0.1884, simple_loss=0.2802, pruned_loss=0.04834, over 31780436.20 frames. ], batch size: 61, lr: 1.43e-03, grad_scale: 16.0 2023-10-13 21:06:18,472 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1498448.0, ans=0.0 2023-10-13 21:06:21,136 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1498448.0, ans=0.125 2023-10-13 21:06:53,315 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1498588.0, ans=0.125 2023-10-13 21:06:53,441 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1498588.0, ans=0.125 2023-10-13 21:07:14,086 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.490e+02 1.780e+02 1.929e+02 2.087e+02 3.734e+02, threshold=3.858e+02, percent-clipped=0.0 2023-10-13 21:07:25,947 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1498728.0, ans=0.0 2023-10-13 21:07:28,336 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.12 vs. 
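
The Whitening: records compare a per-module "metric" against a (scheduled) "limit". The metric measures, roughly, how far the channel covariance of the module's output is from a multiple of the identity: it is 1.0 for perfectly white features and grows as variance concentrates in a few directions, and only when it exceeds the limit does the Whiten module start penalizing the activations. A rough sketch of such a metric, under the assumption that it is the ratio of the mean squared eigenvalue to the squared mean eigenvalue of the covariance (an approximation, not the exact scaling.Whiten computation):

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
        # x: (num_frames, num_channels); ~1.0 when each group's channel
        # covariance is proportional to the identity, larger otherwise.
        n, c = x.shape
        cpg = c // num_groups                               # channels per group
        xg = x.reshape(n, num_groups, cpg).transpose(0, 1)  # (groups, n, cpg)
        xg = xg - xg.mean(dim=1, keepdim=True)
        cov = xg.transpose(1, 2) @ xg / n                   # (groups, cpg, cpg)
        mean_sq_eig = (cov ** 2).sum(dim=(1, 2)) / cpg      # mean(eigenvalue**2)
        sq_mean_eig = torch.diagonal(cov, dim1=1, dim2=2).mean(dim=1) ** 2
        return (mean_sq_eig / sq_mean_eig).mean().item()

    print(whitening_metric(torch.randn(4000, 384)))  # close to 1 for white noise
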
limit=6.0 2023-10-13 21:07:35,160 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1498774.6666666667, ans=0.1 2023-10-13 21:07:39,042 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1498774.6666666667, ans=0.0 2023-10-13 21:07:44,781 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1498821.3333333333, ans=0.125 2023-10-13 21:07:50,934 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1498821.3333333333, ans=0.0 2023-10-13 21:07:56,407 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.31 vs. limit=10.0 2023-10-13 21:07:56,866 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1498868.0, ans=0.2 2023-10-13 21:08:01,036 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1498868.0, ans=0.125 2023-10-13 21:08:35,759 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1499008.0, ans=0.125 2023-10-13 21:08:46,371 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1499054.6666666667, ans=0.2 2023-10-13 21:08:50,533 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1499054.6666666667, ans=0.125 2023-10-13 21:09:01,485 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff2.min_abs, batch_count=1499101.3333333333, ans=0.1 2023-10-13 21:09:11,155 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1499101.3333333333, ans=0.125 2023-10-13 21:09:14,655 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.51 vs. 
limit=15.0 2023-10-13 21:09:14,887 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.419e+02 1.799e+02 1.931e+02 2.121e+02 2.811e+02, threshold=3.863e+02, percent-clipped=0.0 2023-10-13 21:09:16,212 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1499148.0, ans=0.1 2023-10-13 21:09:32,243 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1499194.6666666667, ans=0.1 2023-10-13 21:10:26,399 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1499334.6666666667, ans=0.1 2023-10-13 21:10:47,613 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1499428.0, ans=0.0 2023-10-13 21:11:12,939 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1499474.6666666667, ans=0.1 2023-10-13 21:11:14,543 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1499474.6666666667, ans=0.0 2023-10-13 21:11:14,713 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1499474.6666666667, ans=0.0 2023-10-13 21:11:17,642 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1499521.3333333333, ans=0.1 2023-10-13 21:11:48,391 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.538e+02 1.761e+02 1.974e+02 2.191e+02 4.595e+02, threshold=3.948e+02, percent-clipped=1.0 2023-10-13 21:11:52,121 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff2.min_abs, batch_count=1499614.6666666667, ans=0.1 2023-10-13 21:12:04,496 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1499661.3333333333, ans=0.125 2023-10-13 21:12:29,872 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1499754.6666666667, ans=0.125 2023-10-13 21:12:50,475 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1499801.3333333333, ans=0.0 2023-10-13 21:13:12,774 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.16 vs. 
limit=15.0 2023-10-13 21:13:29,558 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1499941.3333333333, ans=0.125 2023-10-13 21:13:59,926 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1500034.6666666667, ans=0.0 2023-10-13 21:14:02,946 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 21:14:06,826 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.491e+02 1.748e+02 1.903e+02 2.094e+02 3.174e+02, threshold=3.806e+02, percent-clipped=0.0 2023-10-13 21:14:15,869 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1500128.0, ans=0.1 2023-10-13 21:14:31,749 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1500174.6666666667, ans=0.125 2023-10-13 21:14:37,418 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1500174.6666666667, ans=0.0 2023-10-13 21:14:50,174 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1500221.3333333333, ans=0.2 2023-10-13 21:14:54,197 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1500268.0, ans=0.125 2023-10-13 21:15:09,240 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=1500314.6666666667, ans=0.05 2023-10-13 21:15:40,534 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1500454.6666666667, ans=0.125 2023-10-13 21:15:43,942 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.30 vs. limit=6.0 2023-10-13 21:15:46,928 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1500454.6666666667, ans=0.125 2023-10-13 21:15:51,131 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1500501.3333333333, ans=0.125 2023-10-13 21:15:51,199 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1500501.3333333333, ans=0.125 2023-10-13 21:15:53,832 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1500501.3333333333, ans=0.0 2023-10-13 21:16:03,968 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1500501.3333333333, ans=0.1 2023-10-13 21:16:08,605 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.494e+02 1.833e+02 2.032e+02 2.250e+02 3.239e+02, threshold=4.063e+02, percent-clipped=0.0 2023-10-13 21:16:21,603 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1500594.6666666667, ans=0.0 2023-10-13 21:16:31,166 INFO [train.py:1031] (3/4) Epoch 24, batch 7500, loss[loss=0.1816, simple_loss=0.2774, pruned_loss=0.04291, over 16950.00 frames. ], tot_loss[loss=0.1884, simple_loss=0.2801, pruned_loss=0.04838, over 32012602.81 frames. 
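
The "Clipping_scale=2.0, grad-norm quartiles ..." records describe adaptive gradient clipping in the optimizer: the five numbers span the recent gradient-norm distribution (min, 25%, median, 75%, max), and the clipping "threshold" tracks clipping_scale times the running median, so only unusually large gradients are scaled down ("percent-clipped" reports how often that happened). The last such entry above is self-consistent: 2.0 x 2.068e+02 = 4.136e+02, the printed threshold. A sketch of the scheme (illustrative bookkeeping, not icefall's exact ScaledAdam implementation):

    import collections
    import statistics
    import torch

    class MedianGradClipper:
        # Keep a window of recent global grad norms; clip to
        # clipping_scale * median(window) when a norm exceeds it.
        def __init__(self, clipping_scale: float = 2.0, window: int = 128):
            self.clipping_scale = clipping_scale
            self.norms = collections.deque(maxlen=window)

        def clip_(self, params) -> bool:
            grads = [p.grad for p in params if p.grad is not None]
            norm = torch.norm(torch.stack([g.norm() for g in grads])).item()
            self.norms.append(norm)
            threshold = self.clipping_scale * statistics.median(self.norms)
            if norm > threshold:
                for g in grads:
                    g.mul_(threshold / norm)
                return True      # counted toward percent-clipped
            return False
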
], batch size: 93, lr: 1.43e-03, grad_scale: 16.0 2023-10-13 21:16:47,758 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.76 vs. limit=15.0 2023-10-13 21:17:10,731 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=6.86 vs. limit=15.0 2023-10-13 21:17:14,272 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=1500781.3333333333, ans=0.05 2023-10-13 21:17:35,368 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1500828.0, ans=0.125 2023-10-13 21:17:47,959 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1500874.6666666667, ans=0.125 2023-10-13 21:18:03,523 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1500921.3333333333, ans=0.0 2023-10-13 21:18:08,320 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.37 vs. limit=15.0 2023-10-13 21:18:21,604 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1501014.6666666667, ans=0.0 2023-10-13 21:18:25,368 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.541e+02 1.856e+02 2.084e+02 2.355e+02 3.282e+02, threshold=4.168e+02, percent-clipped=0.0 2023-10-13 21:18:25,799 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1501014.6666666667, ans=0.1 2023-10-13 21:18:25,830 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1501014.6666666667, ans=0.125 2023-10-13 21:19:02,018 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1501154.6666666667, ans=0.0 2023-10-13 21:19:16,566 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1501201.3333333333, ans=0.0 2023-10-13 21:19:26,371 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1501201.3333333333, ans=0.1 2023-10-13 21:20:19,799 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1501388.0, ans=0.1 2023-10-13 21:20:30,864 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1501434.6666666667, ans=0.2 2023-10-13 21:20:48,700 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.421e+02 1.872e+02 2.068e+02 2.239e+02 3.340e+02, threshold=4.136e+02, percent-clipped=0.0 2023-10-13 21:21:00,098 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1501528.0, ans=0.0 2023-10-13 21:21:06,215 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1501528.0, ans=0.125 2023-10-13 21:21:09,106 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1501528.0, 
ans=0.125 2023-10-13 21:21:37,962 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1501668.0, ans=0.125 2023-10-13 21:21:38,012 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1501668.0, ans=0.0 2023-10-13 21:21:40,623 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1501668.0, ans=0.1 2023-10-13 21:21:55,303 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1501714.6666666667, ans=0.0 2023-10-13 21:21:57,671 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.68 vs. limit=15.0 2023-10-13 21:22:35,406 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1501808.0, ans=0.2 2023-10-13 21:23:15,600 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1501948.0, ans=0.125 2023-10-13 21:23:20,865 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.458e+02 1.771e+02 1.944e+02 2.197e+02 2.926e+02, threshold=3.888e+02, percent-clipped=0.0 2023-10-13 21:23:40,500 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.61 vs. limit=15.0 2023-10-13 21:23:54,891 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1502041.3333333333, ans=0.0 2023-10-13 21:24:20,554 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1502134.6666666667, ans=0.125 2023-10-13 21:24:39,997 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1502181.3333333333, ans=0.1 2023-10-13 21:24:46,597 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 21:25:13,703 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1502274.6666666667, ans=0.07 2023-10-13 21:25:28,576 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1502321.3333333333, ans=0.2 2023-10-13 21:25:33,881 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.62 vs. 
limit=12.0 2023-10-13 21:25:37,737 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1502368.0, ans=0.125 2023-10-13 21:25:37,847 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1502368.0, ans=0.0 2023-10-13 21:25:51,315 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1502414.6666666667, ans=0.125 2023-10-13 21:25:56,310 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1502414.6666666667, ans=0.2 2023-10-13 21:25:56,834 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.527e+02 1.819e+02 1.925e+02 2.103e+02 2.878e+02, threshold=3.851e+02, percent-clipped=0.0 2023-10-13 21:26:04,235 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1502414.6666666667, ans=0.0 2023-10-13 21:26:26,898 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1502508.0, ans=0.125 2023-10-13 21:26:29,844 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1502508.0, ans=0.0 2023-10-13 21:26:46,696 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1502554.6666666667, ans=0.0 2023-10-13 21:27:00,154 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 21:27:05,150 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1502648.0, ans=0.1 2023-10-13 21:27:15,457 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.72 vs. limit=6.0 2023-10-13 21:27:17,907 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1502694.6666666667, ans=0.125 2023-10-13 21:27:27,132 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1502741.3333333333, ans=0.125 2023-10-13 21:27:37,293 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.25 vs. limit=15.0 2023-10-13 21:27:38,802 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1502788.0, ans=0.125 2023-10-13 21:27:51,519 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.59 vs. limit=6.0 2023-10-13 21:28:09,721 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1502881.3333333333, ans=0.2 2023-10-13 21:28:09,990 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.77 vs. 
limit=6.0 2023-10-13 21:28:10,218 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.461e+02 1.707e+02 1.814e+02 1.947e+02 2.778e+02, threshold=3.629e+02, percent-clipped=0.0 2023-10-13 21:28:19,531 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1502928.0, ans=0.0 2023-10-13 21:28:23,406 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1502928.0, ans=0.125 2023-10-13 21:28:24,181 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1502928.0, ans=0.125 2023-10-13 21:28:29,554 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.62 vs. limit=15.0 2023-10-13 21:28:31,882 INFO [train.py:1031] (3/4) Epoch 24, batch 8000, loss[loss=0.1657, simple_loss=0.2609, pruned_loss=0.03522, over 16357.00 frames. ], tot_loss[loss=0.1876, simple_loss=0.2796, pruned_loss=0.04776, over 32209833.90 frames. ], batch size: 50, lr: 1.43e-03, grad_scale: 32.0 2023-10-13 21:28:40,411 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.39 vs. limit=15.0 2023-10-13 21:28:50,599 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=1503021.3333333333, ans=0.05 2023-10-13 21:28:51,402 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 21:29:16,520 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1503114.6666666667, ans=0.125 2023-10-13 21:29:48,954 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1503254.6666666667, ans=0.5 2023-10-13 21:29:52,407 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1503254.6666666667, ans=0.04949747468305833 2023-10-13 21:29:55,509 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1503254.6666666667, ans=0.125 2023-10-13 21:30:00,222 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1503301.3333333333, ans=0.1 2023-10-13 21:30:00,258 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1503301.3333333333, ans=0.1 2023-10-13 21:30:02,516 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1503301.3333333333, ans=0.07 2023-10-13 21:30:13,546 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.512e+02 1.866e+02 2.082e+02 2.483e+02 3.426e+02, threshold=4.164e+02, percent-clipped=0.0 2023-10-13 21:30:32,483 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1503441.3333333333, ans=0.0 2023-10-13 21:30:36,503 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff3.min_abs, batch_count=1503441.3333333333, ans=0.2 2023-10-13 21:30:40,617 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, 
num_channels=256, metric=11.64 vs. limit=22.5 2023-10-13 21:30:47,690 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1503488.0, ans=0.125 2023-10-13 21:31:06,215 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.35 vs. limit=15.0 2023-10-13 21:31:12,207 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1503581.3333333333, ans=0.0 2023-10-13 21:31:21,413 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1503628.0, ans=0.125 2023-10-13 21:31:24,063 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1503628.0, ans=0.125 2023-10-13 21:31:37,501 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1503674.6666666667, ans=0.2 2023-10-13 21:32:19,159 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1503768.0, ans=0.125 2023-10-13 21:32:32,212 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.67 vs. limit=15.0 2023-10-13 21:32:34,444 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.495e+02 1.760e+02 1.949e+02 2.138e+02 2.963e+02, threshold=3.899e+02, percent-clipped=0.0 2023-10-13 21:32:55,829 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.21 vs. limit=15.0 2023-10-13 21:32:57,214 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1503861.3333333333, ans=0.2 2023-10-13 21:33:03,825 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1503908.0, ans=0.5 2023-10-13 21:33:31,738 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1504001.3333333333, ans=0.125 2023-10-13 21:33:39,974 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.88 vs. limit=10.0 2023-10-13 21:33:57,650 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1504094.6666666667, ans=0.2 2023-10-13 21:33:57,882 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.41 vs. limit=12.0 2023-10-13 21:34:04,497 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.34 vs. 
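
The jump of grad_scale from 16.0 to 32.0 between the batch-7500 and batch-8000 summaries is the dynamic fp16 loss scale growing: under mixed precision the loss is multiplied by this factor before backward so that small gradients stay representable in fp16, gradients are unscaled before the optimizer step, and the factor is doubled after a run of overflow-free steps and cut back on overflow. Standard PyTorch usage (the GradScaler settings and toy model below are assumptions for illustration, not the values used in this run):

    import torch

    model = torch.nn.Linear(80, 500).cuda()   # toy model; requires a GPU
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)
    scaler = torch.cuda.amp.GradScaler(init_scale=16.0, growth_factor=2.0,
                                       growth_interval=2000)
    for step in range(3):
        opt.zero_grad()
        with torch.cuda.amp.autocast():
            loss = model(torch.randn(8, 80, device="cuda")).sum()
        scaler.scale(loss).backward()   # scale loss so fp16 grads don't underflow
        scaler.step(opt)                # unscales grads; skips the step on inf/nan
        scaler.update()                 # doubles the scale after enough clean steps
    print(scaler.get_scale())           # the value reported as grad_scale
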
limit=22.5 2023-10-13 21:34:20,809 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1504188.0, ans=0.125 2023-10-13 21:34:47,465 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.468e+02 1.801e+02 1.915e+02 2.100e+02 3.063e+02, threshold=3.830e+02, percent-clipped=0.0 2023-10-13 21:35:15,393 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1504374.6666666667, ans=0.0 2023-10-13 21:35:49,799 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1504468.0, ans=0.0 2023-10-13 21:35:56,615 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.10 vs. limit=6.0 2023-10-13 21:35:59,577 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1504514.6666666667, ans=0.125 2023-10-13 21:36:14,829 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1504561.3333333333, ans=0.125 2023-10-13 21:36:17,307 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1504561.3333333333, ans=0.125 2023-10-13 21:36:31,781 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.74 vs. limit=6.0 2023-10-13 21:36:47,678 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.83 vs. limit=22.5 2023-10-13 21:36:59,609 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.90 vs. limit=15.0 2023-10-13 21:37:01,229 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1504748.0, ans=0.125 2023-10-13 21:37:02,498 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.14 vs. limit=15.0 2023-10-13 21:37:05,876 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.764e+02 1.883e+02 2.123e+02 2.958e+02, threshold=3.766e+02, percent-clipped=0.0 2023-10-13 21:37:39,402 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1504888.0, ans=0.0 2023-10-13 21:37:50,588 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.47 vs. limit=22.5 2023-10-13 21:37:58,734 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1504934.6666666667, ans=0.0 2023-10-13 21:38:02,243 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1504934.6666666667, ans=0.2 2023-10-13 21:38:12,608 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.89 vs. 
limit=15.0 2023-10-13 21:38:13,821 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1504981.3333333333, ans=0.09899494936611666 2023-10-13 21:38:24,783 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1504981.3333333333, ans=0.125 2023-10-13 21:38:33,444 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1505028.0, ans=0.125 2023-10-13 21:38:36,570 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1505028.0, ans=0.125 2023-10-13 21:39:12,801 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1505168.0, ans=0.2 2023-10-13 21:39:25,441 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.93 vs. limit=15.0 2023-10-13 21:39:27,203 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.520e+02 1.843e+02 1.978e+02 2.134e+02 3.003e+02, threshold=3.957e+02, percent-clipped=0.0 2023-10-13 21:39:49,804 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.79 vs. limit=15.0 2023-10-13 21:39:50,163 INFO [train.py:1031] (3/4) Epoch 24, batch 8500, loss[loss=0.1922, simple_loss=0.2929, pruned_loss=0.04573, over 16908.00 frames. ], tot_loss[loss=0.1876, simple_loss=0.2799, pruned_loss=0.0477, over 32348038.26 frames. ], batch size: 87, lr: 1.43e-03, grad_scale: 16.0 2023-10-13 21:40:09,703 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.02 vs. limit=15.0 2023-10-13 21:40:35,421 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.45 vs. 
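
The many balancer*.prob, min_abs, max_abs, min_positive and max_positive constants belong to Balancer modules, which keep per-channel activation statistics inside a target range (fraction of positive values, mean absolute value) by adding a small correction to the gradient in the backward pass while leaving the forward pass untouched; "prob" is the probability that the check runs on a given batch, and it is one of the quantities annealed in the ScheduledFloat records. A toy version of the forward-identity / backward-nudge pattern (a deliberately simplified penalty, not the real scaling.Balancer):

    import torch

    class ToyBalancer(torch.autograd.Function):
        # Identity forward; backward adds a tiny push to channels whose
        # fraction of positive activations leaves [min_positive, max_positive].
        @staticmethod
        def forward(ctx, x, min_positive=0.05, max_positive=0.95, scale=1e-4):
            ctx.save_for_backward(x)
            ctx.cfg = (min_positive, max_positive, scale)
            return x

        @staticmethod
        def backward(ctx, grad_out):
            (x,) = ctx.saved_tensors
            min_positive, max_positive, scale = ctx.cfg
            pos = (x > 0).float().mean(dim=0)    # per-channel positive fraction
            nudge = scale * ((pos > max_positive).float()
                             - (pos < min_positive).float())
            # Gradient descent does x -= lr * grad, so a positive nudge
            # pushes too-positive channels down, a negative one pulls them up.
            return grad_out + nudge, None, None, None

    y = ToyBalancer.apply(torch.randn(32, 256, requires_grad=True))
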
limit=15.0 2023-10-13 21:40:48,963 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1505494.6666666667, ans=0.2 2023-10-13 21:41:00,620 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1505541.3333333333, ans=0.125 2023-10-13 21:41:10,694 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1505588.0, ans=0.0 2023-10-13 21:41:25,253 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1505634.6666666667, ans=0.125 2023-10-13 21:41:38,323 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.549e+02 1.976e+02 2.106e+02 2.450e+02 3.139e+02, threshold=4.212e+02, percent-clipped=0.0 2023-10-13 21:41:47,013 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten.whitening_limit, batch_count=1505728.0, ans=22.5 2023-10-13 21:41:47,943 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1505728.0, ans=0.2 2023-10-13 21:42:00,985 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1505774.6666666667, ans=0.125 2023-10-13 21:42:02,508 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=1505774.6666666667, ans=0.05 2023-10-13 21:42:05,307 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1505774.6666666667, ans=0.125 2023-10-13 21:42:24,244 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1505821.3333333333, ans=0.125 2023-10-13 21:42:31,389 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.53 vs. limit=10.0 2023-10-13 21:43:03,233 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=28.88 vs. 
limit=22.5 2023-10-13 21:43:14,440 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1506008.0, ans=0.0 2023-10-13 21:43:28,674 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1506054.6666666667, ans=0.0 2023-10-13 21:43:40,355 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1506054.6666666667, ans=0.0 2023-10-13 21:43:43,095 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1506101.3333333333, ans=0.0 2023-10-13 21:43:54,454 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1506101.3333333333, ans=0.0 2023-10-13 21:44:05,143 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1506148.0, ans=0.125 2023-10-13 21:44:07,328 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.482e+02 1.735e+02 1.925e+02 2.341e+02 2.945e+02, threshold=3.850e+02, percent-clipped=0.0 2023-10-13 21:44:21,938 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1506194.6666666667, ans=0.0 2023-10-13 21:44:29,399 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.70 vs. limit=15.0 2023-10-13 21:44:30,059 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1506241.3333333333, ans=0.125 2023-10-13 21:44:44,152 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1506288.0, ans=0.0 2023-10-13 21:44:51,769 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1506334.6666666667, ans=0.0 2023-10-13 21:44:57,588 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1506334.6666666667, ans=0.0 2023-10-13 21:45:10,928 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1506381.3333333333, ans=0.0 2023-10-13 21:45:53,262 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1506521.3333333333, ans=0.0 2023-10-13 21:46:12,393 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1506568.0, ans=0.125 2023-10-13 21:46:24,315 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.431e+02 1.755e+02 1.951e+02 2.208e+02 2.996e+02, threshold=3.902e+02, percent-clipped=0.0 2023-10-13 21:46:30,654 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1506661.3333333333, ans=0.2 2023-10-13 21:46:36,848 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1506661.3333333333, ans=0.0 2023-10-13 21:46:48,336 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1506708.0, ans=0.125 2023-10-13 21:47:02,290 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, 
batch_count=1506754.6666666667, ans=0.0 2023-10-13 21:47:14,244 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 21:47:20,925 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1506848.0, ans=0.125 2023-10-13 21:47:32,342 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1506894.6666666667, ans=0.0 2023-10-13 21:47:38,709 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1506894.6666666667, ans=0.0 2023-10-13 21:47:50,063 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.69 vs. limit=10.0 2023-10-13 21:47:58,662 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1506988.0, ans=0.0 2023-10-13 21:47:58,683 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1506988.0, ans=0.0 2023-10-13 21:47:59,602 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1506988.0, ans=0.2 2023-10-13 21:48:29,475 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.524e+02 1.777e+02 1.964e+02 2.237e+02 3.329e+02, threshold=3.929e+02, percent-clipped=0.0 2023-10-13 21:48:41,990 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1507128.0, ans=0.2 2023-10-13 21:49:01,019 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1507174.6666666667, ans=0.1 2023-10-13 21:49:05,628 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1507221.3333333333, ans=0.1 2023-10-13 21:49:12,933 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.68 vs. limit=12.0 2023-10-13 21:49:22,011 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.28 vs. limit=15.0 2023-10-13 21:49:26,363 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1507268.0, ans=0.015 2023-10-13 21:49:41,306 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.29 vs. 
limit=10.0 2023-10-13 21:49:57,341 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1507408.0, ans=0.0 2023-10-13 21:50:25,714 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=1507501.3333333333, ans=10.0 2023-10-13 21:50:30,639 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1507548.0, ans=0.125 2023-10-13 21:50:33,450 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 21:50:37,411 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.539e+02 1.798e+02 1.992e+02 2.187e+02 2.759e+02, threshold=3.984e+02, percent-clipped=0.0 2023-10-13 21:50:43,693 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1507594.6666666667, ans=0.125 2023-10-13 21:50:43,806 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.79 vs. limit=15.0 2023-10-13 21:50:50,027 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1507594.6666666667, ans=0.1 2023-10-13 21:50:54,721 INFO [train.py:1031] (3/4) Epoch 24, batch 9000, loss[loss=0.1891, simple_loss=0.2856, pruned_loss=0.04632, over 16584.00 frames. ], tot_loss[loss=0.187, simple_loss=0.2791, pruned_loss=0.04739, over 32455664.29 frames. ], batch size: 219, lr: 1.43e-03, grad_scale: 16.0 2023-10-13 21:51:10,818 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1507688.0, ans=0.0 2023-10-13 21:51:33,678 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.43 vs. limit=15.0 2023-10-13 21:51:39,492 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1507781.3333333333, ans=0.1 2023-10-13 21:51:44,773 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1507828.0, ans=0.2 2023-10-13 21:51:50,173 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1507828.0, ans=0.125 2023-10-13 21:51:51,721 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.12 vs. limit=12.0 2023-10-13 21:51:54,303 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.08 vs. 
limit=15.0 2023-10-13 21:52:03,658 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1507874.6666666667, ans=0.0 2023-10-13 21:52:12,181 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1507921.3333333333, ans=0.125 2023-10-13 21:52:16,897 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1507968.0, ans=0.0 2023-10-13 21:52:34,956 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.438e+02 1.777e+02 1.911e+02 2.138e+02 4.741e+02, threshold=3.821e+02, percent-clipped=1.0 2023-10-13 21:52:36,734 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1508014.6666666667, ans=0.015 2023-10-13 21:52:37,809 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1508014.6666666667, ans=0.125 2023-10-13 21:52:46,516 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.44 vs. limit=10.0 2023-10-13 21:53:21,900 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.79 vs. limit=15.0 2023-10-13 21:53:27,405 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=4.04 vs. limit=12.0 2023-10-13 21:53:29,230 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.42 vs. limit=6.0 2023-10-13 21:53:31,593 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1508248.0, ans=0.125 2023-10-13 21:53:40,465 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1508248.0, ans=0.125 2023-10-13 21:53:40,818 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.50 vs. 
limit=15.0 2023-10-13 21:53:47,150 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1508294.6666666667, ans=0.125 2023-10-13 21:54:23,788 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1508434.6666666667, ans=0.0 2023-10-13 21:54:40,141 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.525e+02 1.887e+02 2.094e+02 2.329e+02 3.341e+02, threshold=4.189e+02, percent-clipped=0.0 2023-10-13 21:54:43,658 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1508528.0, ans=0.125 2023-10-13 21:54:58,051 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1508574.6666666667, ans=0.0 2023-10-13 21:55:08,301 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1508621.3333333333, ans=0.1 2023-10-13 21:55:11,902 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1508621.3333333333, ans=0.125 2023-10-13 21:55:21,334 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1508668.0, ans=0.1 2023-10-13 21:55:44,681 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1508761.3333333333, ans=0.125 2023-10-13 21:55:56,069 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1508808.0, ans=0.0 2023-10-13 21:56:14,069 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1508854.6666666667, ans=0.125 2023-10-13 21:56:15,961 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1508901.3333333333, ans=0.125 2023-10-13 21:56:26,806 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1508948.0, ans=0.0 2023-10-13 21:56:35,409 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.488e+02 1.847e+02 2.002e+02 2.224e+02 2.855e+02, threshold=4.004e+02, percent-clipped=0.0 2023-10-13 21:56:59,324 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1509041.3333333333, ans=0.125 2023-10-13 21:57:03,202 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1509088.0, ans=0.125 2023-10-13 21:57:04,219 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1509088.0, ans=0.125 2023-10-13 21:57:07,410 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1509088.0, ans=0.125 2023-10-13 21:57:23,190 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff2.min_abs, batch_count=1509134.6666666667, ans=0.1 2023-10-13 21:57:26,187 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.32 vs. 
limit=15.0 2023-10-13 21:57:31,767 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 21:57:41,131 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=1509228.0, ans=0.025 2023-10-13 21:57:41,162 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1509228.0, ans=0.125 2023-10-13 21:57:53,703 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1509274.6666666667, ans=0.125 2023-10-13 21:57:54,958 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1509274.6666666667, ans=0.1 2023-10-13 21:58:25,938 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1509368.0, ans=0.1 2023-10-13 21:58:29,365 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1509368.0, ans=0.04949747468305833 2023-10-13 21:58:43,904 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.476e+02 1.798e+02 1.973e+02 2.115e+02 3.364e+02, threshold=3.946e+02, percent-clipped=0.0 2023-10-13 21:58:54,488 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1509461.3333333333, ans=0.125 2023-10-13 21:58:58,129 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1509461.3333333333, ans=0.0 2023-10-13 21:59:17,272 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=1509554.6666666667, ans=0.5 2023-10-13 21:59:17,343 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1509554.6666666667, ans=0.125 2023-10-13 21:59:22,043 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=5.05 vs. 
limit=5.0 2023-10-13 21:59:43,919 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1509648.0, ans=0.125 2023-10-13 21:59:51,654 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1509694.6666666667, ans=0.04949747468305833 2023-10-13 22:00:08,016 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1509741.3333333333, ans=0.2 2023-10-13 22:00:25,753 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1509834.6666666667, ans=0.125 2023-10-13 22:00:30,035 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1509834.6666666667, ans=0.05 2023-10-13 22:00:35,681 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1509881.3333333333, ans=0.1 2023-10-13 22:00:40,701 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1509881.3333333333, ans=0.125 2023-10-13 22:00:46,319 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.509e+02 1.835e+02 2.050e+02 2.297e+02 3.409e+02, threshold=4.100e+02, percent-clipped=0.0 2023-10-13 22:00:51,134 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1509928.0, ans=0.0 2023-10-13 22:00:59,646 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1509928.0, ans=0.125 2023-10-13 22:01:01,169 INFO [train.py:1031] (3/4) Epoch 24, batch 9500, loss[loss=0.2012, simple_loss=0.2952, pruned_loss=0.05358, over 16480.00 frames. ], tot_loss[loss=0.1875, simple_loss=0.2797, pruned_loss=0.04772, over 32471062.00 frames. ], batch size: 266, lr: 1.43e-03, grad_scale: 16.0 2023-10-13 22:01:23,045 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.69 vs. limit=15.0 2023-10-13 22:01:54,738 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.35 vs. 
limit=15.0 2023-10-13 22:02:41,076 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1510301.3333333333, ans=0.1 2023-10-13 22:02:51,475 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.527e+02 1.807e+02 2.011e+02 2.342e+02 3.306e+02, threshold=4.023e+02, percent-clipped=0.0 2023-10-13 22:02:55,860 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1510394.6666666667, ans=0.125 2023-10-13 22:03:04,867 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1510394.6666666667, ans=0.125 2023-10-13 22:03:37,647 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1510534.6666666667, ans=0.0 2023-10-13 22:03:45,909 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1510534.6666666667, ans=0.025 2023-10-13 22:03:56,884 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1510581.3333333333, ans=0.07 2023-10-13 22:04:04,717 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.51 vs. limit=10.0 2023-10-13 22:04:09,673 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1510628.0, ans=0.1 2023-10-13 22:04:09,861 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.84 vs. limit=15.0 2023-10-13 22:04:14,823 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.60 vs. 
limit=6.0 2023-10-13 22:04:25,080 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1510674.6666666667, ans=0.0 2023-10-13 22:04:30,067 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1510674.6666666667, ans=0.2 2023-10-13 22:04:44,235 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1510721.3333333333, ans=0.125 2023-10-13 22:04:56,290 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1510768.0, ans=0.1 2023-10-13 22:04:57,842 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1510768.0, ans=0.04949747468305833 2023-10-13 22:05:11,317 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.492e+02 1.735e+02 1.864e+02 2.171e+02 3.387e+02, threshold=3.727e+02, percent-clipped=0.0 2023-10-13 22:05:29,836 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1510861.3333333333, ans=0.2 2023-10-13 22:05:33,474 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1510908.0, ans=0.1 2023-10-13 22:07:00,236 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1511234.6666666667, ans=0.125 2023-10-13 22:07:10,468 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1511281.3333333333, ans=0.1 2023-10-13 22:07:13,416 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1511281.3333333333, ans=0.1 2023-10-13 22:07:17,362 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.597e+02 1.834e+02 2.020e+02 2.337e+02 3.582e+02, threshold=4.040e+02, percent-clipped=0.0 2023-10-13 22:07:21,221 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1511328.0, ans=0.2 2023-10-13 22:07:22,187 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1511328.0, ans=0.0 2023-10-13 22:08:19,895 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1511561.3333333333, ans=0.2 2023-10-13 22:08:37,999 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_positive, batch_count=1511608.0, ans=0.05 2023-10-13 22:08:50,952 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1511654.6666666667, ans=0.0 2023-10-13 22:09:06,795 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1511701.3333333333, ans=0.125 2023-10-13 22:09:17,688 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.418e+02 1.720e+02 1.869e+02 2.083e+02 3.448e+02, threshold=3.738e+02, percent-clipped=0.0 2023-10-13 22:09:25,349 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1511794.6666666667, ans=0.1 2023-10-13 22:09:41,738 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1511841.3333333333, ans=0.125 2023-10-13 22:09:59,464 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1511934.6666666667, ans=0.125 2023-10-13 22:10:04,808 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=13.87 vs. limit=22.5 2023-10-13 22:10:08,288 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1511981.3333333333, ans=0.0 2023-10-13 22:10:10,675 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.51 vs. limit=15.0 2023-10-13 22:10:12,361 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1511981.3333333333, ans=0.125 2023-10-13 22:10:16,310 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1511981.3333333333, ans=0.2 2023-10-13 22:10:21,843 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1512028.0, ans=0.125 2023-10-13 22:10:31,622 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=5.053e-03 2023-10-13 22:10:31,629 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1512074.6666666667, ans=0.125 2023-10-13 22:11:13,791 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.509e+02 1.777e+02 1.989e+02 2.271e+02 3.617e+02, threshold=3.978e+02, percent-clipped=0.0 2023-10-13 22:11:27,589 INFO [train.py:1031] (3/4) Epoch 24, batch 10000, loss[loss=0.1913, simple_loss=0.2735, pruned_loss=0.05456, over 16684.00 frames. ], tot_loss[loss=0.1869, simple_loss=0.2789, pruned_loss=0.04747, over 32537086.19 frames. ], batch size: 202, lr: 1.43e-03, grad_scale: 32.0 2023-10-13 22:11:34,842 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1512308.0, ans=0.5 2023-10-13 22:11:42,033 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1512354.6666666667, ans=0.1 2023-10-13 22:11:50,663 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=10.57 vs. 
limit=22.5 2023-10-13 22:12:01,285 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1512401.3333333333, ans=0.125 2023-10-13 22:12:04,081 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1512448.0, ans=0.07 2023-10-13 22:12:11,514 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1512448.0, ans=0.125 2023-10-13 22:12:39,973 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1512541.3333333333, ans=0.0 2023-10-13 22:12:41,487 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1512588.0, ans=0.2 2023-10-13 22:12:52,456 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1512588.0, ans=0.0 2023-10-13 22:12:56,301 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1512634.6666666667, ans=0.125 2023-10-13 22:13:06,226 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1512634.6666666667, ans=0.1 2023-10-13 22:13:18,598 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.492e+02 1.847e+02 1.993e+02 2.292e+02 3.385e+02, threshold=3.986e+02, percent-clipped=0.0 2023-10-13 22:13:28,280 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1512728.0, ans=0.125 2023-10-13 22:13:33,675 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1512728.0, ans=0.125 2023-10-13 22:13:55,640 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1512821.3333333333, ans=10.0 2023-10-13 22:14:02,105 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.63 vs. limit=12.0 2023-10-13 22:14:06,775 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.31 vs. limit=15.0 2023-10-13 22:14:10,864 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=3.88 vs. limit=12.0 2023-10-13 22:14:19,336 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.20 vs. limit=15.0 2023-10-13 22:14:26,244 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1512914.6666666667, ans=0.125 2023-10-13 22:14:32,995 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1512961.3333333333, ans=0.125 2023-10-13 22:14:35,188 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1512961.3333333333, ans=0.125 2023-10-13 22:14:40,645 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=5.23 vs. 
limit=12.0 2023-10-13 22:14:44,990 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1513008.0, ans=0.1 2023-10-13 22:14:59,305 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1513054.6666666667, ans=0.025 2023-10-13 22:15:09,109 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1513101.3333333333, ans=0.125 2023-10-13 22:15:26,030 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.427e+02 1.844e+02 2.032e+02 2.209e+02 2.810e+02, threshold=4.065e+02, percent-clipped=0.0 2023-10-13 22:15:48,611 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 22:15:53,310 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1513241.3333333333, ans=0.0 2023-10-13 22:16:25,570 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 22:16:56,098 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1513521.3333333333, ans=0.09899494936611666 2023-10-13 22:16:58,009 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=1513521.3333333333, ans=0.07 2023-10-13 22:17:16,399 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1513568.0, ans=0.0 2023-10-13 22:17:22,092 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1513614.6666666667, ans=0.0 2023-10-13 22:17:31,789 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.436e+02 1.773e+02 1.902e+02 2.049e+02 2.632e+02, threshold=3.803e+02, percent-clipped=0.0 2023-10-13 22:17:47,374 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.23 vs. limit=15.0 2023-10-13 22:17:47,437 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.37 vs. 
limit=10.0 2023-10-13 22:18:25,049 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1513848.0, ans=0.125 2023-10-13 22:18:40,607 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1513894.6666666667, ans=0.0 2023-10-13 22:19:04,590 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 22:19:35,582 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.484e+02 1.795e+02 1.975e+02 2.243e+02 3.397e+02, threshold=3.949e+02, percent-clipped=0.0 2023-10-13 22:20:01,101 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1514174.6666666667, ans=0.0 2023-10-13 22:20:10,443 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1514221.3333333333, ans=0.2 2023-10-13 22:20:33,615 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1514314.6666666667, ans=0.0 2023-10-13 22:20:40,899 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1514314.6666666667, ans=0.0 2023-10-13 22:20:52,358 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1514361.3333333333, ans=0.0 2023-10-13 22:21:05,714 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.61 vs. limit=10.0 2023-10-13 22:21:08,737 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1514454.6666666667, ans=0.0 2023-10-13 22:21:11,045 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1514454.6666666667, ans=0.1 2023-10-13 22:21:33,179 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 22:21:40,637 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.498e+02 1.753e+02 1.869e+02 2.082e+02 2.888e+02, threshold=3.739e+02, percent-clipped=0.0 2023-10-13 22:21:57,509 INFO [train.py:1031] (3/4) Epoch 24, batch 10500, loss[loss=0.1846, simple_loss=0.2524, pruned_loss=0.05834, over 12508.00 frames. ], tot_loss[loss=0.1872, simple_loss=0.2794, pruned_loss=0.0475, over 32617299.63 frames. ], batch size: 440, lr: 1.43e-03, grad_scale: 32.0 2023-10-13 22:21:57,680 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1514641.3333333333, ans=0.125 2023-10-13 22:22:02,194 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1514641.3333333333, ans=0.0 2023-10-13 22:22:54,634 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 22:23:31,455 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.10 vs. 
limit=15.0 2023-10-13 22:23:53,865 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.596e+02 1.828e+02 1.976e+02 2.160e+02 3.338e+02, threshold=3.953e+02, percent-clipped=0.0 2023-10-13 22:24:09,876 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.79 vs. limit=15.0 2023-10-13 22:24:19,587 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=11.84 vs. limit=15.0 2023-10-13 22:24:26,656 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1515154.6666666667, ans=0.1 2023-10-13 22:25:08,501 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1515294.6666666667, ans=0.125 2023-10-13 22:25:25,685 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1515341.3333333333, ans=0.0 2023-10-13 22:25:25,799 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1515341.3333333333, ans=0.125 2023-10-13 22:25:26,962 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.34 vs. limit=12.0 2023-10-13 22:25:28,580 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1515341.3333333333, ans=0.025 2023-10-13 22:25:40,373 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1515388.0, ans=0.125 2023-10-13 22:25:40,763 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.46 vs. limit=12.0 2023-10-13 22:26:04,383 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1515481.3333333333, ans=0.125 2023-10-13 22:26:05,812 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.487e+02 1.842e+02 1.948e+02 2.162e+02 2.869e+02, threshold=3.895e+02, percent-clipped=0.0 2023-10-13 22:26:19,944 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.96 vs. limit=15.0 2023-10-13 22:26:28,581 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.13 vs. limit=10.0 2023-10-13 22:26:34,287 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1515621.3333333333, ans=0.125 2023-10-13 22:26:46,985 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.81 vs. 
limit=12.0 2023-10-13 22:27:04,964 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1515714.6666666667, ans=0.0 2023-10-13 22:27:18,046 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=1515761.3333333333, ans=15.0 2023-10-13 22:27:32,293 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.01 vs. limit=15.0 2023-10-13 22:27:32,779 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1515808.0, ans=0.1 2023-10-13 22:27:33,184 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=9.18 vs. limit=22.5 2023-10-13 22:27:51,782 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1515854.6666666667, ans=0.0 2023-10-13 22:27:55,790 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.04 vs. limit=15.0 2023-10-13 22:28:07,265 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1515948.0, ans=0.0 2023-10-13 22:28:12,906 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1515948.0, ans=0.05 2023-10-13 22:28:16,281 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.546e+02 1.901e+02 2.142e+02 2.481e+02 3.748e+02, threshold=4.284e+02, percent-clipped=0.0 2023-10-13 22:28:19,643 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.74 vs. limit=15.0 2023-10-13 22:28:31,243 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.39 vs. 
limit=15.0 2023-10-13 22:28:32,645 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1516041.3333333333, ans=0.2 2023-10-13 22:28:42,912 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1516088.0, ans=0.1 2023-10-13 22:28:52,856 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1516134.6666666667, ans=0.04949747468305833 2023-10-13 22:29:01,141 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=1516134.6666666667, ans=0.025 2023-10-13 22:29:04,553 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1516181.3333333333, ans=0.1 2023-10-13 22:29:06,524 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1516181.3333333333, ans=0.125 2023-10-13 22:29:18,900 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1516228.0, ans=0.125 2023-10-13 22:29:32,849 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1516274.6666666667, ans=0.125 2023-10-13 22:29:50,653 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1516321.3333333333, ans=0.2 2023-10-13 22:30:13,424 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1516414.6666666667, ans=0.0 2023-10-13 22:30:15,872 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.504e+02 1.751e+02 1.974e+02 2.144e+02 2.984e+02, threshold=3.947e+02, percent-clipped=0.0 2023-10-13 22:30:36,032 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1516508.0, ans=0.0 2023-10-13 22:30:38,456 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.31 vs. limit=15.0 2023-10-13 22:30:44,649 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1516554.6666666667, ans=0.5 2023-10-13 22:30:50,760 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1516554.6666666667, ans=0.125 2023-10-13 22:31:07,591 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1516648.0, ans=0.0 2023-10-13 22:31:38,888 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.55 vs. 
limit=15.0 2023-10-13 22:31:42,548 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1516788.0, ans=0.0 2023-10-13 22:31:54,102 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1516834.6666666667, ans=0.125 2023-10-13 22:31:56,041 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1516834.6666666667, ans=0.0 2023-10-13 22:32:09,924 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.457e+02 1.774e+02 1.959e+02 2.223e+02 3.325e+02, threshold=3.919e+02, percent-clipped=0.0 2023-10-13 22:32:11,359 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1516881.3333333333, ans=0.1 2023-10-13 22:32:17,180 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-13 22:32:24,140 INFO [train.py:1031] (3/4) Epoch 24, batch 11000, loss[loss=0.1796, simple_loss=0.2755, pruned_loss=0.04192, over 16109.00 frames. ], tot_loss[loss=0.1874, simple_loss=0.2795, pruned_loss=0.04765, over 32660594.26 frames. ], batch size: 43, lr: 1.42e-03, grad_scale: 32.0 2023-10-13 22:32:34,225 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.98 vs. limit=10.0 2023-10-13 22:32:46,418 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1517068.0, ans=0.0 2023-10-13 22:33:00,289 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1517114.6666666667, ans=0.0 2023-10-13 22:33:12,336 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1517161.3333333333, ans=0.0 2023-10-13 22:33:20,073 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.20 vs. limit=6.0 2023-10-13 22:33:43,465 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1517254.6666666667, ans=0.0 2023-10-13 22:34:09,669 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.566e+02 1.852e+02 1.978e+02 2.339e+02 3.502e+02, threshold=3.957e+02, percent-clipped=0.0 2023-10-13 22:34:20,584 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1517394.6666666667, ans=0.0 2023-10-13 22:34:53,569 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1517534.6666666667, ans=0.0 2023-10-13 22:34:59,057 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1517534.6666666667, ans=0.0 2023-10-13 22:35:17,528 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1517581.3333333333, ans=0.1 2023-10-13 22:35:29,316 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.72 vs. 
limit=10.0 2023-10-13 22:35:30,897 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1517628.0, ans=0.125 2023-10-13 22:35:43,044 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1517674.6666666667, ans=0.1 2023-10-13 22:36:00,862 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.56 vs. limit=15.0 2023-10-13 22:36:01,901 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=5.69 vs. limit=15.0 2023-10-13 22:36:07,702 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1517768.0, ans=0.04949747468305833 2023-10-13 22:36:08,603 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1517768.0, ans=0.125 2023-10-13 22:36:15,946 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1517814.6666666667, ans=0.0 2023-10-13 22:36:17,901 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1517814.6666666667, ans=0.0 2023-10-13 22:36:18,578 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.447e+02 1.716e+02 1.878e+02 2.081e+02 2.828e+02, threshold=3.757e+02, percent-clipped=0.0 2023-10-13 22:36:46,202 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1517954.6666666667, ans=0.125 2023-10-13 22:36:51,503 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1517954.6666666667, ans=0.0 2023-10-13 22:36:51,587 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1517954.6666666667, ans=0.0 2023-10-13 22:36:59,429 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1518001.3333333333, ans=0.125 2023-10-13 22:37:00,492 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1518001.3333333333, ans=0.125 2023-10-13 22:37:09,920 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1518048.0, ans=0.125 2023-10-13 22:37:22,454 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1518094.6666666667, ans=0.2 2023-10-13 22:37:47,961 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1518188.0, ans=0.125 2023-10-13 22:37:52,184 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1518188.0, ans=0.2 2023-10-13 22:37:55,554 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.53 vs. 
limit=15.0 2023-10-13 22:38:00,402 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1518234.6666666667, ans=0.0 2023-10-13 22:38:02,619 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.35 vs. limit=15.0 2023-10-13 22:38:24,774 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.486e+02 1.816e+02 1.944e+02 2.162e+02 2.848e+02, threshold=3.887e+02, percent-clipped=0.0 2023-10-13 22:38:39,185 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1518328.0, ans=0.0 2023-10-13 22:38:46,552 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1518374.6666666667, ans=0.1 2023-10-13 22:39:46,036 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1518561.3333333333, ans=0.0 2023-10-13 22:39:55,972 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1518608.0, ans=0.1 2023-10-13 22:40:04,335 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1518654.6666666667, ans=0.125 2023-10-13 22:40:32,221 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.71 vs. limit=15.0 2023-10-13 22:40:37,537 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.419e+02 1.805e+02 1.963e+02 2.115e+02 3.367e+02, threshold=3.925e+02, percent-clipped=0.0 2023-10-13 22:40:48,235 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1518794.6666666667, ans=0.125 2023-10-13 22:41:13,687 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.62 vs. limit=10.0 2023-10-13 22:41:19,502 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.81 vs. limit=6.0 2023-10-13 22:41:53,714 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1519028.0, ans=0.125 2023-10-13 22:42:00,846 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.32 vs. limit=15.0 2023-10-13 22:42:24,450 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1519121.3333333333, ans=0.0 2023-10-13 22:42:29,615 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1519121.3333333333, ans=0.125 2023-10-13 22:42:40,297 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.85 vs. 
limit=15.0 2023-10-13 22:42:49,499 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1519214.6666666667, ans=0.125 2023-10-13 22:42:57,089 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.694e+02 2.072e+02 2.262e+02 2.562e+02 3.409e+02, threshold=4.525e+02, percent-clipped=0.0 2023-10-13 22:42:58,961 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.62 vs. limit=6.0 2023-10-13 22:43:11,313 INFO [train.py:1031] (3/4) Epoch 24, batch 11500, loss[loss=0.1692, simple_loss=0.2588, pruned_loss=0.03987, over 16212.00 frames. ], tot_loss[loss=0.1871, simple_loss=0.2793, pruned_loss=0.04745, over 32691151.11 frames. ], batch size: 50, lr: 1.42e-03, grad_scale: 16.0 2023-10-13 22:43:31,435 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.70 vs. limit=15.0 2023-10-13 22:43:32,969 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1519354.6666666667, ans=0.1 2023-10-13 22:43:34,025 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1519401.3333333333, ans=0.1 2023-10-13 22:43:48,498 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1519401.3333333333, ans=0.0 2023-10-13 22:43:54,567 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1519448.0, ans=0.025 2023-10-13 22:43:56,450 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.71 vs. limit=15.0 2023-10-13 22:43:59,299 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.52 vs. limit=12.0 2023-10-13 22:44:03,368 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1519494.6666666667, ans=0.125 2023-10-13 22:44:13,715 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1519541.3333333333, ans=0.0 2023-10-13 22:44:17,502 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.68 vs. 
limit=15.0 2023-10-13 22:44:55,057 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1519681.3333333333, ans=0.0 2023-10-13 22:44:59,424 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1519681.3333333333, ans=0.125 2023-10-13 22:45:03,828 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.504e+02 1.711e+02 1.857e+02 2.023e+02 2.645e+02, threshold=3.715e+02, percent-clipped=0.0 2023-10-13 22:45:16,270 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 22:45:16,297 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1519728.0, ans=0.0 2023-10-13 22:45:17,546 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1519728.0, ans=0.125 2023-10-13 22:45:32,130 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1519821.3333333333, ans=0.1 2023-10-13 22:45:37,778 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1519821.3333333333, ans=0.125 2023-10-13 22:45:38,658 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1519821.3333333333, ans=0.125 2023-10-13 22:45:52,812 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1519868.0, ans=0.125 2023-10-13 22:45:58,302 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.26 vs. limit=6.0 2023-10-13 22:46:00,972 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1519914.6666666667, ans=0.0 2023-10-13 22:46:15,107 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1519961.3333333333, ans=0.025 2023-10-13 22:46:23,120 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1520008.0, ans=0.0 2023-10-13 22:46:37,014 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1520054.6666666667, ans=0.1 2023-10-13 22:46:47,983 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1520101.3333333333, ans=0.125 2023-10-13 22:46:50,203 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1520101.3333333333, ans=0.125 2023-10-13 22:47:01,458 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.65 vs. 
limit=15.0 2023-10-13 22:47:06,202 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.427e+02 1.836e+02 2.015e+02 2.307e+02 3.757e+02, threshold=4.031e+02, percent-clipped=1.0 2023-10-13 22:47:26,491 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1520241.3333333333, ans=0.0 2023-10-13 22:47:27,412 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1520241.3333333333, ans=0.1 2023-10-13 22:47:41,634 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.13 vs. limit=6.0 2023-10-13 22:48:21,823 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1520474.6666666667, ans=0.125 2023-10-13 22:48:31,690 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1520474.6666666667, ans=0.125 2023-10-13 22:48:37,637 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1520521.3333333333, ans=0.125 2023-10-13 22:48:44,343 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1520521.3333333333, ans=0.125 2023-10-13 22:49:16,210 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.497e+02 1.804e+02 1.912e+02 2.236e+02 2.756e+02, threshold=3.825e+02, percent-clipped=0.0 2023-10-13 22:49:43,419 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1520708.0, ans=0.1 2023-10-13 22:50:02,757 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.63 vs. limit=15.0 2023-10-13 22:50:06,950 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.64 vs. limit=15.0 2023-10-13 22:50:25,457 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.07 vs. 
limit=15.0 2023-10-13 22:50:50,487 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1520988.0, ans=0.125 2023-10-13 22:50:52,592 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1520988.0, ans=0.125 2023-10-13 22:50:56,003 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1520988.0, ans=0.2 2023-10-13 22:51:02,178 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1521034.6666666667, ans=0.125 2023-10-13 22:51:23,643 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.566e+02 1.773e+02 1.904e+02 2.073e+02 2.644e+02, threshold=3.809e+02, percent-clipped=0.0 2023-10-13 22:51:35,025 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1521128.0, ans=0.2 2023-10-13 22:51:40,288 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1521174.6666666667, ans=0.125 2023-10-13 22:51:55,370 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1521221.3333333333, ans=0.125 2023-10-13 22:52:16,842 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1521268.0, ans=0.125 2023-10-13 22:52:19,558 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.61 vs. limit=15.0 2023-10-13 22:52:40,153 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1521361.3333333333, ans=0.125 2023-10-13 22:52:41,551 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1521361.3333333333, ans=0.05 2023-10-13 22:52:42,431 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.28 vs. limit=15.0 2023-10-13 22:52:50,329 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1521408.0, ans=0.0 2023-10-13 22:53:32,604 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.796e+02 1.987e+02 2.373e+02 3.644e+02, threshold=3.974e+02, percent-clipped=0.0 2023-10-13 22:53:32,934 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1521548.0, ans=0.07 2023-10-13 22:53:43,272 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=6.76 vs. limit=15.0 2023-10-13 22:53:46,445 INFO [train.py:1031] (3/4) Epoch 24, batch 12000, loss[loss=0.1872, simple_loss=0.2844, pruned_loss=0.045, over 16867.00 frames. ], tot_loss[loss=0.187, simple_loss=0.2795, pruned_loss=0.04727, over 32719446.88 frames. ], batch size: 87, lr: 1.42e-03, grad_scale: 32.0 2023-10-13 22:53:58,057 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.65 vs. 
limit=15.0 2023-10-13 22:54:03,233 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1521688.0, ans=0.0 2023-10-13 22:54:46,073 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.95 vs. limit=22.5 2023-10-13 22:55:11,292 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1521968.0, ans=0.0 2023-10-13 22:55:22,384 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1521968.0, ans=0.125 2023-10-13 22:55:23,745 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.24 vs. limit=15.0 2023-10-13 22:55:24,420 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1522014.6666666667, ans=0.0 2023-10-13 22:55:25,380 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1522014.6666666667, ans=0.2 2023-10-13 22:55:32,506 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.482e+02 1.790e+02 2.061e+02 2.349e+02 3.231e+02, threshold=4.123e+02, percent-clipped=0.0 2023-10-13 22:55:40,212 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1522061.3333333333, ans=0.1 2023-10-13 22:55:57,081 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1522154.6666666667, ans=0.2 2023-10-13 22:56:01,869 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.94 vs. limit=15.0 2023-10-13 22:56:03,285 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1522154.6666666667, ans=0.0 2023-10-13 22:56:09,307 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1522201.3333333333, ans=0.0 2023-10-13 22:56:13,010 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1522201.3333333333, ans=0.0 2023-10-13 22:56:13,017 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1522201.3333333333, ans=0.1 2023-10-13 22:56:36,089 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1522294.6666666667, ans=0.2 2023-10-13 22:56:37,515 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=9.79 vs. limit=22.5 2023-10-13 22:56:41,046 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=3.86 vs. 
limit=10.0 2023-10-13 22:56:51,128 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1522341.3333333333, ans=0.05 2023-10-13 22:57:27,498 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1522481.3333333333, ans=0.125 2023-10-13 22:57:31,096 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1522481.3333333333, ans=0.1 2023-10-13 22:57:32,704 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.496e+02 1.840e+02 2.002e+02 2.278e+02 2.944e+02, threshold=4.004e+02, percent-clipped=0.0 2023-10-13 22:57:38,841 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.29 vs. limit=12.0 2023-10-13 22:58:05,385 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1522621.3333333333, ans=0.0 2023-10-13 22:58:15,020 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1522668.0, ans=0.0 2023-10-13 22:58:21,130 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1522668.0, ans=0.125 2023-10-13 22:58:25,750 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1522668.0, ans=0.04949747468305833 2023-10-13 22:58:25,899 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.62 vs. limit=15.0 2023-10-13 22:58:35,828 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1522714.6666666667, ans=0.125 2023-10-13 22:58:38,040 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=12.04 vs. limit=22.5 2023-10-13 22:58:53,444 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.51 vs. limit=15.0 2023-10-13 22:59:17,665 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.66 vs. limit=15.0 2023-10-13 22:59:25,664 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1522948.0, ans=0.2 2023-10-13 22:59:36,075 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.554e+02 1.827e+02 1.975e+02 2.172e+02 3.021e+02, threshold=3.950e+02, percent-clipped=0.0 2023-10-13 22:59:41,263 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.85 vs. 
limit=15.0 2023-10-13 22:59:46,352 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1522994.6666666667, ans=0.125 2023-10-13 22:59:52,674 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1523041.3333333333, ans=0.125 2023-10-13 23:00:03,289 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1523088.0, ans=0.125 2023-10-13 23:00:12,279 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1523134.6666666667, ans=0.0 2023-10-13 23:00:59,280 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1523274.6666666667, ans=0.2 2023-10-13 23:01:13,661 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1523321.3333333333, ans=0.1 2023-10-13 23:01:29,122 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1523414.6666666667, ans=0.125 2023-10-13 23:01:31,680 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1523414.6666666667, ans=0.125 2023-10-13 23:01:39,479 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.450e+02 1.799e+02 1.936e+02 2.110e+02 2.856e+02, threshold=3.873e+02, percent-clipped=0.0 2023-10-13 23:01:47,740 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1523461.3333333333, ans=0.0 2023-10-13 23:01:57,251 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.07 vs. limit=10.0 2023-10-13 23:01:58,428 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1523508.0, ans=0.2 2023-10-13 23:02:01,160 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1523508.0, ans=0.1 2023-10-13 23:02:03,504 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1523554.6666666667, ans=0.125 2023-10-13 23:02:05,600 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1523554.6666666667, ans=0.0 2023-10-13 23:02:16,412 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.99 vs. limit=6.0 2023-10-13 23:02:31,019 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1523648.0, ans=0.0 2023-10-13 23:02:32,958 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1523648.0, ans=10.0 2023-10-13 23:02:33,754 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 23:02:52,021 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.01 vs. 
limit=6.0 2023-10-13 23:03:11,093 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1523788.0, ans=0.125 2023-10-13 23:03:17,387 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.95 vs. limit=12.0 2023-10-13 23:03:21,458 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.whiten.whitening_limit, batch_count=1523834.6666666667, ans=12.0 2023-10-13 23:03:23,471 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.43 vs. limit=15.0 2023-10-13 23:03:25,187 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1523834.6666666667, ans=0.1 2023-10-13 23:03:25,310 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1523834.6666666667, ans=0.125 2023-10-13 23:03:38,717 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.516e+02 1.859e+02 2.002e+02 2.187e+02 2.889e+02, threshold=4.004e+02, percent-clipped=0.0 2023-10-13 23:03:41,735 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1523928.0, ans=0.125 2023-10-13 23:03:50,957 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1523928.0, ans=0.0 2023-10-13 23:03:52,693 INFO [train.py:1031] (3/4) Epoch 24, batch 12500, loss[loss=0.204, simple_loss=0.2949, pruned_loss=0.0565, over 16953.00 frames. ], tot_loss[loss=0.1869, simple_loss=0.2792, pruned_loss=0.04727, over 32748637.76 frames. ], batch size: 156, lr: 1.42e-03, grad_scale: 8.0 2023-10-13 23:03:57,338 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1523974.6666666667, ans=0.1 2023-10-13 23:04:22,327 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.55 vs. limit=15.0 2023-10-13 23:04:25,695 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1524114.6666666667, ans=0.125 2023-10-13 23:04:26,079 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.86 vs. limit=15.0 2023-10-13 23:04:47,129 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.25 vs. 
limit=15.0 2023-10-13 23:05:34,223 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1524348.0, ans=0.125 2023-10-13 23:05:40,140 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.486e+02 1.805e+02 1.914e+02 2.061e+02 2.800e+02, threshold=3.829e+02, percent-clipped=0.0 2023-10-13 23:05:59,787 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1524441.3333333333, ans=0.125 2023-10-13 23:06:04,715 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1524488.0, ans=0.1 2023-10-13 23:06:50,355 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1524674.6666666667, ans=0.1 2023-10-13 23:06:55,576 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.88 vs. limit=6.0 2023-10-13 23:07:13,551 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1524721.3333333333, ans=0.2 2023-10-13 23:07:15,631 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1524768.0, ans=0.125 2023-10-13 23:07:16,691 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1524768.0, ans=0.04949747468305833 2023-10-13 23:07:22,289 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1524768.0, ans=0.1 2023-10-13 23:07:30,133 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1524814.6666666667, ans=0.125 2023-10-13 23:07:36,959 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.413e+02 1.863e+02 2.040e+02 2.295e+02 3.466e+02, threshold=4.080e+02, percent-clipped=0.0 2023-10-13 23:07:57,919 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.71 vs. 
limit=6.0 2023-10-13 23:08:21,239 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1525001.3333333333, ans=0.0 2023-10-13 23:08:57,936 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1525141.3333333333, ans=0.2 2023-10-13 23:09:07,828 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1525188.0, ans=0.125 2023-10-13 23:09:19,779 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1525234.6666666667, ans=0.125 2023-10-13 23:09:20,775 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1525234.6666666667, ans=0.125 2023-10-13 23:09:39,666 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.523e+02 1.851e+02 1.996e+02 2.248e+02 4.535e+02, threshold=3.993e+02, percent-clipped=1.0 2023-10-13 23:09:41,748 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1525328.0, ans=0.2 2023-10-13 23:09:48,917 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=1525328.0, ans=0.05 2023-10-13 23:10:19,737 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1525468.0, ans=0.0 2023-10-13 23:10:50,802 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1525561.3333333333, ans=0.2 2023-10-13 23:10:51,885 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1525561.3333333333, ans=0.1 2023-10-13 23:10:52,998 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1525561.3333333333, ans=0.125 2023-10-13 23:11:00,814 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.32 vs. limit=22.5 2023-10-13 23:11:04,727 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.76 vs. limit=15.0 2023-10-13 23:11:48,388 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.514e+02 1.799e+02 1.946e+02 2.116e+02 3.019e+02, threshold=3.892e+02, percent-clipped=0.0 2023-10-13 23:11:50,832 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1525794.6666666667, ans=0.125 2023-10-13 23:12:03,720 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1525841.3333333333, ans=0.0 2023-10-13 23:12:17,743 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=12.57 vs. limit=15.0 2023-10-13 23:12:18,415 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.38 vs. limit=6.0 2023-10-13 23:12:26,898 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.53 vs. 
limit=12.0 2023-10-13 23:12:32,428 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1525934.6666666667, ans=0.2 2023-10-13 23:12:35,500 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.09 vs. limit=10.0 2023-10-13 23:12:43,955 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1525981.3333333333, ans=0.0 2023-10-13 23:12:54,765 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1526028.0, ans=0.125 2023-10-13 23:12:54,804 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1526028.0, ans=0.125 2023-10-13 23:13:18,989 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1526121.3333333333, ans=0.125 2023-10-13 23:13:24,015 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1526168.0, ans=0.0 2023-10-13 23:13:36,517 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1526214.6666666667, ans=0.0 2023-10-13 23:13:38,601 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1526214.6666666667, ans=0.07 2023-10-13 23:13:48,503 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.568e+02 1.829e+02 2.058e+02 2.276e+02 3.037e+02, threshold=4.115e+02, percent-clipped=0.0 2023-10-13 23:13:52,591 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1526261.3333333333, ans=0.0 2023-10-13 23:13:56,992 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1526261.3333333333, ans=0.2 2023-10-13 23:13:59,259 INFO [train.py:1031] (3/4) Epoch 24, batch 13000, loss[loss=0.1815, simple_loss=0.2803, pruned_loss=0.04135, over 16870.00 frames. ], tot_loss[loss=0.1873, simple_loss=0.2798, pruned_loss=0.04741, over 32757493.99 frames. ], batch size: 93, lr: 1.42e-03, grad_scale: 16.0 2023-10-13 23:13:59,505 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1526308.0, ans=0.125 2023-10-13 23:14:08,109 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.66 vs. 
limit=15.0 2023-10-13 23:14:38,739 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1526401.3333333333, ans=0.0 2023-10-13 23:14:40,790 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1526401.3333333333, ans=0.125 2023-10-13 23:15:19,526 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1526541.3333333333, ans=0.0 2023-10-13 23:15:37,042 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1526634.6666666667, ans=0.125 2023-10-13 23:15:40,544 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1526634.6666666667, ans=0.1 2023-10-13 23:15:58,236 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1526681.3333333333, ans=0.1 2023-10-13 23:16:02,113 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.30 vs. limit=15.0 2023-10-13 23:16:03,330 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.589e+02 1.815e+02 2.009e+02 2.199e+02 3.102e+02, threshold=4.018e+02, percent-clipped=0.0 2023-10-13 23:16:08,078 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1526728.0, ans=0.0 2023-10-13 23:16:08,169 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1526728.0, ans=0.125 2023-10-13 23:16:17,976 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1526774.6666666667, ans=0.0 2023-10-13 23:16:26,019 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1526821.3333333333, ans=0.125 2023-10-13 23:16:37,497 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1526868.0, ans=0.125 2023-10-13 23:16:55,090 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.39 vs. limit=15.0 2023-10-13 23:17:03,632 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1526961.3333333333, ans=0.125 2023-10-13 23:17:09,965 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1526961.3333333333, ans=0.125 2023-10-13 23:17:28,725 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1527054.6666666667, ans=0.125 2023-10-13 23:17:46,238 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.61 vs. 
limit=15.0 2023-10-13 23:18:01,829 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1527148.0, ans=0.125 2023-10-13 23:18:06,269 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1527148.0, ans=0.0 2023-10-13 23:18:08,212 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff2.min_abs, batch_count=1527194.6666666667, ans=0.1 2023-10-13 23:18:08,295 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1527194.6666666667, ans=0.2 2023-10-13 23:18:09,868 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.423e+02 1.769e+02 1.979e+02 2.273e+02 3.106e+02, threshold=3.958e+02, percent-clipped=0.0 2023-10-13 23:18:16,545 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1527194.6666666667, ans=0.125 2023-10-13 23:18:17,767 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.71 vs. limit=10.0 2023-10-13 23:18:19,563 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.58 vs. limit=15.0 2023-10-13 23:18:31,911 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.24 vs. limit=15.0 2023-10-13 23:19:02,868 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1527381.3333333333, ans=0.125 2023-10-13 23:19:06,811 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1527428.0, ans=0.125 2023-10-13 23:19:13,388 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1527428.0, ans=0.125 2023-10-13 23:19:16,027 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1527428.0, ans=0.125 2023-10-13 23:19:17,593 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1527428.0, ans=0.125 2023-10-13 23:19:17,740 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1527428.0, ans=0.125 2023-10-13 23:19:18,024 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.99 vs. 
limit=15.0 2023-10-13 23:19:39,684 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1527521.3333333333, ans=0.0 2023-10-13 23:19:56,491 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1527614.6666666667, ans=0.1 2023-10-13 23:20:09,478 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.473e+02 1.830e+02 1.986e+02 2.279e+02 2.838e+02, threshold=3.972e+02, percent-clipped=0.0 2023-10-13 23:20:16,285 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=1527661.3333333333, ans=10.0 2023-10-13 23:20:20,209 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1527708.0, ans=0.0 2023-10-13 23:20:24,118 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1527708.0, ans=0.0 2023-10-13 23:20:36,860 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1527754.6666666667, ans=0.0 2023-10-13 23:20:42,637 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1527801.3333333333, ans=0.0 2023-10-13 23:20:46,590 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1527801.3333333333, ans=0.125 2023-10-13 23:20:50,725 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1527801.3333333333, ans=0.125 2023-10-13 23:20:50,822 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1527801.3333333333, ans=0.2 2023-10-13 23:21:24,466 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1527941.3333333333, ans=0.2 2023-10-13 23:21:26,511 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1527941.3333333333, ans=0.125 2023-10-13 23:21:29,213 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1527988.0, ans=0.0 2023-10-13 23:21:41,532 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1528034.6666666667, ans=0.5 2023-10-13 23:21:54,836 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1528081.3333333333, ans=0.035 2023-10-13 23:22:05,790 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1528128.0, ans=0.1 2023-10-13 23:22:08,523 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.511e+02 1.796e+02 1.955e+02 2.122e+02 6.659e+02, threshold=3.910e+02, percent-clipped=1.0 2023-10-13 23:22:10,040 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1528128.0, ans=0.04949747468305833 2023-10-13 23:22:40,704 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.75 vs. 
limit=6.0 2023-10-13 23:22:43,920 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1528268.0, ans=0.2 2023-10-13 23:22:54,109 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1528314.6666666667, ans=0.5 2023-10-13 23:23:01,101 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1528314.6666666667, ans=0.125 2023-10-13 23:23:02,276 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.72 vs. limit=22.5 2023-10-13 23:23:16,775 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1528408.0, ans=0.0 2023-10-13 23:23:26,606 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.86 vs. limit=15.0 2023-10-13 23:23:46,658 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1528501.3333333333, ans=0.2 2023-10-13 23:23:49,696 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1528501.3333333333, ans=0.1 2023-10-13 23:23:55,329 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1528548.0, ans=0.035 2023-10-13 23:24:05,146 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.557e+02 1.735e+02 1.892e+02 2.089e+02 3.838e+02, threshold=3.784e+02, percent-clipped=0.0 2023-10-13 23:24:15,518 INFO [train.py:1031] (3/4) Epoch 24, batch 13500, loss[loss=0.1872, simple_loss=0.2789, pruned_loss=0.04778, over 16846.00 frames. ], tot_loss[loss=0.1868, simple_loss=0.2791, pruned_loss=0.04724, over 32775965.53 frames. ], batch size: 155, lr: 1.42e-03, grad_scale: 16.0 2023-10-13 23:24:19,550 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1528641.3333333333, ans=0.0 2023-10-13 23:24:23,301 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1528641.3333333333, ans=0.0 2023-10-13 23:24:27,197 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=7.28 vs. limit=15.0 2023-10-13 23:24:52,148 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1528781.3333333333, ans=0.0 2023-10-13 23:24:53,446 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.10 vs. limit=22.5 2023-10-13 23:24:55,272 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.21 vs. 
limit=15.0 2023-10-13 23:25:03,364 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1528828.0, ans=0.125 2023-10-13 23:25:27,732 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1528921.3333333333, ans=0.125 2023-10-13 23:25:27,807 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1528921.3333333333, ans=0.125 2023-10-13 23:25:28,840 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1528921.3333333333, ans=0.0 2023-10-13 23:25:38,463 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1528968.0, ans=0.125 2023-10-13 23:25:45,806 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 23:25:47,154 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.37 vs. limit=15.0 2023-10-13 23:26:03,370 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.65 vs. limit=15.0 2023-10-13 23:26:03,710 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.566e+02 1.809e+02 1.977e+02 2.163e+02 3.315e+02, threshold=3.955e+02, percent-clipped=0.0 2023-10-13 23:26:16,308 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1529108.0, ans=0.1 2023-10-13 23:26:24,149 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1529154.6666666667, ans=0.2 2023-10-13 23:26:26,260 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1529154.6666666667, ans=0.0 2023-10-13 23:26:37,369 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 23:26:50,852 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1529248.0, ans=0.125 2023-10-13 23:26:57,287 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1529294.6666666667, ans=0.2 2023-10-13 23:27:07,418 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1529341.3333333333, ans=0.125 2023-10-13 23:27:09,468 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1529341.3333333333, ans=0.1 2023-10-13 23:27:42,146 INFO [train.py:1031] (3/4) Epoch 25, batch 0, loss[loss=0.159, simple_loss=0.2576, pruned_loss=0.03017, over 16813.00 frames. ], tot_loss[loss=0.159, simple_loss=0.2576, pruned_loss=0.03017, over 16813.00 frames. ], batch size: 98, lr: 1.39e-03, grad_scale: 32.0 2023-10-13 23:27:42,147 INFO [train.py:1054] (3/4) Computing validation loss 2023-10-13 23:27:51,630 INFO [train.py:1063] (3/4) Epoch 25, validation: loss=0.2131, simple_loss=0.2998, pruned_loss=0.06319, over 1020973.00 frames. 
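The train.py entries just above show the logging pattern used throughout this run: each report gives the loss components on the current batch plus a running tot_loss in which every batch is weighted by its frame count (the "over N frames" figure is the accumulated denominator). As a minimal sketch of that bookkeeping, assuming only what the message format shows, the snippet below keeps frame-weighted averages of loss, simple_loss and pruned_loss; the class and method names (LossTracker, update, summary) are hypothetical, not icefall's actual code.

    # Illustration of frame-weighted running losses, matching log lines such as
    # "loss[loss=0.1872, ... over 16867.00 frames.], tot_loss[... over 32719446.88 frames.]".
    # NOT the icefall implementation; all names here are invented for the sketch.
    from collections import defaultdict

    class LossTracker:
        """Frame-weighted running averages of several loss components."""

        def __init__(self) -> None:
            self.sums = defaultdict(float)  # component -> sum(value * frames)
            self.frames = 0.0               # total frames accumulated

        def update(self, losses: dict, num_frames: float) -> None:
            # A batch moves the running average in proportion to its frame count.
            for name, value in losses.items():
                self.sums[name] += value * num_frames
            self.frames += num_frames

        def summary(self) -> dict:
            return {name: s / self.frames for name, s in self.sums.items()}

    tracker = LossTracker()
    # Batch-level figures taken from the "Epoch 24, batch 12000/12500" lines above.
    tracker.update({"loss": 0.1872, "simple_loss": 0.2844, "pruned_loss": 0.045}, 16867.0)
    tracker.update({"loss": 0.204, "simple_loss": 0.2949, "pruned_loss": 0.0565}, 16953.0)
    print(tracker.summary(), "over", tracker.frames, "frames")

The scaling.py ScheduledFloat entries that dominate the log read the same way everywhere: ans=... is the value a scheduled hyperparameter takes at the reported batch_count. In icefall these schedules are piecewise-linear in the batch count (defined in scaling.py); below is a compact sketch of that interpolation, with the breakpoints invented for illustration, since only the batch_count -> ans behaviour is visible in the log.

    # Piecewise-linear schedule over batch_count, as a ScheduledFloat-style sketch.
    # Breakpoint values here are made up; the real schedules live in scaling.py.
    import bisect

    class ScheduledFloatSketch:
        def __init__(self, *points: tuple) -> None:
            # points: (batch_count, value) pairs in increasing batch_count order.
            self.xs = [p[0] for p in points]
            self.ys = [p[1] for p in points]

        def value(self, batch_count: float) -> float:
            i = bisect.bisect_right(self.xs, batch_count)
            if i == 0:
                return self.ys[0]           # before the first breakpoint
            if i == len(self.xs):
                return self.ys[-1]          # after the last breakpoint
            x0, x1 = self.xs[i - 1], self.xs[i]
            y0, y1 = self.ys[i - 1], self.ys[i]
            return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

    dropout = ScheduledFloatSketch((0.0, 0.3), (20000.0, 0.1))
    print(dropout.value(1520241.33))  # past the last breakpoint -> 0.1

The remaining message types can be read off their format in the same spirit, though their exact semantics live in icefall's scaling.py and optim.py: the Whitening lines report a measured whitening metric against the limit the constraint enforces ("metric=X vs. limit=Y"), and the optim.py lines appear to give a five-number summary (roughly min/25%/50%/75%/max) of recent gradient norms together with the clipping threshold in force and the percentage of recent updates that were clipped.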
2023-10-13 23:27:51,631 INFO [train.py:1064] (3/4) Maximum memory allocated so far is 16953MB 2023-10-13 23:28:28,149 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1529504.6666666667, ans=0.1 2023-10-13 23:28:32,579 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1529504.6666666667, ans=0.0 2023-10-13 23:28:36,186 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.465e+02 1.770e+02 1.984e+02 2.294e+02 3.341e+02, threshold=3.968e+02, percent-clipped=0.0 2023-10-13 23:28:42,358 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1529551.3333333333, ans=0.0 2023-10-13 23:28:42,735 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.59 vs. limit=15.0 2023-10-13 23:29:06,276 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1529644.6666666667, ans=0.0 2023-10-13 23:29:22,036 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1529691.3333333333, ans=0.0 2023-10-13 23:29:38,652 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.18 vs. limit=12.0 2023-10-13 23:30:15,938 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1529924.6666666667, ans=0.0 2023-10-13 23:30:33,629 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.508e+02 1.761e+02 1.852e+02 2.014e+02 2.887e+02, threshold=3.703e+02, percent-clipped=0.0 2023-10-13 23:30:38,724 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=4.62 vs. limit=15.0 2023-10-13 23:30:48,043 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.13 vs. limit=12.0 2023-10-13 23:31:07,774 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1530111.3333333333, ans=0.09899494936611666 2023-10-13 23:31:13,097 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1530158.0, ans=0.0 2023-10-13 23:31:16,097 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1530158.0, ans=0.0 2023-10-13 23:31:31,560 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1530204.6666666667, ans=0.0 2023-10-13 23:31:44,970 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.34 vs. 
limit=6.0 2023-10-13 23:32:08,165 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1530391.3333333333, ans=0.05 2023-10-13 23:32:19,618 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1530438.0, ans=0.0 2023-10-13 23:32:20,560 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1530438.0, ans=0.125 2023-10-13 23:32:20,728 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1530438.0, ans=0.125 2023-10-13 23:32:20,767 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1530438.0, ans=0.2 2023-10-13 23:32:25,122 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.591e+02 1.826e+02 1.997e+02 2.261e+02 3.100e+02, threshold=3.993e+02, percent-clipped=0.0 2023-10-13 23:32:44,269 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.71 vs. limit=15.0 2023-10-13 23:32:55,550 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 23:33:01,246 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=1530578.0, ans=15.0 2023-10-13 23:33:11,877 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1530624.6666666667, ans=0.1 2023-10-13 23:33:27,751 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=10.74 vs. limit=22.5 2023-10-13 23:33:34,534 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1530718.0, ans=0.125 2023-10-13 23:33:37,229 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1530718.0, ans=0.035 2023-10-13 23:33:56,899 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=11.71 vs. 
limit=22.5 2023-10-13 23:34:01,020 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1530811.3333333333, ans=0.125 2023-10-13 23:34:04,813 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1530811.3333333333, ans=0.0 2023-10-13 23:34:20,504 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1530904.6666666667, ans=0.05 2023-10-13 23:34:22,363 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1530904.6666666667, ans=0.0 2023-10-13 23:34:24,855 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1530904.6666666667, ans=0.95 2023-10-13 23:34:25,874 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1530904.6666666667, ans=0.05 2023-10-13 23:34:27,432 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.423e+02 1.817e+02 1.969e+02 2.186e+02 3.001e+02, threshold=3.938e+02, percent-clipped=0.0 2023-10-13 23:34:30,930 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.58 vs. limit=10.0 2023-10-13 23:34:35,561 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.27 vs. limit=22.5 2023-10-13 23:34:39,029 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1530951.3333333333, ans=0.07 2023-10-13 23:34:46,619 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1530998.0, ans=0.2 2023-10-13 23:34:47,648 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1530998.0, ans=0.125 2023-10-13 23:35:12,964 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1531091.3333333333, ans=0.0 2023-10-13 23:35:22,802 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1531138.0, ans=0.09899494936611666 2023-10-13 23:35:32,584 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1531184.6666666667, ans=0.125 2023-10-13 23:35:35,217 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.58 vs. 
limit=15.0 2023-10-13 23:35:44,244 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1531231.3333333333, ans=0.125 2023-10-13 23:36:06,598 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1531324.6666666667, ans=0.0 2023-10-13 23:36:09,690 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1531324.6666666667, ans=0.0 2023-10-13 23:36:13,493 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1531371.3333333333, ans=0.125 2023-10-13 23:36:14,238 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1531371.3333333333, ans=0.07 2023-10-13 23:36:21,138 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.587e+02 1.855e+02 2.081e+02 2.292e+02 3.776e+02, threshold=4.162e+02, percent-clipped=0.0 2023-10-13 23:36:21,523 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1531371.3333333333, ans=0.125 2023-10-13 23:36:49,667 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=5.62 vs. limit=15.0 2023-10-13 23:36:58,399 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.83 vs. limit=15.0 2023-10-13 23:37:18,401 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.58 vs. limit=15.0 2023-10-13 23:37:24,748 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1531604.6666666667, ans=0.125 2023-10-13 23:37:27,578 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1531604.6666666667, ans=0.125 2023-10-13 23:37:28,706 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1531651.3333333333, ans=0.125 2023-10-13 23:37:39,761 INFO [train.py:1031] (3/4) Epoch 25, batch 500, loss[loss=0.1714, simple_loss=0.2655, pruned_loss=0.03861, over 16685.00 frames. ], tot_loss[loss=0.1879, simple_loss=0.2799, pruned_loss=0.04795, over 7259365.61 frames. ], batch size: 202, lr: 1.39e-03, grad_scale: 16.0 2023-10-13 23:37:46,811 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.92 vs. 
limit=15.0 2023-10-13 23:37:52,727 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1531744.6666666667, ans=0.125 2023-10-13 23:38:18,023 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1531838.0, ans=0.125 2023-10-13 23:38:21,787 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.537e+02 1.814e+02 1.952e+02 2.226e+02 3.293e+02, threshold=3.904e+02, percent-clipped=0.0 2023-10-13 23:38:42,977 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1531931.3333333333, ans=0.125 2023-10-13 23:38:56,813 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1531978.0, ans=0.1 2023-10-13 23:39:14,386 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1532024.6666666667, ans=0.0 2023-10-13 23:39:25,803 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1532071.3333333333, ans=0.125 2023-10-13 23:39:34,571 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1532118.0, ans=0.125 2023-10-13 23:39:35,516 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1532118.0, ans=10.0 2023-10-13 23:40:00,009 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1532211.3333333333, ans=0.1 2023-10-13 23:40:10,022 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1532258.0, ans=0.125 2023-10-13 23:40:23,046 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.418e+02 1.878e+02 2.021e+02 2.199e+02 3.014e+02, threshold=4.042e+02, percent-clipped=0.0 2023-10-13 23:40:51,162 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1532444.6666666667, ans=0.125 2023-10-13 23:40:55,730 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1532444.6666666667, ans=0.1 2023-10-13 23:40:56,107 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.97 vs. 
limit=15.0 2023-10-13 23:41:05,211 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1532491.3333333333, ans=0.0 2023-10-13 23:41:16,963 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1532538.0, ans=0.0 2023-10-13 23:41:29,671 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1532584.6666666667, ans=0.125 2023-10-13 23:41:41,888 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1532631.3333333333, ans=0.2 2023-10-13 23:41:47,243 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1532631.3333333333, ans=0.0 2023-10-13 23:41:48,166 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 23:41:59,844 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1532724.6666666667, ans=0.0 2023-10-13 23:42:22,728 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.625e+02 1.865e+02 2.046e+02 2.235e+02 2.852e+02, threshold=4.091e+02, percent-clipped=0.0 2023-10-13 23:42:24,826 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1532818.0, ans=0.125 2023-10-13 23:42:39,849 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1532864.6666666667, ans=0.125 2023-10-13 23:42:59,562 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1532958.0, ans=0.125 2023-10-13 23:42:59,728 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.38 vs. limit=22.5 2023-10-13 23:43:02,181 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1532958.0, ans=0.125 2023-10-13 23:43:30,324 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=1533051.3333333333, ans=15.0 2023-10-13 23:43:38,100 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1533098.0, ans=0.1 2023-10-13 23:43:44,461 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.71 vs. limit=15.0 2023-10-13 23:44:01,772 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.37 vs. 
limit=22.5 2023-10-13 23:44:23,020 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1533238.0, ans=0.125 2023-10-13 23:44:26,364 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.438e+02 1.769e+02 1.917e+02 2.098e+02 2.654e+02, threshold=3.833e+02, percent-clipped=0.0 2023-10-13 23:44:39,078 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1533284.6666666667, ans=0.125 2023-10-13 23:44:47,395 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.30 vs. limit=12.0 2023-10-13 23:44:48,664 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1533331.3333333333, ans=0.1 2023-10-13 23:44:59,713 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=15.11 vs. limit=15.0 2023-10-13 23:45:00,962 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1533378.0, ans=0.0 2023-10-13 23:45:01,035 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1533378.0, ans=0.0 2023-10-13 23:45:01,063 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1533378.0, ans=0.125 2023-10-13 23:45:06,349 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1533424.6666666667, ans=0.125 2023-10-13 23:45:06,546 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.12 vs. limit=15.0 2023-10-13 23:45:11,857 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.42 vs. limit=6.0 2023-10-13 23:45:19,407 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1533471.3333333333, ans=0.0 2023-10-13 23:45:20,104 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1533471.3333333333, ans=0.1 2023-10-13 23:45:28,627 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1533518.0, ans=0.125 2023-10-13 23:45:42,531 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1533564.6666666667, ans=0.1 2023-10-13 23:45:43,325 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1533564.6666666667, ans=0.125 2023-10-13 23:45:48,034 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1533564.6666666667, ans=0.125 2023-10-13 23:45:51,752 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1533611.3333333333, ans=0.125 2023-10-13 23:45:52,258 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.26 vs. 
limit=15.0 2023-10-13 23:46:08,466 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1533658.0, ans=0.1 2023-10-13 23:46:13,226 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1533658.0, ans=0.125 2023-10-13 23:46:20,247 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1533704.6666666667, ans=0.1 2023-10-13 23:46:23,024 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1533704.6666666667, ans=0.0 2023-10-13 23:46:26,699 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.457e+02 1.779e+02 1.966e+02 2.153e+02 2.792e+02, threshold=3.931e+02, percent-clipped=0.0 2023-10-13 23:46:30,233 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1533751.3333333333, ans=0.0 2023-10-13 23:46:45,402 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1533798.0, ans=0.125 2023-10-13 23:46:46,796 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1533798.0, ans=0.125 2023-10-13 23:46:47,344 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.28 vs. limit=22.5 2023-10-13 23:46:47,979 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 23:47:05,916 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1533891.3333333333, ans=0.125 2023-10-13 23:47:07,399 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1533891.3333333333, ans=0.07 2023-10-13 23:47:07,475 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1533891.3333333333, ans=0.125 2023-10-13 23:47:14,876 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1533938.0, ans=0.0 2023-10-13 23:47:16,606 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1533938.0, ans=0.1 2023-10-13 23:47:38,179 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1534031.3333333333, ans=0.0 2023-10-13 23:47:38,730 INFO [train.py:1031] (3/4) Epoch 25, batch 1000, loss[loss=0.1809, simple_loss=0.2814, pruned_loss=0.0402, over 16915.00 frames. ], tot_loss[loss=0.1886, simple_loss=0.2804, pruned_loss=0.04837, over 12907138.90 frames. ], batch size: 93, lr: 1.39e-03, grad_scale: 16.0 2023-10-13 23:47:49,958 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.30 vs. 
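The train.py:1031 line near the end of this stretch prints three losses per batch: simple_loss (a cheap linear-joiner transducer loss), pruned_loss (the pruned RNN-T loss evaluated on a narrow band of the lattice), and their combination. The printed numbers are consistent with a fixed linear blend, loss = 0.5 * simple_loss + pruned_loss (0.5 * 0.2804 + 0.04837 = 0.1886); the 0.5 weight here is inferred from the logged values themselves, not read out of train.py.

def combine_losses(simple_loss: float, pruned_loss: float,
                   simple_loss_scale: float = 0.5) -> float:
    """Weighted total matching the logged 'loss' field (assumed form)."""
    return simple_loss_scale * simple_loss + pruned_loss

# Reproduces the tot_loss entry logged at epoch 25, batch 1000 above.
assert abs(combine_losses(0.2804, 0.04837) - 0.1886) < 5e-4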
limit=15.0 2023-10-13 23:47:52,286 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=1534078.0, ans=22.5 2023-10-13 23:47:53,881 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1534078.0, ans=0.125 2023-10-13 23:47:55,659 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1534078.0, ans=0.0 2023-10-13 23:47:56,771 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1534078.0, ans=0.125 2023-10-13 23:48:03,129 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1534124.6666666667, ans=0.0 2023-10-13 23:48:17,658 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1534171.3333333333, ans=0.125 2023-10-13 23:48:20,769 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.71 vs. limit=15.0 2023-10-13 23:48:21,133 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.433e+02 1.711e+02 1.879e+02 2.079e+02 2.784e+02, threshold=3.757e+02, percent-clipped=0.0 2023-10-13 23:48:37,313 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.96 vs. limit=6.0 2023-10-13 23:48:43,105 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1534264.6666666667, ans=0.125 2023-10-13 23:48:49,771 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.99 vs. limit=15.0 2023-10-13 23:49:06,636 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1534404.6666666667, ans=0.125 2023-10-13 23:50:18,852 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.05 vs. limit=10.0 2023-10-13 23:50:19,118 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.540e+02 1.776e+02 1.889e+02 2.086e+02 2.979e+02, threshold=3.778e+02, percent-clipped=0.0 2023-10-13 23:50:26,971 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1534684.6666666667, ans=0.015 2023-10-13 23:50:30,026 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1534684.6666666667, ans=0.125 2023-10-13 23:50:30,033 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1534684.6666666667, ans=0.125 2023-10-13 23:50:42,790 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.66 vs. limit=15.0 2023-10-13 23:50:44,779 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=7.94 vs. 
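Each optim.py:471 record summarises the optimiser's recent gradient norms as five quantiles (min, 25%, median, 75%, max) together with the clipping threshold and the fraction of recent batches that were clipped. Throughout this log the threshold tracks Clipping_scale times the printed median (for example 2.0 x 1.889e+02 = 3.778e+02 above), so the clip level adapts to the run's own gradient statistics instead of being a fixed constant. A minimal sketch of that scheme, assuming a sliding window of recent global norms; the window size is illustrative and the real optim.py bookkeeping is more elaborate.

from collections import deque
import torch

class MedianGradClipper:
    """Clip the global grad norm to clipping_scale * median of recent
    norms (sketch of the behaviour reported by optim.py:471)."""
    def __init__(self, clipping_scale: float = 2.0, window: int = 1000):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=window)

    def __call__(self, parameters) -> float:
        params = [p for p in parameters if p.grad is not None]
        if not params:
            return 0.0
        norm = torch.norm(torch.stack([p.grad.norm() for p in params])).item()
        self.norms.append(norm)
        threshold = self.clipping_scale * torch.tensor(list(self.norms)).median().item()
        if norm > threshold:
            for p in params:  # this batch would count toward percent-clipped
                p.grad.mul_(threshold / norm)
        return norm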
limit=12.0 2023-10-13 23:50:47,431 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 23:51:15,394 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1534871.3333333333, ans=0.125 2023-10-13 23:51:42,874 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1534964.6666666667, ans=0.2 2023-10-13 23:52:00,374 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1535011.3333333333, ans=0.1 2023-10-13 23:52:07,226 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.25 vs. limit=15.0 2023-10-13 23:52:23,544 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1535104.6666666667, ans=0.09899494936611666 2023-10-13 23:52:27,794 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.396e+02 1.775e+02 2.001e+02 2.469e+02 4.532e+02, threshold=4.002e+02, percent-clipped=2.0 2023-10-13 23:52:36,514 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1535151.3333333333, ans=0.0 2023-10-13 23:52:41,381 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1535198.0, ans=0.125 2023-10-13 23:52:49,108 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1535198.0, ans=0.1 2023-10-13 23:52:49,197 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.91 vs. limit=10.0 2023-10-13 23:53:07,361 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1535291.3333333333, ans=0.1 2023-10-13 23:53:08,417 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1535291.3333333333, ans=0.0 2023-10-13 23:53:29,175 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1535384.6666666667, ans=0.1 2023-10-13 23:53:49,656 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1535478.0, ans=0.125 2023-10-13 23:53:52,481 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1535478.0, ans=0.1 2023-10-13 23:54:16,977 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 23:54:23,960 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.487e+02 1.789e+02 1.944e+02 2.132e+02 3.188e+02, threshold=3.888e+02, percent-clipped=0.0 2023-10-13 23:54:31,855 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1535618.0, ans=0.125 2023-10-13 23:54:32,117 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.27 vs. 
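The balancer entries scattered through these records (balancer1.prob, nonlin_attention.balancer.min_positive, conv_module2.balancer2.min_abs, max_positive=0.95, and so on) belong to activation balancers: modules that monitor per-channel statistics such as the fraction of positive activations and the mean absolute value, and push gradients back toward the configured band when a channel drifts out of it, doing so only with probability prob (the ubiquitous ans=0.125) to keep the overhead small. A rough sketch of the statistics side only, with band constants taken from values visible in the log; the gradient-modification half in scaling.py is considerably more involved.

import torch

def balancer_violations(x: torch.Tensor,
                        min_positive: float = 0.05,
                        max_positive: float = 0.95,
                        min_abs: float = 0.2,
                        max_abs: float = 10.0) -> torch.Tensor:
    """x: (..., num_channels). Boolean mask of channels whose statistics
    fall outside the target band (the condition a balancer corrects)."""
    flat = x.reshape(-1, x.shape[-1])
    frac_positive = (flat > 0).float().mean(dim=0)
    mean_abs = flat.abs().mean(dim=0)
    return ((frac_positive < min_positive) | (frac_positive > max_positive) |
            (mean_abs < min_abs) | (mean_abs > max_abs))

x = torch.randn(16, 100, 256)
print(int(balancer_violations(x).sum()), "of", x.shape[-1], "channels out of band")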
limit=12.0 2023-10-13 23:54:34,406 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1535618.0, ans=0.125 2023-10-13 23:54:43,262 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1535664.6666666667, ans=0.125 2023-10-13 23:54:44,830 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.77 vs. limit=6.0 2023-10-13 23:54:55,484 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1535711.3333333333, ans=0.125 2023-10-13 23:54:55,813 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=8.34 vs. limit=12.0 2023-10-13 23:55:00,250 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1535758.0, ans=0.1 2023-10-13 23:55:08,751 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.74 vs. limit=5.0 2023-10-13 23:55:20,873 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1535804.6666666667, ans=0.125 2023-10-13 23:55:32,314 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1535851.3333333333, ans=0.0 2023-10-13 23:55:35,206 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1535898.0, ans=0.125 2023-10-13 23:55:48,842 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.11 vs. limit=22.5 2023-10-13 23:55:57,326 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=1535944.6666666667, ans=15.0 2023-10-13 23:56:03,383 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.12 vs. limit=6.0 2023-10-13 23:56:22,120 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.445e+02 1.763e+02 1.946e+02 2.213e+02 3.108e+02, threshold=3.892e+02, percent-clipped=0.0 2023-10-13 23:56:56,809 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.05 vs. limit=15.0 2023-10-13 23:56:58,653 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1536224.6666666667, ans=0.1 2023-10-13 23:57:38,177 INFO [train.py:1031] (3/4) Epoch 25, batch 1500, loss[loss=0.1842, simple_loss=0.2762, pruned_loss=0.04615, over 16805.00 frames. ], tot_loss[loss=0.1869, simple_loss=0.2788, pruned_loss=0.04745, over 17313248.00 frames. ], batch size: 188, lr: 1.39e-03, grad_scale: 32.0 2023-10-13 23:57:42,753 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.89 vs. limit=15.0 2023-10-13 23:58:13,101 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.38 vs. 
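The grad_scale field on the train.py:1031 lines (16.0 at batch 1000, 32.0 by batch 1500) is the dynamic loss-scaling factor used for fp16 training: the scaler multiplies the loss before backward, doubles the factor after a run of overflow-free steps, and halves it when inf/nan gradients appear. A sketch of the standard PyTorch pattern that yields such a value; model, optimizer, loss_fn and the init_scale are placeholders, not taken from train.py.

import torch

scaler = torch.cuda.amp.GradScaler(init_scale=16.0, growth_factor=2.0,
                                   growth_interval=2000)

def training_step(model, optimizer, batch, loss_fn):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = loss_fn(model(batch))
    scaler.scale(loss).backward()  # backward on the scaled loss
    scaler.step(optimizer)         # unscales grads, skips step on inf/nan
    scaler.update()                # grows or shrinks the scale factor
    return scaler.get_scale()      # the value logged as grad_scale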
limit=22.5 2023-10-13 23:58:22,914 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.453e+02 1.783e+02 1.904e+02 2.076e+02 2.792e+02, threshold=3.808e+02, percent-clipped=0.0 2023-10-13 23:58:26,644 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1536551.3333333333, ans=0.125 2023-10-13 23:58:30,966 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1536551.3333333333, ans=0.125 2023-10-13 23:58:51,983 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1536644.6666666667, ans=0.125 2023-10-13 23:59:11,861 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1536691.3333333333, ans=0.0 2023-10-13 23:59:11,877 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1536691.3333333333, ans=0.125 2023-10-13 23:59:13,042 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1536738.0, ans=0.0 2023-10-13 23:59:18,584 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1536738.0, ans=0.125 2023-10-13 23:59:20,024 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2.whitening_limit, batch_count=1536738.0, ans=15.0 2023-10-13 23:59:26,734 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1536784.6666666667, ans=0.125 2023-10-13 23:59:31,741 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1536784.6666666667, ans=0.125 2023-10-13 23:59:39,566 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.65 vs. limit=15.0 2023-10-13 23:59:51,318 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1536878.0, ans=0.1 2023-10-14 00:00:04,463 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1536924.6666666667, ans=0.2 2023-10-14 00:00:08,653 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1536971.3333333333, ans=0.125 2023-10-14 00:00:22,931 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1536971.3333333333, ans=0.015 2023-10-14 00:00:25,040 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.499e+02 1.773e+02 1.945e+02 2.171e+02 3.592e+02, threshold=3.890e+02, percent-clipped=0.0 2023-10-14 00:00:36,078 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1537018.0, ans=0.1 2023-10-14 00:01:08,188 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1537158.0, ans=0.05 2023-10-14 00:01:11,355 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.12 vs. 
limit=15.0 2023-10-14 00:01:23,266 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1537204.6666666667, ans=0.125 2023-10-14 00:01:35,247 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.73 vs. limit=15.0 2023-10-14 00:01:46,624 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1537298.0, ans=0.0 2023-10-14 00:01:46,652 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1537298.0, ans=0.1 2023-10-14 00:01:50,083 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1537298.0, ans=0.125 2023-10-14 00:01:50,837 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1537298.0, ans=0.125 2023-10-14 00:01:52,492 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1537344.6666666667, ans=0.0 2023-10-14 00:01:53,435 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1537344.6666666667, ans=0.0 2023-10-14 00:01:56,055 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1537344.6666666667, ans=0.125 2023-10-14 00:02:01,046 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1537344.6666666667, ans=0.125 2023-10-14 00:02:04,665 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1537391.3333333333, ans=0.0 2023-10-14 00:02:27,489 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.517e+02 1.895e+02 2.057e+02 2.323e+02 3.446e+02, threshold=4.114e+02, percent-clipped=0.0 2023-10-14 00:02:37,091 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1537531.3333333333, ans=0.125 2023-10-14 00:02:40,088 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.03 vs. limit=22.5 2023-10-14 00:02:46,189 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1537531.3333333333, ans=0.125 2023-10-14 00:02:50,361 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1537578.0, ans=0.0 2023-10-14 00:03:15,357 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.72 vs. limit=15.0 2023-10-14 00:03:38,896 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.20 vs. 
limit=15.0 2023-10-14 00:03:56,040 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1537811.3333333333, ans=0.125 2023-10-14 00:04:08,353 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1537858.0, ans=0.0 2023-10-14 00:04:08,507 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1537858.0, ans=0.1 2023-10-14 00:04:10,357 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1537858.0, ans=0.1 2023-10-14 00:04:12,456 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1537858.0, ans=0.0 2023-10-14 00:04:27,326 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.484e+02 1.743e+02 1.876e+02 2.012e+02 2.826e+02, threshold=3.753e+02, percent-clipped=0.0 2023-10-14 00:04:44,776 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.27 vs. limit=15.0 2023-10-14 00:04:44,843 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.67 vs. limit=15.0 2023-10-14 00:05:39,428 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1538231.3333333333, ans=0.125 2023-10-14 00:05:50,733 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.79 vs. limit=6.0 2023-10-14 00:06:00,355 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=1538278.0, ans=0.07 2023-10-14 00:06:07,957 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1538324.6666666667, ans=0.125 2023-10-14 00:06:11,158 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1538371.3333333333, ans=0.1 2023-10-14 00:06:29,709 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.575e+02 1.812e+02 1.977e+02 2.245e+02 3.927e+02, threshold=3.953e+02, percent-clipped=1.0 2023-10-14 00:06:35,204 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1538418.0, ans=0.125 2023-10-14 00:06:38,305 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1538418.0, ans=0.125 2023-10-14 00:07:24,538 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.04 vs. limit=22.5 2023-10-14 00:07:25,541 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1538604.6666666667, ans=0.125 2023-10-14 00:07:31,659 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.87 vs. 
limit=15.0 2023-10-14 00:07:34,853 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1538651.3333333333, ans=0.125 2023-10-14 00:07:49,197 INFO [train.py:1031] (3/4) Epoch 25, batch 2000, loss[loss=0.1871, simple_loss=0.2846, pruned_loss=0.04475, over 16671.00 frames. ], tot_loss[loss=0.1871, simple_loss=0.2793, pruned_loss=0.04749, over 20738982.36 frames. ], batch size: 202, lr: 1.38e-03, grad_scale: 32.0 2023-10-14 00:07:53,699 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1538698.0, ans=0.07 2023-10-14 00:08:31,956 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1538838.0, ans=0.5 2023-10-14 00:08:49,316 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.578e+02 1.798e+02 1.934e+02 2.110e+02 2.952e+02, threshold=3.867e+02, percent-clipped=0.0 2023-10-14 00:08:49,626 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1538884.6666666667, ans=0.0 2023-10-14 00:08:54,065 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=1538884.6666666667, ans=0.05 2023-10-14 00:09:13,354 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1538978.0, ans=0.125 2023-10-14 00:09:23,434 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1538978.0, ans=0.0 2023-10-14 00:09:28,456 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.67 vs. limit=15.0 2023-10-14 00:09:59,369 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.69 vs. limit=15.0 2023-10-14 00:10:58,104 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1539258.0, ans=0.125 2023-10-14 00:10:59,113 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 00:11:20,820 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.439e+02 1.757e+02 2.000e+02 2.181e+02 3.208e+02, threshold=4.001e+02, percent-clipped=0.0 2023-10-14 00:11:22,760 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1539351.3333333333, ans=0.125 2023-10-14 00:11:28,769 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1539351.3333333333, ans=0.1 2023-10-14 00:11:38,436 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.46 vs. 
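On the same train.py:1031 lines, loss[...] describes only the current batch while tot_loss[...] averages over a growing pool of frames (12.9M at batch 1000, 20.7M by batch 2000 above), i.e. each batch is folded in weighted by its frame count. The fractional frame totals (e.g. 20738982.36) suggest older batches are decayed rather than summed outright; the decay below is an assumption made to reproduce that behaviour, not a constant from train.py.

class RunningLoss:
    """Frame-weighted running average with exponential forgetting (sketch).

    decay=1.0 would give a plain cumulative average with an integer frame
    total; decay<1.0 yields fractional totals like those in the log."""
    def __init__(self, decay: float = 0.999):
        self.decay = decay
        self.loss_sum = 0.0
        self.frames = 0.0

    def update(self, batch_loss: float, batch_frames: float) -> float:
        self.loss_sum = self.decay * self.loss_sum + batch_loss * batch_frames
        self.frames = self.decay * self.frames + batch_frames
        return self.loss_sum / self.frames  # the value logged as tot_loss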
limit=10.0 2023-10-14 00:11:42,847 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1539398.0, ans=0.125 2023-10-14 00:11:46,383 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1539444.6666666667, ans=0.1 2023-10-14 00:11:56,763 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1539444.6666666667, ans=0.0 2023-10-14 00:11:56,935 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1539444.6666666667, ans=10.0 2023-10-14 00:12:08,597 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1539491.3333333333, ans=0.0 2023-10-14 00:12:09,735 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1539491.3333333333, ans=0.125 2023-10-14 00:12:26,485 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.81 vs. limit=15.0 2023-10-14 00:12:28,789 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1539584.6666666667, ans=0.125 2023-10-14 00:12:30,589 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1539584.6666666667, ans=0.95 2023-10-14 00:12:45,479 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.49 vs. limit=12.0 2023-10-14 00:12:49,942 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1539678.0, ans=0.125 2023-10-14 00:12:50,206 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.78 vs. limit=15.0 2023-10-14 00:13:20,774 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.532e+02 1.797e+02 1.953e+02 2.198e+02 2.698e+02, threshold=3.906e+02, percent-clipped=0.0 2023-10-14 00:13:28,202 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.63 vs. limit=6.0 2023-10-14 00:13:44,159 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1539911.3333333333, ans=0.0 2023-10-14 00:14:00,305 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.25 vs. limit=10.0 2023-10-14 00:14:15,210 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.26 vs. 
limit=15.0 2023-10-14 00:14:58,404 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1540191.3333333333, ans=0.125 2023-10-14 00:14:58,579 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1540191.3333333333, ans=0.07 2023-10-14 00:15:02,083 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1540191.3333333333, ans=0.1 2023-10-14 00:15:07,148 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1540238.0, ans=0.125 2023-10-14 00:15:07,317 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1540238.0, ans=0.125 2023-10-14 00:15:08,539 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1540238.0, ans=0.125 2023-10-14 00:15:14,153 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1540238.0, ans=0.0 2023-10-14 00:15:18,777 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.598e+02 1.863e+02 2.021e+02 2.260e+02 2.919e+02, threshold=4.043e+02, percent-clipped=0.0 2023-10-14 00:15:25,006 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1540284.6666666667, ans=0.1 2023-10-14 00:15:31,155 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1540331.3333333333, ans=0.1 2023-10-14 00:15:37,814 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1540331.3333333333, ans=0.125 2023-10-14 00:15:42,130 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1540378.0, ans=0.125 2023-10-14 00:15:58,334 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1540424.6666666667, ans=0.0 2023-10-14 00:15:59,526 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1540424.6666666667, ans=0.1 2023-10-14 00:15:59,538 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1540424.6666666667, ans=0.125 2023-10-14 00:16:34,874 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1540564.6666666667, ans=0.0 2023-10-14 00:16:51,224 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.01 vs. 
limit=15.0 2023-10-14 00:17:00,546 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1540704.6666666667, ans=0.0 2023-10-14 00:17:14,298 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.615e+02 1.887e+02 2.023e+02 2.268e+02 3.456e+02, threshold=4.046e+02, percent-clipped=0.0 2023-10-14 00:17:19,203 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1540751.3333333333, ans=0.125 2023-10-14 00:17:33,771 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.78 vs. limit=15.0 2023-10-14 00:17:47,184 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.31 vs. limit=15.0 2023-10-14 00:18:12,718 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=1540984.6666666667, ans=0.07 2023-10-14 00:18:22,043 INFO [train.py:1031] (3/4) Epoch 25, batch 2500, loss[loss=0.179, simple_loss=0.2721, pruned_loss=0.04295, over 16402.00 frames. ], tot_loss[loss=0.1873, simple_loss=0.2795, pruned_loss=0.04756, over 23418103.07 frames. ], batch size: 50, lr: 1.38e-03, grad_scale: 32.0 2023-10-14 00:18:24,838 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1541031.3333333333, ans=0.125 2023-10-14 00:18:34,398 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1541078.0, ans=0.2 2023-10-14 00:18:35,388 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1541078.0, ans=0.125 2023-10-14 00:18:38,219 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1541078.0, ans=0.125 2023-10-14 00:19:09,214 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.587e+02 1.818e+02 1.984e+02 2.207e+02 3.452e+02, threshold=3.968e+02, percent-clipped=0.0 2023-10-14 00:19:27,130 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn2.whiten.whitening_limit, batch_count=1541264.6666666667, ans=22.5 2023-10-14 00:19:34,788 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1541311.3333333333, ans=0.125 2023-10-14 00:19:38,313 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1541358.0, ans=0.1 2023-10-14 00:19:51,737 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1541404.6666666667, ans=0.125 2023-10-14 00:20:12,274 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.34 vs. limit=10.0 2023-10-14 00:20:36,378 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.25 vs. 
limit=15.0 2023-10-14 00:20:57,195 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1541684.6666666667, ans=0.1 2023-10-14 00:21:01,787 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.577e+02 1.793e+02 1.951e+02 2.216e+02 3.118e+02, threshold=3.901e+02, percent-clipped=0.0 2023-10-14 00:21:05,692 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1541684.6666666667, ans=0.0 2023-10-14 00:21:12,662 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1541731.3333333333, ans=0.2 2023-10-14 00:21:19,056 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1541731.3333333333, ans=0.2 2023-10-14 00:21:29,920 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1541778.0, ans=0.1 2023-10-14 00:21:32,879 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1541778.0, ans=0.125 2023-10-14 00:21:39,149 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1541824.6666666667, ans=0.1 2023-10-14 00:22:03,794 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1541918.0, ans=0.125 2023-10-14 00:22:10,957 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1541964.6666666667, ans=0.125 2023-10-14 00:22:15,399 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1541964.6666666667, ans=0.09899494936611666 2023-10-14 00:22:15,656 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=12.68 vs. limit=15.0 2023-10-14 00:22:38,239 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1542058.0, ans=0.1 2023-10-14 00:22:46,243 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1542104.6666666667, ans=0.0 2023-10-14 00:22:50,511 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.26 vs. 
limit=15.0 2023-10-14 00:23:02,240 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.573e+02 1.831e+02 2.019e+02 2.257e+02 3.678e+02, threshold=4.037e+02, percent-clipped=0.0 2023-10-14 00:23:05,976 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 00:23:42,233 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1542291.3333333333, ans=0.0 2023-10-14 00:24:25,221 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1542431.3333333333, ans=0.0 2023-10-14 00:24:27,953 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1542431.3333333333, ans=0.2 2023-10-14 00:24:40,794 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1542478.0, ans=0.125 2023-10-14 00:25:02,347 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1542571.3333333333, ans=0.125 2023-10-14 00:25:08,727 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1542571.3333333333, ans=0.2 2023-10-14 00:25:20,760 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.365e+02 1.692e+02 1.896e+02 2.139e+02 3.227e+02, threshold=3.792e+02, percent-clipped=0.0 2023-10-14 00:25:30,446 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.70 vs. limit=10.0 2023-10-14 00:25:32,727 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1542664.6666666667, ans=0.125 2023-10-14 00:25:33,041 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.76 vs. limit=15.0 2023-10-14 00:26:44,195 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1542898.0, ans=0.0 2023-10-14 00:26:49,640 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1542944.6666666667, ans=0.125 2023-10-14 00:26:57,517 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.81 vs. limit=15.0 2023-10-14 00:26:59,125 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=1542991.3333333333, ans=0.05 2023-10-14 00:26:59,189 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1542991.3333333333, ans=0.125 2023-10-14 00:27:04,318 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1542991.3333333333, ans=0.125 2023-10-14 00:27:12,311 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.20 vs. 
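The scaling.py:1069 WithLoss records report an auxiliary penalty attached to a submodule's output, here various self_attn_weights, along with its accumulated value; loss-sum=0.000e+00 throughout this stretch indicates the penalty is currently zero. A hedged sketch of such a wrapper, assuming the pending penalty is collected by the training loop and added to the main loss; the class body and the example penalty are illustrative, not the scaling.py code.

import torch
import torch.nn as nn

class WithLoss(nn.Module):
    """Wrap a module, compute a penalty on its output, and stash it for
    the training loop to add to the main loss (sketch)."""
    def __init__(self, module: nn.Module, penalty_fn, scale: float = 1.0):
        super().__init__()
        self.module = module
        self.penalty_fn = penalty_fn
        self.scale = scale
        self.loss_sum = 0.0  # accumulated penalty, as printed in the log
        self.pending = None  # penalty from the current forward pass

    def forward(self, *args, **kwargs):
        out = self.module(*args, **kwargs)
        if self.training:
            self.pending = self.scale * self.penalty_fn(out)
            self.loss_sum += float(self.pending.detach())
        return out

# Hypothetical penalty: attention rows should already sum to one, so the
# accumulated value stays at 0.0, matching the loss-sum above.
attn = WithLoss(nn.Softmax(dim=-1), lambda w: (w.sum(dim=-1) - 1.0).abs().mean())
attn(torch.randn(4, 10))
print(attn.loss_sum)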
limit=15.0 2023-10-14 00:27:16,361 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1543038.0, ans=0.125 2023-10-14 00:27:27,700 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.416e+02 1.816e+02 2.044e+02 2.232e+02 3.404e+02, threshold=4.088e+02, percent-clipped=0.0 2023-10-14 00:27:35,269 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1543084.6666666667, ans=0.1 2023-10-14 00:27:45,813 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1543131.3333333333, ans=0.1 2023-10-14 00:27:53,724 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1543178.0, ans=0.125 2023-10-14 00:28:31,837 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1543318.0, ans=0.125 2023-10-14 00:28:32,849 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff2.min_abs, batch_count=1543318.0, ans=0.1 2023-10-14 00:28:34,399 INFO [train.py:1031] (3/4) Epoch 25, batch 3000, loss[loss=0.1928, simple_loss=0.2831, pruned_loss=0.05124, over 16913.00 frames. ], tot_loss[loss=0.1871, simple_loss=0.2789, pruned_loss=0.04766, over 25501763.67 frames. ], batch size: 110, lr: 1.38e-03, grad_scale: 32.0 2023-10-14 00:28:36,624 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1543364.6666666667, ans=0.2 2023-10-14 00:28:44,142 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1543411.3333333333, ans=0.2 2023-10-14 00:28:46,143 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1543411.3333333333, ans=0.125 2023-10-14 00:28:50,161 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1543411.3333333333, ans=0.125 2023-10-14 00:28:51,680 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1543411.3333333333, ans=0.0 2023-10-14 00:29:08,992 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1543504.6666666667, ans=0.2 2023-10-14 00:29:13,475 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.10 vs. 
limit=15.0 2023-10-14 00:29:15,080 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1543504.6666666667, ans=0.0 2023-10-14 00:29:17,053 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=1543504.6666666667, ans=22.5 2023-10-14 00:29:21,000 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1543551.3333333333, ans=0.125 2023-10-14 00:29:24,388 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.538e+02 1.780e+02 1.981e+02 2.163e+02 2.755e+02, threshold=3.962e+02, percent-clipped=0.0 2023-10-14 00:29:29,309 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1543551.3333333333, ans=0.0 2023-10-14 00:29:31,276 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.83 vs. limit=15.0 2023-10-14 00:29:35,176 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.55 vs. limit=15.0 2023-10-14 00:29:48,550 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1543644.6666666667, ans=0.0 2023-10-14 00:29:52,802 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1543644.6666666667, ans=0.2 2023-10-14 00:30:25,576 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1543784.6666666667, ans=0.2 2023-10-14 00:30:43,445 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1543831.3333333333, ans=0.125 2023-10-14 00:30:52,219 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1543878.0, ans=0.125 2023-10-14 00:31:22,809 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.80 vs. limit=5.0 2023-10-14 00:31:27,688 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1544018.0, ans=0.0 2023-10-14 00:31:32,411 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.492e+02 1.796e+02 1.954e+02 2.120e+02 2.721e+02, threshold=3.909e+02, percent-clipped=0.0 2023-10-14 00:31:40,786 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.91 vs. limit=15.0 2023-10-14 00:31:45,532 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1544064.6666666667, ans=0.2 2023-10-14 00:31:46,889 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1544064.6666666667, ans=0.125 2023-10-14 00:31:57,613 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1544111.3333333333, ans=0.125 2023-10-14 00:32:14,366 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.82 vs. 
limit=15.0 2023-10-14 00:32:29,330 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.49 vs. limit=15.0 2023-10-14 00:32:29,809 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1544251.3333333333, ans=0.0 2023-10-14 00:32:54,299 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1544344.6666666667, ans=0.125 2023-10-14 00:33:11,778 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.68 vs. limit=15.0 2023-10-14 00:33:18,779 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=1544438.0, ans=15.0 2023-10-14 00:33:20,410 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1544438.0, ans=0.0 2023-10-14 00:33:31,797 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.42 vs. limit=12.0 2023-10-14 00:33:34,270 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1544484.6666666667, ans=0.0 2023-10-14 00:33:35,830 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.508e+02 1.785e+02 1.924e+02 2.119e+02 2.818e+02, threshold=3.847e+02, percent-clipped=0.0 2023-10-14 00:33:52,762 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=1544531.3333333333, ans=0.025 2023-10-14 00:33:56,194 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1544531.3333333333, ans=0.1 2023-10-14 00:34:12,666 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1544624.6666666667, ans=0.1 2023-10-14 00:34:17,141 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 00:34:26,567 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1544671.3333333333, ans=0.1 2023-10-14 00:34:29,327 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1544671.3333333333, ans=0.0 2023-10-14 00:34:40,300 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1544718.0, ans=0.2 2023-10-14 00:34:44,867 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_positive, batch_count=1544718.0, ans=0.05 2023-10-14 00:35:25,238 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1544904.6666666667, ans=0.125 2023-10-14 00:35:30,308 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1544904.6666666667, ans=0.0 2023-10-14 00:35:41,183 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.556e+02 1.813e+02 1.968e+02 2.203e+02 3.239e+02, threshold=3.936e+02, percent-clipped=0.0 2023-10-14 00:36:04,394 INFO [scaling.py:979] (3/4) Whitening: 
name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=5.58 vs. limit=15.0 2023-10-14 00:36:38,163 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=13.05 vs. limit=22.5 2023-10-14 00:36:46,983 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1545184.6666666667, ans=0.1 2023-10-14 00:36:58,352 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.02 vs. limit=15.0 2023-10-14 00:37:32,169 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 00:37:34,777 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1545371.3333333333, ans=0.125 2023-10-14 00:37:36,911 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.19 vs. limit=15.0 2023-10-14 00:37:41,470 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1545418.0, ans=0.125 2023-10-14 00:37:43,085 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.580e+02 1.782e+02 1.887e+02 2.042e+02 2.713e+02, threshold=3.774e+02, percent-clipped=0.0 2023-10-14 00:37:58,690 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1545464.6666666667, ans=0.2 2023-10-14 00:38:42,234 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1545651.3333333333, ans=0.125 2023-10-14 00:38:50,341 INFO [train.py:1031] (3/4) Epoch 25, batch 3500, loss[loss=0.1958, simple_loss=0.2872, pruned_loss=0.0522, over 16751.00 frames. ], tot_loss[loss=0.1869, simple_loss=0.2786, pruned_loss=0.04756, over 27118892.08 frames. ], batch size: 202, lr: 1.38e-03, grad_scale: 16.0 2023-10-14 00:39:09,320 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1545744.6666666667, ans=0.125 2023-10-14 00:39:10,385 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1545744.6666666667, ans=0.1 2023-10-14 00:39:11,265 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1545744.6666666667, ans=0.125 2023-10-14 00:39:29,781 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1545838.0, ans=0.125 2023-10-14 00:39:32,118 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.60 vs. limit=6.0 2023-10-14 00:39:36,822 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.25 vs. 
limit=15.0 2023-10-14 00:39:37,618 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1545884.6666666667, ans=0.1 2023-10-14 00:39:42,068 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.551e+02 1.855e+02 2.066e+02 2.343e+02 2.955e+02, threshold=4.131e+02, percent-clipped=0.0 2023-10-14 00:40:00,015 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.65 vs. limit=15.0 2023-10-14 00:40:06,510 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1545978.0, ans=0.1 2023-10-14 00:40:18,370 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.73 vs. limit=12.0 2023-10-14 00:40:29,837 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1546024.6666666667, ans=0.125 2023-10-14 00:40:31,722 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.63 vs. limit=22.5 2023-10-14 00:40:33,112 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1546071.3333333333, ans=0.125 2023-10-14 00:40:49,670 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1546118.0, ans=0.07 2023-10-14 00:40:54,420 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 00:41:00,703 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.69 vs. limit=22.5 2023-10-14 00:41:03,850 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1546164.6666666667, ans=0.125 2023-10-14 00:41:09,978 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1546164.6666666667, ans=10.0 2023-10-14 00:41:13,601 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1546164.6666666667, ans=0.0 2023-10-14 00:41:14,031 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.05 vs. 
limit=15.0 2023-10-14 00:41:30,548 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1546258.0, ans=0.125 2023-10-14 00:41:40,100 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1546258.0, ans=0.0 2023-10-14 00:41:41,266 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1546258.0, ans=0.125 2023-10-14 00:42:02,970 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.597e+02 1.887e+02 2.077e+02 2.363e+02 2.995e+02, threshold=4.153e+02, percent-clipped=0.0 2023-10-14 00:42:07,288 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=1546351.3333333333, ans=0.05 2023-10-14 00:42:25,843 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn1.whiten.whitening_limit, batch_count=1546444.6666666667, ans=22.5 2023-10-14 00:42:38,850 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=1546491.3333333333, ans=22.5 2023-10-14 00:42:50,671 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1546491.3333333333, ans=0.2 2023-10-14 00:43:14,192 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1546584.6666666667, ans=0.0 2023-10-14 00:43:27,643 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.45 vs. limit=22.5 2023-10-14 00:44:21,504 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.34 vs. limit=10.0 2023-10-14 00:44:22,238 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=1546771.3333333333, ans=10.0 2023-10-14 00:44:31,313 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1546771.3333333333, ans=0.125 2023-10-14 00:44:37,140 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1546818.0, ans=0.125 2023-10-14 00:44:43,281 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.508e+02 1.715e+02 1.840e+02 2.077e+02 2.935e+02, threshold=3.681e+02, percent-clipped=0.0 2023-10-14 00:45:14,588 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.04 vs. limit=15.0 2023-10-14 00:45:30,927 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.22 vs. limit=6.0 2023-10-14 00:45:35,471 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.17 vs. 
limit=15.0 2023-10-14 00:45:51,716 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1547004.6666666667, ans=0.0 2023-10-14 00:45:57,794 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1547051.3333333333, ans=0.125 2023-10-14 00:46:54,199 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1547191.3333333333, ans=0.5 2023-10-14 00:47:00,471 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1547238.0, ans=0.2 2023-10-14 00:47:16,535 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1547284.6666666667, ans=0.125 2023-10-14 00:47:17,268 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.539e+02 1.776e+02 1.879e+02 2.184e+02 2.734e+02, threshold=3.759e+02, percent-clipped=0.0 2023-10-14 00:47:30,039 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1547331.3333333333, ans=0.0 2023-10-14 00:48:13,858 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1547471.3333333333, ans=0.125 2023-10-14 00:48:15,594 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1547471.3333333333, ans=0.0 2023-10-14 00:48:21,249 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1547471.3333333333, ans=0.125 2023-10-14 00:48:27,870 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1547471.3333333333, ans=0.0 2023-10-14 00:48:35,449 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.whiten.whitening_limit, batch_count=1547518.0, ans=12.0 2023-10-14 00:48:39,893 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1547518.0, ans=0.0 2023-10-14 00:49:44,674 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1547751.3333333333, ans=0.125 2023-10-14 00:49:47,319 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.497e+02 1.771e+02 1.917e+02 2.194e+02 3.241e+02, threshold=3.833e+02, percent-clipped=0.0 2023-10-14 00:49:47,919 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=11.96 vs. limit=22.5 2023-10-14 00:49:52,620 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1547751.3333333333, ans=0.1 2023-10-14 00:49:59,452 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.34 vs. limit=12.0 2023-10-14 00:50:29,329 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.32 vs. limit=22.5 2023-10-14 00:50:30,286 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.49 vs. 
limit=15.0 2023-10-14 00:51:02,490 INFO [train.py:1031] (3/4) Epoch 25, batch 4000, loss[loss=0.1827, simple_loss=0.2785, pruned_loss=0.04347, over 16852.00 frames. ], tot_loss[loss=0.1868, simple_loss=0.2783, pruned_loss=0.04762, over 28372698.67 frames. ], batch size: 146, lr: 1.38e-03, grad_scale: 32.0 2023-10-14 00:51:34,114 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1548078.0, ans=0.2 2023-10-14 00:52:08,990 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1548218.0, ans=0.1 2023-10-14 00:52:13,882 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.532e+02 1.861e+02 2.042e+02 2.213e+02 2.987e+02, threshold=4.083e+02, percent-clipped=0.0 2023-10-14 00:52:20,791 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1548264.6666666667, ans=0.0 2023-10-14 00:52:22,870 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1548264.6666666667, ans=0.125 2023-10-14 00:52:25,026 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1548264.6666666667, ans=0.1 2023-10-14 00:53:04,983 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.71 vs. limit=15.0 2023-10-14 00:53:14,051 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1548404.6666666667, ans=0.0 2023-10-14 00:53:14,294 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=3.89 vs. 
limit=12.0 2023-10-14 00:53:17,384 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1548451.3333333333, ans=0.125 2023-10-14 00:53:20,564 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1548451.3333333333, ans=0.125 2023-10-14 00:53:42,606 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1548544.6666666667, ans=0.2 2023-10-14 00:53:59,336 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1548591.3333333333, ans=0.2 2023-10-14 00:54:38,576 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1548684.6666666667, ans=0.125 2023-10-14 00:54:43,173 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.536e+02 1.801e+02 1.971e+02 2.166e+02 3.356e+02, threshold=3.943e+02, percent-clipped=0.0 2023-10-14 00:54:46,032 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1548684.6666666667, ans=0.0 2023-10-14 00:54:52,306 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1548731.3333333333, ans=0.025 2023-10-14 00:55:09,604 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1548778.0, ans=0.125 2023-10-14 00:55:12,738 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1548778.0, ans=0.125 2023-10-14 00:55:30,156 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1548824.6666666667, ans=0.125 2023-10-14 00:55:33,737 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1548824.6666666667, ans=0.0 2023-10-14 00:55:34,093 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.61 vs. limit=22.5 2023-10-14 00:55:42,919 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1548871.3333333333, ans=0.125 2023-10-14 00:55:44,768 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.60 vs. limit=12.0 2023-10-14 00:55:54,788 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1548871.3333333333, ans=0.0 2023-10-14 00:55:58,040 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.41 vs. limit=10.0 2023-10-14 00:56:10,150 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.58 vs. 
limit=15.0 2023-10-14 00:57:44,752 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1549151.3333333333, ans=10.0 2023-10-14 00:57:49,011 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.468e+02 1.816e+02 2.035e+02 2.630e+02 4.385e+02, threshold=4.070e+02, percent-clipped=1.0 2023-10-14 00:58:01,547 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1549198.0, ans=0.125 2023-10-14 00:58:02,486 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1549198.0, ans=0.125 2023-10-14 00:58:23,523 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1549244.6666666667, ans=0.125 2023-10-14 00:58:55,782 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1549338.0, ans=0.0 2023-10-14 00:59:11,830 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.81 vs. limit=15.0 2023-10-14 00:59:23,983 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1549384.6666666667, ans=0.125 2023-10-14 00:59:37,326 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1549431.3333333333, ans=0.0 2023-10-14 00:59:54,277 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1549478.0, ans=0.0 2023-10-14 00:59:54,407 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1549478.0, ans=0.2 2023-10-14 01:00:07,117 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1549524.6666666667, ans=0.125 2023-10-14 01:00:37,964 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1549618.0, ans=0.0 2023-10-14 01:00:39,505 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.13 vs. 
limit=22.5 2023-10-14 01:00:47,351 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.482e+02 1.779e+02 1.981e+02 2.147e+02 2.752e+02, threshold=3.963e+02, percent-clipped=0.0 2023-10-14 01:00:58,618 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1549664.6666666667, ans=0.125 2023-10-14 01:01:04,048 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1549664.6666666667, ans=0.125 2023-10-14 01:01:38,765 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 01:01:41,624 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=1549804.6666666667, ans=0.05 2023-10-14 01:01:45,150 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 01:01:46,958 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1549804.6666666667, ans=0.0 2023-10-14 01:02:12,241 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1549898.0, ans=0.125 2023-10-14 01:02:59,460 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1549991.3333333333, ans=0.125 2023-10-14 01:03:13,892 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.61 vs. limit=15.0 2023-10-14 01:04:00,763 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1550084.6666666667, ans=0.1 2023-10-14 01:04:01,398 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.579e+02 1.881e+02 2.079e+02 2.309e+02 3.185e+02, threshold=4.159e+02, percent-clipped=0.0 2023-10-14 01:04:09,923 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1550131.3333333333, ans=0.125 2023-10-14 01:04:14,059 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1550131.3333333333, ans=0.125 2023-10-14 01:04:17,478 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1550131.3333333333, ans=0.2 2023-10-14 01:04:25,015 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1550178.0, ans=0.125 2023-10-14 01:04:43,230 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1550224.6666666667, ans=0.2 2023-10-14 01:04:56,940 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.55 vs. 
limit=15.0 2023-10-14 01:05:14,394 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1550318.0, ans=0.125 2023-10-14 01:05:25,854 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-14 01:05:29,577 INFO [train.py:1031] (3/4) Epoch 25, batch 4500, loss[loss=0.1971, simple_loss=0.2839, pruned_loss=0.0551, over 16120.00 frames. ], tot_loss[loss=0.1868, simple_loss=0.2786, pruned_loss=0.04747, over 29352569.32 frames. ], batch size: 296, lr: 1.38e-03, grad_scale: 16.0 2023-10-14 01:05:36,806 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1550364.6666666667, ans=0.0 2023-10-14 01:05:38,013 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.82 vs. limit=6.0 2023-10-14 01:05:39,197 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.27 vs. limit=15.0 2023-10-14 01:05:44,570 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.89 vs. limit=15.0 2023-10-14 01:05:47,042 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1550411.3333333333, ans=0.125 2023-10-14 01:06:01,679 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1550458.0, ans=0.125 2023-10-14 01:06:10,929 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.93 vs. limit=15.0 2023-10-14 01:06:20,177 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1550504.6666666667, ans=0.125 2023-10-14 01:06:26,298 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1550504.6666666667, ans=0.125 2023-10-14 01:06:26,366 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1550504.6666666667, ans=0.2 2023-10-14 01:06:29,422 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.75 vs. limit=6.0 2023-10-14 01:06:36,768 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=1550504.6666666667, ans=15.0 2023-10-14 01:06:41,284 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.00 vs. limit=22.5 2023-10-14 01:06:42,302 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.29 vs. 
limit=15.0 2023-10-14 01:06:46,117 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1550551.3333333333, ans=0.125 2023-10-14 01:06:53,598 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.504e+02 1.765e+02 1.889e+02 2.110e+02 3.343e+02, threshold=3.778e+02, percent-clipped=0.0 2023-10-14 01:07:02,510 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1550598.0, ans=0.125 2023-10-14 01:07:03,529 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1550598.0, ans=0.125 2023-10-14 01:07:23,118 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1550644.6666666667, ans=0.0 2023-10-14 01:07:31,951 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1550691.3333333333, ans=0.025 2023-10-14 01:07:36,563 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1550691.3333333333, ans=0.07 2023-10-14 01:08:21,218 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1550784.6666666667, ans=0.2 2023-10-14 01:08:39,276 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-14 01:08:45,617 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1550878.0, ans=0.125 2023-10-14 01:08:58,587 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1550878.0, ans=0.0 2023-10-14 01:09:08,194 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1550924.6666666667, ans=0.125 2023-10-14 01:09:17,309 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1550924.6666666667, ans=0.125 2023-10-14 01:09:31,339 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1550971.3333333333, ans=10.0 2023-10-14 01:09:33,202 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1550971.3333333333, ans=0.0 2023-10-14 01:09:43,102 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1551018.0, ans=0.0 2023-10-14 01:09:55,067 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1551018.0, ans=0.0 2023-10-14 01:10:00,928 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.630e+02 1.916e+02 2.048e+02 2.213e+02 3.103e+02, threshold=4.097e+02, percent-clipped=0.0 2023-10-14 01:10:27,266 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1551111.3333333333, ans=0.125 2023-10-14 01:10:28,317 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1551111.3333333333, ans=0.125 2023-10-14 01:10:34,371 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, 
batch_count=1551111.3333333333, ans=0.0 2023-10-14 01:10:35,601 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1551111.3333333333, ans=0.0 2023-10-14 01:10:51,256 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1551158.0, ans=0.125 2023-10-14 01:10:51,370 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 01:11:04,956 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.77 vs. limit=6.0 2023-10-14 01:11:49,966 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=14.04 vs. limit=22.5 2023-10-14 01:12:38,287 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1551438.0, ans=0.125 2023-10-14 01:12:42,768 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1551484.6666666667, ans=0.1 2023-10-14 01:12:44,729 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1551484.6666666667, ans=0.125 2023-10-14 01:12:53,898 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.544e+02 1.813e+02 2.018e+02 2.178e+02 2.751e+02, threshold=4.036e+02, percent-clipped=0.0 2023-10-14 01:13:23,321 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1551578.0, ans=0.2 2023-10-14 01:13:44,584 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1551624.6666666667, ans=0.0 2023-10-14 01:14:00,092 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1551671.3333333333, ans=0.0 2023-10-14 01:14:02,172 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.64 vs. 
limit=15.0 2023-10-14 01:14:18,380 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1551718.0, ans=0.125 2023-10-14 01:14:36,303 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1551764.6666666667, ans=0.125 2023-10-14 01:14:46,477 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1551811.3333333333, ans=0.125 2023-10-14 01:14:53,122 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1551811.3333333333, ans=0.1 2023-10-14 01:15:03,375 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1551858.0, ans=0.125 2023-10-14 01:16:06,284 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.427e+02 1.775e+02 1.914e+02 2.145e+02 2.747e+02, threshold=3.827e+02, percent-clipped=0.0 2023-10-14 01:16:06,824 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_positive, batch_count=1551951.3333333333, ans=0.05 2023-10-14 01:16:36,484 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1552044.6666666667, ans=0.1 2023-10-14 01:16:55,044 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1552091.3333333333, ans=0.1 2023-10-14 01:16:58,021 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.21 vs. limit=15.0 2023-10-14 01:17:02,089 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=6.85 vs. 
limit=15.0 2023-10-14 01:17:21,745 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1552138.0, ans=0.0 2023-10-14 01:17:30,396 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1552184.6666666667, ans=0.0 2023-10-14 01:17:46,238 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1552231.3333333333, ans=0.2 2023-10-14 01:18:17,596 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1552278.0, ans=0.0 2023-10-14 01:18:41,315 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1552371.3333333333, ans=0.05 2023-10-14 01:19:11,611 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.482e+02 1.783e+02 1.972e+02 2.129e+02 2.952e+02, threshold=3.944e+02, percent-clipped=0.0 2023-10-14 01:19:12,206 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1552418.0, ans=0.2 2023-10-14 01:20:17,284 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1552651.3333333333, ans=0.1 2023-10-14 01:20:18,138 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1552651.3333333333, ans=0.125 2023-10-14 01:20:21,256 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.19 vs. limit=15.0 2023-10-14 01:20:29,626 INFO [train.py:1031] (3/4) Epoch 25, batch 5000, loss[loss=0.158, simple_loss=0.2614, pruned_loss=0.02729, over 16887.00 frames. ], tot_loss[loss=0.1869, simple_loss=0.2784, pruned_loss=0.04772, over 30085661.62 frames. ], batch size: 104, lr: 1.38e-03, grad_scale: 32.0 2023-10-14 01:20:37,766 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.97 vs. limit=15.0 2023-10-14 01:20:52,815 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1552744.6666666667, ans=0.0 2023-10-14 01:20:52,862 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1552744.6666666667, ans=0.125 2023-10-14 01:20:56,137 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1552791.3333333333, ans=0.0 2023-10-14 01:20:56,267 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1552791.3333333333, ans=0.1 2023-10-14 01:21:05,428 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1552838.0, ans=0.0 2023-10-14 01:21:06,549 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1552838.0, ans=0.0 2023-10-14 01:21:12,310 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.12 vs. 
limit=15.0 2023-10-14 01:21:13,474 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1552838.0, ans=0.1 2023-10-14 01:21:14,345 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1552838.0, ans=0.2 2023-10-14 01:21:22,014 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1552884.6666666667, ans=0.1 2023-10-14 01:21:26,505 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1552884.6666666667, ans=0.125 2023-10-14 01:21:28,185 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1552884.6666666667, ans=0.125 2023-10-14 01:21:28,187 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1552884.6666666667, ans=0.1 2023-10-14 01:21:29,399 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.63 vs. limit=12.0 2023-10-14 01:21:31,747 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.455e+02 1.780e+02 1.937e+02 2.152e+02 3.729e+02, threshold=3.873e+02, percent-clipped=0.0 2023-10-14 01:21:45,313 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1552931.3333333333, ans=0.125 2023-10-14 01:22:06,371 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1553024.6666666667, ans=0.125 2023-10-14 01:22:07,689 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff3.min_abs, batch_count=1553024.6666666667, ans=0.2 2023-10-14 01:22:27,032 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 01:22:27,069 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1553118.0, ans=0.125 2023-10-14 01:23:22,147 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1553258.0, ans=0.125 2023-10-14 01:23:25,992 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1553304.6666666667, ans=0.0 2023-10-14 01:23:26,232 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1553304.6666666667, ans=0.125 2023-10-14 01:23:38,081 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.92 vs. limit=15.0 2023-10-14 01:23:42,188 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1553351.3333333333, ans=0.125 2023-10-14 01:23:48,394 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.538e+02 1.815e+02 2.017e+02 2.279e+02 2.951e+02, threshold=4.034e+02, percent-clipped=0.0 2023-10-14 01:24:04,309 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.21 vs. 
limit=10.0 2023-10-14 01:24:25,801 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1553491.3333333333, ans=0.0 2023-10-14 01:24:51,621 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1553538.0, ans=0.1 2023-10-14 01:24:57,368 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 01:24:59,090 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1553584.6666666667, ans=0.0 2023-10-14 01:26:35,095 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.838e+02 2.064e+02 2.440e+02 3.125e+02, threshold=4.128e+02, percent-clipped=0.0 2023-10-14 01:26:37,825 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1553864.6666666667, ans=0.2 2023-10-14 01:27:02,004 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1553911.3333333333, ans=0.125 2023-10-14 01:27:03,640 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1553911.3333333333, ans=0.125 2023-10-14 01:27:13,413 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1553958.0, ans=0.0 2023-10-14 01:27:21,719 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1553958.0, ans=0.125 2023-10-14 01:27:23,021 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1553958.0, ans=0.0 2023-10-14 01:27:32,114 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1554004.6666666667, ans=10.0 2023-10-14 01:27:52,550 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=6.00 vs. limit=15.0 2023-10-14 01:27:52,637 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=16.22 vs. limit=22.5 2023-10-14 01:27:56,425 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1554051.3333333333, ans=0.125 2023-10-14 01:28:18,218 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=6.90 vs. limit=15.0 2023-10-14 01:28:55,141 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1554238.0, ans=0.125 2023-10-14 01:29:11,627 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1554238.0, ans=0.125 2023-10-14 01:29:24,904 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.87 vs. 
limit=6.0 2023-10-14 01:29:25,640 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.559e+02 1.754e+02 1.879e+02 2.100e+02 3.034e+02, threshold=3.757e+02, percent-clipped=0.0 2023-10-14 01:29:35,349 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.18 vs. limit=10.0 2023-10-14 01:29:41,395 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1554331.3333333333, ans=0.0 2023-10-14 01:29:55,701 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1554378.0, ans=0.5 2023-10-14 01:30:16,304 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.39 vs. limit=22.5 2023-10-14 01:30:57,294 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.58 vs. limit=22.5 2023-10-14 01:31:07,201 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 01:31:14,617 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1554611.3333333333, ans=0.125 2023-10-14 01:31:28,418 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1554611.3333333333, ans=0.2 2023-10-14 01:31:37,086 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1554658.0, ans=0.09899494936611666 2023-10-14 01:31:52,150 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1554704.6666666667, ans=0.125 2023-10-14 01:32:03,054 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1554751.3333333333, ans=0.125 2023-10-14 01:32:03,283 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.87 vs. limit=15.0 2023-10-14 01:32:07,210 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1554751.3333333333, ans=0.1 2023-10-14 01:32:12,325 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.479e+02 1.741e+02 1.894e+02 2.090e+02 3.309e+02, threshold=3.787e+02, percent-clipped=0.0 2023-10-14 01:32:23,853 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1554798.0, ans=0.125 2023-10-14 01:32:40,234 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1554891.3333333333, ans=0.0 2023-10-14 01:32:58,682 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1554938.0, ans=0.2 2023-10-14 01:33:06,348 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1554938.0, ans=0.0 2023-10-14 01:33:14,816 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.07 vs. 
limit=22.5 2023-10-14 01:33:22,418 INFO [train.py:1031] (3/4) Epoch 25, batch 5500, loss[loss=0.1766, simple_loss=0.2663, pruned_loss=0.04344, over 16881.00 frames. ], tot_loss[loss=0.1868, simple_loss=0.2784, pruned_loss=0.04763, over 30702040.83 frames. ], batch size: 82, lr: 1.38e-03, grad_scale: 32.0 2023-10-14 01:33:30,112 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.12 vs. limit=15.0 2023-10-14 01:33:40,952 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1555078.0, ans=0.125 2023-10-14 01:33:44,644 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1555078.0, ans=0.1 2023-10-14 01:33:50,310 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1555124.6666666667, ans=0.0 2023-10-14 01:34:01,904 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1555171.3333333333, ans=0.125 2023-10-14 01:34:24,557 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.569e+02 1.796e+02 1.966e+02 2.130e+02 3.043e+02, threshold=3.932e+02, percent-clipped=0.0 2023-10-14 01:34:30,702 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1555264.6666666667, ans=0.0 2023-10-14 01:34:31,877 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 01:35:04,183 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1555358.0, ans=0.125 2023-10-14 01:35:06,383 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1555404.6666666667, ans=0.2 2023-10-14 01:35:29,400 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1555451.3333333333, ans=0.0 2023-10-14 01:35:53,373 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1555544.6666666667, ans=0.125 2023-10-14 01:36:08,131 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1555591.3333333333, ans=0.125 2023-10-14 01:36:13,806 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1555591.3333333333, ans=0.2 2023-10-14 01:36:22,478 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1555638.0, ans=0.125 2023-10-14 01:36:41,754 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1555684.6666666667, ans=0.0 2023-10-14 01:36:46,820 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.554e+02 1.823e+02 1.954e+02 2.145e+02 2.802e+02, threshold=3.908e+02, percent-clipped=0.0 2023-10-14 01:36:54,800 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.27 vs. 
limit=22.5 2023-10-14 01:36:57,910 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.25 vs. limit=15.0 2023-10-14 01:37:47,804 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1555871.3333333333, ans=0.125 2023-10-14 01:37:56,657 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1555918.0, ans=0.5 2023-10-14 01:38:12,564 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1555918.0, ans=0.1 2023-10-14 01:38:17,694 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1555964.6666666667, ans=0.125 2023-10-14 01:38:34,124 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1556011.3333333333, ans=0.1 2023-10-14 01:38:36,510 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1556011.3333333333, ans=0.1 2023-10-14 01:38:43,348 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1556058.0, ans=0.125 2023-10-14 01:38:50,229 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-14 01:39:16,275 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=1556151.3333333333, ans=15.0 2023-10-14 01:39:19,832 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.490e+02 1.866e+02 1.984e+02 2.200e+02 3.265e+02, threshold=3.968e+02, percent-clipped=0.0 2023-10-14 01:39:26,848 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.51 vs. limit=10.0 2023-10-14 01:39:33,995 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1556198.0, ans=0.125 2023-10-14 01:39:56,764 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1556291.3333333333, ans=0.125 2023-10-14 01:40:14,360 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1556338.0, ans=0.1 2023-10-14 01:40:20,028 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1556384.6666666667, ans=0.2 2023-10-14 01:40:31,101 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.00 vs. limit=15.0 2023-10-14 01:40:39,337 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.37 vs. 
limit=12.0 2023-10-14 01:41:24,545 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1556524.6666666667, ans=0.04949747468305833 2023-10-14 01:41:48,241 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1556571.3333333333, ans=0.05 2023-10-14 01:41:56,338 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1556618.0, ans=0.125 2023-10-14 01:42:11,681 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.474e+02 1.813e+02 1.976e+02 2.211e+02 3.514e+02, threshold=3.951e+02, percent-clipped=0.0 2023-10-14 01:42:30,906 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.43 vs. limit=22.5 2023-10-14 01:42:51,161 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1556711.3333333333, ans=0.125 2023-10-14 01:44:09,716 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.47 vs. limit=10.0 2023-10-14 01:44:20,771 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.67 vs. limit=5.0 2023-10-14 01:45:03,342 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1556991.3333333333, ans=0.125 2023-10-14 01:45:36,844 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.96 vs. limit=22.5 2023-10-14 01:45:50,720 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.527e+02 1.770e+02 2.028e+02 2.296e+02 3.429e+02, threshold=4.056e+02, percent-clipped=0.0 2023-10-14 01:46:21,840 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1557178.0, ans=0.125 2023-10-14 01:46:45,487 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1557224.6666666667, ans=0.2 2023-10-14 01:46:52,491 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1557224.6666666667, ans=0.125 2023-10-14 01:46:53,691 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.05 vs. limit=22.5 2023-10-14 01:46:57,109 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1557271.3333333333, ans=0.125 2023-10-14 01:47:31,549 INFO [train.py:1031] (3/4) Epoch 25, batch 6000, loss[loss=0.1932, simple_loss=0.2869, pruned_loss=0.04971, over 16933.00 frames. ], tot_loss[loss=0.1874, simple_loss=0.2789, pruned_loss=0.04793, over 31154633.68 frames. ], batch size: 93, lr: 1.38e-03, grad_scale: 32.0 2023-10-14 01:47:38,840 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.56 vs. 
limit=10.0 2023-10-14 01:48:13,367 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1557504.6666666667, ans=0.1 2023-10-14 01:48:27,971 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1557504.6666666667, ans=0.0 2023-10-14 01:48:40,091 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.05 vs. limit=22.5 2023-10-14 01:48:43,708 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.463e+02 1.921e+02 2.127e+02 2.411e+02 3.205e+02, threshold=4.254e+02, percent-clipped=0.0 2023-10-14 01:48:50,016 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1557598.0, ans=0.1 2023-10-14 01:49:08,236 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.34 vs. limit=10.0 2023-10-14 01:49:23,210 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1557738.0, ans=0.125 2023-10-14 01:49:27,016 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1557738.0, ans=0.0 2023-10-14 01:49:27,230 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.83 vs. limit=15.0 2023-10-14 01:49:37,946 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1557784.6666666667, ans=0.125 2023-10-14 01:49:50,160 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1557831.3333333333, ans=0.0 2023-10-14 01:49:53,198 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 01:49:58,770 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1557878.0, ans=0.0 2023-10-14 01:50:02,017 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1557878.0, ans=0.125 2023-10-14 01:50:02,814 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1557924.6666666667, ans=0.2 2023-10-14 01:50:02,890 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1557924.6666666667, ans=0.0 2023-10-14 01:50:37,253 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.460e+02 1.867e+02 2.061e+02 2.265e+02 3.155e+02, threshold=4.122e+02, percent-clipped=0.0 2023-10-14 01:50:55,999 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1558111.3333333333, ans=0.125 2023-10-14 01:51:22,784 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1558204.6666666667, ans=0.0 2023-10-14 01:52:02,428 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1558391.3333333333, ans=0.125 2023-10-14 01:52:21,083 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1558438.0, 
ans=0.125 2023-10-14 01:52:23,904 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1558484.6666666667, ans=0.1 2023-10-14 01:52:27,797 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1558484.6666666667, ans=0.125 2023-10-14 01:52:33,607 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.547e+02 1.987e+02 2.187e+02 2.456e+02 3.411e+02, threshold=4.374e+02, percent-clipped=0.0 2023-10-14 01:52:43,231 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1558531.3333333333, ans=0.125 2023-10-14 01:52:52,382 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.57 vs. limit=15.0 2023-10-14 01:53:01,500 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1558624.6666666667, ans=0.125 2023-10-14 01:53:07,350 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.45 vs. limit=22.5 2023-10-14 01:53:15,834 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_na.min_abs, batch_count=1558671.3333333333, ans=0.02 2023-10-14 01:53:17,092 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1558671.3333333333, ans=0.2 2023-10-14 01:53:17,099 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1558671.3333333333, ans=0.125 2023-10-14 01:53:18,757 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.23 vs. limit=22.5 2023-10-14 01:53:29,269 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.72 vs. limit=6.0 2023-10-14 01:53:36,775 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.19 vs. limit=15.0 2023-10-14 01:53:40,904 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1558764.6666666667, ans=0.0 2023-10-14 01:53:53,837 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1558811.3333333333, ans=0.1 2023-10-14 01:54:02,414 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1558858.0, ans=0.0 2023-10-14 01:54:02,434 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1558858.0, ans=0.125 2023-10-14 01:54:03,556 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.28 vs. 
limit=15.0 2023-10-14 01:54:04,535 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1558858.0, ans=0.2 2023-10-14 01:54:14,818 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1558904.6666666667, ans=0.125 2023-10-14 01:54:21,647 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1558904.6666666667, ans=0.0 2023-10-14 01:54:24,951 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 01:54:33,867 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.91 vs. limit=22.5 2023-10-14 01:54:35,515 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1558951.3333333333, ans=0.0 2023-10-14 01:54:38,105 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.568e+02 1.848e+02 2.039e+02 2.330e+02 3.399e+02, threshold=4.078e+02, percent-clipped=0.0 2023-10-14 01:54:47,339 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1558998.0, ans=0.125 2023-10-14 01:54:47,497 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1558998.0, ans=0.125 2023-10-14 01:55:22,122 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.16 vs. limit=22.5 2023-10-14 01:55:33,203 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.47 vs. limit=10.0 2023-10-14 01:56:02,050 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=4.76 vs. limit=15.0 2023-10-14 01:56:24,700 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1559324.6666666667, ans=0.125 2023-10-14 01:56:48,936 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1559371.3333333333, ans=0.2 2023-10-14 01:57:07,153 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1559418.0, ans=0.125 2023-10-14 01:57:08,942 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.441e+02 1.761e+02 1.975e+02 2.221e+02 3.023e+02, threshold=3.950e+02, percent-clipped=0.0 2023-10-14 01:58:15,143 INFO [train.py:1031] (3/4) Epoch 25, batch 6500, loss[loss=0.1931, simple_loss=0.2923, pruned_loss=0.04698, over 16856.00 frames. ], tot_loss[loss=0.1879, simple_loss=0.2795, pruned_loss=0.04813, over 31533559.15 frames. 
], batch size: 155, lr: 1.38e-03, grad_scale: 32.0 2023-10-14 01:58:27,360 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1559698.0, ans=0.0 2023-10-14 01:58:29,813 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1559744.6666666667, ans=0.125 2023-10-14 01:58:40,102 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.19 vs. limit=15.0 2023-10-14 01:58:50,885 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1559791.3333333333, ans=0.125 2023-10-14 01:58:52,086 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1559791.3333333333, ans=0.0 2023-10-14 01:59:15,554 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.80 vs. limit=15.0 2023-10-14 01:59:19,226 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.21 vs. limit=6.0 2023-10-14 01:59:29,243 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.36 vs. limit=15.0 2023-10-14 01:59:37,267 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1559931.3333333333, ans=0.2 2023-10-14 01:59:37,924 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.553e+02 1.812e+02 2.001e+02 2.241e+02 2.998e+02, threshold=4.002e+02, percent-clipped=0.0 2023-10-14 01:59:49,335 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.42 vs. limit=10.0 2023-10-14 02:00:29,516 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.81 vs. limit=15.0 2023-10-14 02:00:47,295 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1560118.0, ans=0.1 2023-10-14 02:00:47,450 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=6.85 vs. limit=15.0 2023-10-14 02:00:55,019 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1560164.6666666667, ans=0.125 2023-10-14 02:01:16,456 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1560211.3333333333, ans=0.0 2023-10-14 02:01:25,704 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1560258.0, ans=0.125 2023-10-14 02:01:28,243 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1560258.0, ans=0.1 2023-10-14 02:01:43,682 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.93 vs. 
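
One regularity worth pointing out in the optim.py Clipping lines: with Clipping_scale=2.0, the logged threshold is exactly twice the logged median grad-norm (2.0 x 2.001e+02 = 4.002e+02 in the entry just above, and the same relation holds for the other entries), so the clipping threshold tracks a running median of recent gradient norms rather than being a fixed constant. A minimal sketch of that idea follows; the window size and helper names are invented:

    import torch
    from collections import deque

    class MedianGradClipper:
        # Clip when the global grad norm exceeds scale * median(recent norms).
        def __init__(self, clipping_scale=2.0, window=128):
            self.scale = clipping_scale
            self.norms = deque(maxlen=window)
            self.num_clipped = 0
            self.num_steps = 0

        def __call__(self, parameters):
            grads = [p.grad for p in parameters if p.grad is not None]
            norm = torch.norm(torch.stack([g.norm() for g in grads])).item()
            self.norms.append(norm)
            ordered = sorted(self.norms)
            threshold = self.scale * ordered[len(ordered) // 2]  # 2 * median
            self.num_steps += 1
            if norm > threshold:
                self.num_clipped += 1
                for g in grads:
                    g.mul_(threshold / norm)   # rescale gradients in place
            return norm, threshold

        def quartiles(self):
            # min / 25% / 50% / 75% / max of recent norms, as in the log lines
            s = sorted(self.norms)
            n = len(s)
            return [s[0], s[n // 4], s[n // 2], s[3 * n // 4], s[n - 1]]

percent-clipped is then 100.0 * num_clipped / num_steps over the reporting interval; it stays at 0.0 in almost every entry here, meaning the median-tracking threshold is loose enough that clipping rarely fires.
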
limit=22.5 2023-10-14 02:01:55,023 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.514e+02 1.833e+02 2.027e+02 2.262e+02 3.241e+02, threshold=4.054e+02, percent-clipped=0.0 2023-10-14 02:02:12,434 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1560444.6666666667, ans=0.125 2023-10-14 02:02:32,565 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1560538.0, ans=0.2 2023-10-14 02:03:25,669 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1560724.6666666667, ans=0.125 2023-10-14 02:03:50,685 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1560818.0, ans=0.0 2023-10-14 02:03:51,956 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1560818.0, ans=0.1 2023-10-14 02:04:00,809 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1560818.0, ans=0.035 2023-10-14 02:04:03,613 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1560818.0, ans=0.1 2023-10-14 02:04:06,835 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.404e+02 1.701e+02 1.868e+02 2.039e+02 2.419e+02, threshold=3.736e+02, percent-clipped=0.0 2023-10-14 02:04:08,787 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1560864.6666666667, ans=0.125 2023-10-14 02:04:16,657 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1560864.6666666667, ans=0.0 2023-10-14 02:04:26,646 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 02:04:29,376 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1560911.3333333333, ans=0.125 2023-10-14 02:04:32,655 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.04 vs. 
limit=15.0 2023-10-14 02:04:46,765 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1561004.6666666667, ans=0.125 2023-10-14 02:05:09,368 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1561051.3333333333, ans=0.125 2023-10-14 02:05:15,447 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1561051.3333333333, ans=0.0 2023-10-14 02:05:20,329 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1561098.0, ans=0.015 2023-10-14 02:05:55,974 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1561191.3333333333, ans=0.0 2023-10-14 02:06:20,608 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1561238.0, ans=0.2 2023-10-14 02:06:43,021 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.470e+02 1.735e+02 1.865e+02 2.148e+02 3.287e+02, threshold=3.731e+02, percent-clipped=0.0 2023-10-14 02:06:58,642 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1561378.0, ans=0.0 2023-10-14 02:07:02,563 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1561378.0, ans=0.1 2023-10-14 02:07:31,429 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1561471.3333333333, ans=0.125 2023-10-14 02:08:09,591 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1561611.3333333333, ans=0.125 2023-10-14 02:08:36,840 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1561704.6666666667, ans=0.5 2023-10-14 02:08:45,796 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.36 vs. limit=15.0 2023-10-14 02:08:52,285 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.72 vs. 
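
The Whitening lines compare a measured metric against a limit (metric=14.07 vs. limit=22.5 just above) and trigger a corrective term only when the limit is exceeded. A natural scale-invariant whiteness measure, and my best reading of what the logged metric expresses, is the ratio of the mean squared eigenvalue of the group covariance to the square of its mean eigenvalue: exactly 1.0 for an isotropic covariance, growing as the spectrum becomes lopsided. A sketch under that assumption, not code lifted from scaling.py:

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int) -> float:
        # x: (num_frames, num_channels). Split channels into groups and ask
        # how far each group's covariance is from a multiple of the identity.
        num_frames, num_channels = x.shape
        x = x.reshape(num_frames, num_groups, num_channels // num_groups)
        x = x - x.mean(dim=0, keepdim=True)
        cov = torch.einsum("ngc,ngd->gcd", x, x) / num_frames
        eigs = torch.linalg.eigvalsh(cov)   # per-group eigenvalues
        # E[eig^2] / (E[eig])^2: 1.0 iff perfectly white
        return ((eigs ** 2).mean() / eigs.mean() ** 2).item()

    feats = torch.randn(1000, 384)                 # num_channels=384, as above
    print(whitening_metric(feats, num_groups=1))   # modestly above 1.0
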
limit=15.0 2023-10-14 02:08:52,661 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.353e+02 1.780e+02 1.948e+02 2.216e+02 3.561e+02, threshold=3.896e+02, percent-clipped=0.0 2023-10-14 02:08:59,005 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1561798.0, ans=0.0 2023-10-14 02:09:25,624 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1561891.3333333333, ans=0.1 2023-10-14 02:09:27,113 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1561938.0, ans=0.125 2023-10-14 02:09:27,171 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1561938.0, ans=0.125 2023-10-14 02:09:43,491 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1561984.6666666667, ans=0.09899494936611666 2023-10-14 02:09:52,109 INFO [train.py:1031] (3/4) Epoch 25, batch 7000, loss[loss=0.1936, simple_loss=0.2862, pruned_loss=0.05048, over 16802.00 frames. ], tot_loss[loss=0.1879, simple_loss=0.28, pruned_loss=0.04789, over 31836550.72 frames. ], batch size: 81, lr: 1.37e-03, grad_scale: 16.0 2023-10-14 02:09:57,792 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1562031.3333333333, ans=0.2 2023-10-14 02:10:01,040 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1.whitening_limit, batch_count=1562031.3333333333, ans=10.0 2023-10-14 02:10:17,142 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 02:10:56,519 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.59 vs. limit=15.0 2023-10-14 02:10:58,540 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1562264.6666666667, ans=0.125 2023-10-14 02:11:00,890 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.45 vs. limit=15.0 2023-10-14 02:11:01,298 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.399e+02 1.797e+02 1.946e+02 2.145e+02 2.799e+02, threshold=3.892e+02, percent-clipped=0.0 2023-10-14 02:11:24,290 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1562358.0, ans=0.0 2023-10-14 02:12:15,470 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1562498.0, ans=0.2 2023-10-14 02:12:19,274 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1562498.0, ans=0.0 2023-10-14 02:12:30,431 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1562544.6666666667, ans=0.2 2023-10-14 02:12:32,658 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.84 vs. 
limit=6.0 2023-10-14 02:12:34,503 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1562591.3333333333, ans=0.0 2023-10-14 02:12:55,717 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=6.89 vs. limit=15.0 2023-10-14 02:13:17,458 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1562731.3333333333, ans=0.0 2023-10-14 02:13:23,190 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.465e+02 1.875e+02 1.996e+02 2.168e+02 2.790e+02, threshold=3.991e+02, percent-clipped=0.0 2023-10-14 02:13:23,558 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1562731.3333333333, ans=0.1 2023-10-14 02:13:50,590 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1562824.6666666667, ans=0.125 2023-10-14 02:13:50,655 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1562824.6666666667, ans=0.1 2023-10-14 02:14:29,472 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1562964.6666666667, ans=0.0 2023-10-14 02:14:42,328 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1563011.3333333333, ans=0.125 2023-10-14 02:15:23,315 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1563104.6666666667, ans=0.0 2023-10-14 02:15:23,601 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=10.37 vs. limit=22.5 2023-10-14 02:15:24,811 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1563104.6666666667, ans=0.1 2023-10-14 02:15:26,748 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1563104.6666666667, ans=0.1 2023-10-14 02:15:36,443 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1563151.3333333333, ans=0.125 2023-10-14 02:15:43,712 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1563151.3333333333, ans=0.125 2023-10-14 02:15:49,268 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.27 vs. 
limit=15.0 2023-10-14 02:15:54,178 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.440e+02 1.797e+02 1.967e+02 2.100e+02 2.869e+02, threshold=3.933e+02, percent-clipped=0.0 2023-10-14 02:16:00,778 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1563198.0, ans=0.0 2023-10-14 02:16:12,281 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1563244.6666666667, ans=0.5 2023-10-14 02:16:24,614 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1563291.3333333333, ans=0.1 2023-10-14 02:16:25,801 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1563291.3333333333, ans=0.0 2023-10-14 02:16:26,790 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1563291.3333333333, ans=0.1 2023-10-14 02:16:28,736 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1563291.3333333333, ans=0.125 2023-10-14 02:16:33,479 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1563338.0, ans=0.125 2023-10-14 02:16:47,712 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1563384.6666666667, ans=0.125 2023-10-14 02:16:47,990 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.16 vs. limit=15.0 2023-10-14 02:16:51,638 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1563384.6666666667, ans=0.2 2023-10-14 02:17:05,522 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 02:17:09,064 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1563431.3333333333, ans=0.125 2023-10-14 02:17:16,753 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1563478.0, ans=0.0 2023-10-14 02:17:27,373 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1563524.6666666667, ans=0.0 2023-10-14 02:17:35,385 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1563524.6666666667, ans=0.125 2023-10-14 02:17:44,976 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1563571.3333333333, ans=0.125 2023-10-14 02:18:14,583 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.576e+02 1.906e+02 2.098e+02 2.525e+02 3.342e+02, threshold=4.197e+02, percent-clipped=0.0 2023-10-14 02:18:19,419 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1563664.6666666667, ans=0.95 2023-10-14 02:18:26,317 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1563711.3333333333, ans=0.0 2023-10-14 02:18:39,870 
INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1563758.0, ans=0.125 2023-10-14 02:18:41,289 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1563758.0, ans=0.05 2023-10-14 02:18:41,308 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=1563758.0, ans=10.0 2023-10-14 02:19:00,242 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1563804.6666666667, ans=0.125 2023-10-14 02:19:00,277 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1563804.6666666667, ans=0.0 2023-10-14 02:19:06,900 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1563851.3333333333, ans=0.125 2023-10-14 02:19:16,712 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1563898.0, ans=0.2 2023-10-14 02:19:21,842 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1563898.0, ans=0.09899494936611666 2023-10-14 02:19:25,599 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1563898.0, ans=0.125 2023-10-14 02:19:29,617 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.50 vs. limit=15.0 2023-10-14 02:19:43,194 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.07 vs. limit=22.5 2023-10-14 02:19:58,128 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1564038.0, ans=0.2 2023-10-14 02:20:03,297 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1564038.0, ans=0.125 2023-10-14 02:20:25,330 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.511e+02 1.857e+02 1.973e+02 2.145e+02 3.193e+02, threshold=3.946e+02, percent-clipped=0.0 2023-10-14 02:20:46,265 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1564178.0, ans=0.125 2023-10-14 02:20:59,070 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.50 vs. limit=12.0 2023-10-14 02:21:05,387 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.97 vs. limit=22.5 2023-10-14 02:21:05,968 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1564271.3333333333, ans=0.125 2023-10-14 02:21:06,220 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.26 vs. 
limit=15.0 2023-10-14 02:21:10,813 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1564318.0, ans=0.125 2023-10-14 02:21:23,225 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1564318.0, ans=0.0 2023-10-14 02:21:28,762 INFO [train.py:1031] (3/4) Epoch 25, batch 7500, loss[loss=0.2301, simple_loss=0.3007, pruned_loss=0.0797, over 15508.00 frames. ], tot_loss[loss=0.1877, simple_loss=0.2797, pruned_loss=0.04789, over 32035834.35 frames. ], batch size: 350, lr: 1.37e-03, grad_scale: 32.0 2023-10-14 02:21:53,560 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1564458.0, ans=0.1 2023-10-14 02:22:05,938 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1564504.6666666667, ans=0.1 2023-10-14 02:22:06,745 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1564504.6666666667, ans=0.035 2023-10-14 02:22:08,543 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1564504.6666666667, ans=0.1 2023-10-14 02:22:17,394 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.01 vs. limit=6.0 2023-10-14 02:22:22,324 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1564551.3333333333, ans=0.0 2023-10-14 02:22:27,140 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1564551.3333333333, ans=0.2 2023-10-14 02:22:30,031 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1564551.3333333333, ans=0.125 2023-10-14 02:22:39,981 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.504e+02 1.803e+02 1.977e+02 2.133e+02 3.002e+02, threshold=3.954e+02, percent-clipped=0.0 2023-10-14 02:22:47,461 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff3.min_abs, batch_count=1564598.0, ans=0.2 2023-10-14 02:22:47,761 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.67 vs. 
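
The lr: 1.37e-03 in the batch 7500 summary above comes from the run's scheduler (base_lr 0.045, lr_batches 7500, lr_epochs 1.0 in this run's configuration). Zipformer recipes pair ScaledAdam with an Eden-style schedule that decays with both batch index and epoch; the sketch below has that general shape, but I have not verified that these exact factors reproduce the logged value (details such as the ref_duration rescaling can change it), so treat it as an outline only:

    def eden_lr(base_lr, batch, epoch,
                lr_batches=7500.0, lr_epochs=1.0, warmup_batches=2000.0):
        # Power-law decay in both batch count and epoch, plus linear warmup.
        batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
        epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
        warmup = min(1.0, batch / warmup_batches)
        return base_lr * batch_factor * epoch_factor * warmup
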
limit=15.0 2023-10-14 02:22:59,913 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=4.091e-02 2023-10-14 02:23:15,036 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1564691.3333333333, ans=0.1 2023-10-14 02:23:33,034 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1564738.0, ans=0.2 2023-10-14 02:23:46,291 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1564831.3333333333, ans=0.0 2023-10-14 02:23:48,709 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1564831.3333333333, ans=0.025 2023-10-14 02:23:49,517 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1564831.3333333333, ans=0.0 2023-10-14 02:23:51,347 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1564831.3333333333, ans=0.2 2023-10-14 02:24:29,185 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.23 vs. limit=22.5 2023-10-14 02:24:30,678 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1564924.6666666667, ans=0.125 2023-10-14 02:24:55,969 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1565018.0, ans=0.1 2023-10-14 02:25:12,372 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.457e+02 1.822e+02 1.958e+02 2.209e+02 3.104e+02, threshold=3.916e+02, percent-clipped=0.0 2023-10-14 02:25:23,814 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff3.min_abs, batch_count=1565111.3333333333, ans=0.2 2023-10-14 02:25:26,769 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1565111.3333333333, ans=0.0 2023-10-14 02:25:26,824 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1565111.3333333333, ans=0.125 2023-10-14 02:25:30,332 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1565111.3333333333, ans=0.1 2023-10-14 02:25:51,438 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1565204.6666666667, ans=0.125 2023-10-14 02:25:54,681 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1565204.6666666667, ans=0.125 2023-10-14 02:25:56,476 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1565204.6666666667, ans=0.125 2023-10-14 02:26:31,582 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1565344.6666666667, ans=0.125 2023-10-14 02:26:46,690 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1565391.3333333333, ans=0.125 2023-10-14 02:26:54,284 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1565438.0, ans=0.0 2023-10-14 02:27:22,625 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.40 vs. limit=15.0 2023-10-14 02:27:24,946 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.538e+02 1.782e+02 1.917e+02 2.122e+02 2.951e+02, threshold=3.834e+02, percent-clipped=0.0 2023-10-14 02:27:33,480 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1565578.0, ans=0.125 2023-10-14 02:27:34,514 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1565578.0, ans=0.125 2023-10-14 02:27:45,597 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.77 vs. limit=15.0 2023-10-14 02:28:21,352 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1565718.0, ans=0.125 2023-10-14 02:28:24,382 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1565764.6666666667, ans=0.125 2023-10-14 02:28:26,108 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1565764.6666666667, ans=0.125 2023-10-14 02:28:50,348 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.97 vs. limit=15.0 2023-10-14 02:29:38,890 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.610e+02 1.856e+02 2.075e+02 2.261e+02 3.083e+02, threshold=4.150e+02, percent-clipped=0.0 2023-10-14 02:29:46,006 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1565998.0, ans=0.125 2023-10-14 02:29:46,370 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.03 vs. limit=22.5 2023-10-14 02:29:56,464 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.00 vs. limit=15.0 2023-10-14 02:29:56,590 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.23 vs. limit=10.0 2023-10-14 02:30:05,947 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1566091.3333333333, ans=0.0 2023-10-14 02:30:05,960 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1566091.3333333333, ans=0.0 2023-10-14 02:30:28,342 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1566184.6666666667, ans=0.04949747468305833 2023-10-14 02:30:30,254 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1566184.6666666667, ans=0.0 2023-10-14 02:30:30,599 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=3.96 vs. 
limit=10.0 2023-10-14 02:30:46,360 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1566231.3333333333, ans=0.0 2023-10-14 02:31:09,725 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1566324.6666666667, ans=0.025 2023-10-14 02:31:25,433 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=1566371.3333333333, ans=0.05 2023-10-14 02:31:32,052 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1566418.0, ans=0.2 2023-10-14 02:31:49,524 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.496e+02 1.724e+02 1.889e+02 2.062e+02 3.073e+02, threshold=3.779e+02, percent-clipped=0.0 2023-10-14 02:32:17,926 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1566558.0, ans=0.0 2023-10-14 02:32:49,341 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1566651.3333333333, ans=0.0 2023-10-14 02:32:50,498 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.67 vs. limit=15.0 2023-10-14 02:32:51,047 INFO [train.py:1031] (3/4) Epoch 25, batch 8000, loss[loss=0.1936, simple_loss=0.2859, pruned_loss=0.05072, over 16449.00 frames. ], tot_loss[loss=0.187, simple_loss=0.2791, pruned_loss=0.04741, over 32180714.70 frames. ], batch size: 266, lr: 1.37e-03, grad_scale: 32.0 2023-10-14 02:32:51,292 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1566698.0, ans=0.0 2023-10-14 02:32:53,253 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1566698.0, ans=0.125 2023-10-14 02:33:06,481 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1566744.6666666667, ans=0.125 2023-10-14 02:33:11,432 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 02:33:14,633 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=1566744.6666666667, ans=0.025 2023-10-14 02:33:24,703 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1566791.3333333333, ans=0.1 2023-10-14 02:33:26,574 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1566791.3333333333, ans=0.125 2023-10-14 02:33:50,827 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1566884.6666666667, ans=0.125 2023-10-14 02:34:00,165 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.479e+02 1.696e+02 1.934e+02 2.291e+02 3.190e+02, threshold=3.868e+02, percent-clipped=0.0 2023-10-14 02:34:07,944 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1566931.3333333333, ans=0.125 2023-10-14 02:34:08,963 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, 
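
In the batch 8000 summary above, tot_loss is averaged "over 32180714.70 frames": the frame count is fractional and creeps upward from one summary to the next (31.5M at batch 6500, 32.3M by batch 8500), which points to an exponentially decayed running sum rather than a plain cumulative total. A sketch under that assumption; the decay constant is invented:

    class RunningLoss:
        # Frame-weighted loss with exponential forgetting: recent batches
        # dominate and the effective frame count becomes fractional.
        def __init__(self, alpha=0.999):   # alpha is illustrative
            self.alpha = alpha
            self.loss_sum = 0.0
            self.frames = 0.0

        def update(self, batch_loss_sum, batch_frames):
            self.loss_sum = self.alpha * self.loss_sum + batch_loss_sum
            self.frames = self.alpha * self.frames + batch_frames

        @property
        def loss(self):
            return self.loss_sum / max(self.frames, 1.0)
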
batch_count=1566931.3333333333, ans=0.0 2023-10-14 02:35:14,139 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1567164.6666666667, ans=0.125 2023-10-14 02:35:15,453 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.75 vs. limit=22.5 2023-10-14 02:35:42,846 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.23 vs. limit=12.0 2023-10-14 02:35:43,288 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1567258.0, ans=0.0 2023-10-14 02:35:47,787 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1567258.0, ans=0.125 2023-10-14 02:36:02,108 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=13.59 vs. limit=15.0 2023-10-14 02:36:16,822 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.37 vs. limit=15.0 2023-10-14 02:36:25,606 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1567398.0, ans=0.2 2023-10-14 02:36:31,681 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.534e+02 1.825e+02 1.991e+02 2.215e+02 3.886e+02, threshold=3.981e+02, percent-clipped=1.0 2023-10-14 02:36:57,215 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.54 vs. limit=22.5 2023-10-14 02:37:40,755 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1567584.6666666667, ans=0.2 2023-10-14 02:37:43,756 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1567584.6666666667, ans=0.0 2023-10-14 02:37:52,655 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1567631.3333333333, ans=0.0 2023-10-14 02:38:01,957 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1567631.3333333333, ans=0.2 2023-10-14 02:38:08,454 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1567678.0, ans=0.125 2023-10-14 02:38:18,424 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.78 vs. 
limit=6.0 2023-10-14 02:38:19,884 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1567678.0, ans=0.125 2023-10-14 02:38:44,749 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1567771.3333333333, ans=0.125 2023-10-14 02:39:14,055 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1567864.6666666667, ans=0.1 2023-10-14 02:39:16,039 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-14 02:39:18,564 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.480e+02 1.772e+02 1.939e+02 2.097e+02 3.138e+02, threshold=3.877e+02, percent-clipped=0.0 2023-10-14 02:39:21,773 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1567864.6666666667, ans=0.0 2023-10-14 02:39:26,046 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1567864.6666666667, ans=0.0 2023-10-14 02:39:40,314 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1567911.3333333333, ans=0.0 2023-10-14 02:39:42,487 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1567958.0, ans=0.2 2023-10-14 02:39:45,735 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1567958.0, ans=0.2 2023-10-14 02:39:49,424 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1567958.0, ans=0.035 2023-10-14 02:40:06,247 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1568004.6666666667, ans=0.04949747468305833 2023-10-14 02:40:33,462 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.57 vs. limit=22.5 2023-10-14 02:41:00,024 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.08 vs. limit=22.5 2023-10-14 02:41:08,995 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.87 vs. limit=15.0 2023-10-14 02:41:45,455 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1568284.6666666667, ans=0.125 2023-10-14 02:41:47,672 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1568284.6666666667, ans=0.125 2023-10-14 02:41:58,552 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.508e+02 1.809e+02 1.989e+02 2.246e+02 3.073e+02, threshold=3.979e+02, percent-clipped=0.0 2023-10-14 02:42:01,741 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.40 vs. 
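
The WithLoss lines (loss-sum=0.000e+00 for most self_attn_weights modules here, occasionally non-zero, e.g. 4.091e-02 earlier in this stretch) read like auxiliary penalties attached to individual submodules and reported separately from the main loss. As a loose sketch of that pattern only: a wrapper that computes a penalty on its output and stashes it for the training loop to collect. The wrapped penalty below (mean output magnitude) is purely illustrative:

    import torch
    import torch.nn as nn

    class WithLossSketch(nn.Module):
        # Wrap a module, attach an auxiliary penalty to its output, and
        # expose it via .last_loss so the training loop can log or sum it.
        def __init__(self, module: nn.Module, weight: float = 1.0):
            super().__init__()
            self.module = module
            self.weight = weight
            self.last_loss = torch.tensor(0.0)

        def forward(self, *args, **kwargs):
            out = self.module(*args, **kwargs)
            self.last_loss = self.weight * out.abs().mean()  # illustrative
            return out

    # usage: total = main_loss + sum(m.last_loss for m in wrapped_modules)
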
limit=15.0 2023-10-14 02:42:03,541 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1568331.3333333333, ans=0.1 2023-10-14 02:42:15,858 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1568378.0, ans=0.035 2023-10-14 02:42:43,545 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1568471.3333333333, ans=0.125 2023-10-14 02:42:46,476 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.24 vs. limit=12.0 2023-10-14 02:43:04,984 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1568518.0, ans=0.125 2023-10-14 02:43:28,020 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1568564.6666666667, ans=0.125 2023-10-14 02:43:32,412 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1568564.6666666667, ans=0.125 2023-10-14 02:44:00,621 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1568658.0, ans=0.2 2023-10-14 02:44:24,461 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1568704.6666666667, ans=0.125 2023-10-14 02:44:52,455 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1568751.3333333333, ans=0.2 2023-10-14 02:45:09,020 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.557e+02 1.829e+02 1.947e+02 2.133e+02 3.028e+02, threshold=3.894e+02, percent-clipped=0.0 2023-10-14 02:45:50,877 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1568891.3333333333, ans=0.2 2023-10-14 02:46:03,099 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1568938.0, ans=0.125 2023-10-14 02:46:15,406 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1568984.6666666667, ans=0.125 2023-10-14 02:46:33,582 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1568984.6666666667, ans=0.125 2023-10-14 02:46:36,424 INFO [train.py:1031] (3/4) Epoch 25, batch 8500, loss[loss=0.1783, simple_loss=0.274, pruned_loss=0.04134, over 16376.00 frames. ], tot_loss[loss=0.1868, simple_loss=0.2792, pruned_loss=0.0472, over 32315205.54 frames. 
], batch size: 50, lr: 1.37e-03, grad_scale: 32.0 2023-10-14 02:46:49,138 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1569031.3333333333, ans=0.125 2023-10-14 02:46:59,188 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1569078.0, ans=0.125 2023-10-14 02:47:04,174 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1569078.0, ans=0.025 2023-10-14 02:47:19,641 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1569124.6666666667, ans=0.0 2023-10-14 02:47:27,275 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1569124.6666666667, ans=0.1 2023-10-14 02:47:39,268 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.65 vs. limit=15.0 2023-10-14 02:47:45,052 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1569171.3333333333, ans=0.0 2023-10-14 02:48:02,274 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.39 vs. limit=15.0 2023-10-14 02:48:08,989 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1569218.0, ans=0.1 2023-10-14 02:48:28,778 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.615e+02 1.918e+02 2.165e+02 2.461e+02 3.318e+02, threshold=4.331e+02, percent-clipped=0.0 2023-10-14 02:49:08,509 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff3.min_abs, batch_count=1569358.0, ans=0.2 2023-10-14 02:49:08,791 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.69 vs. 
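
The ubiquitous balancer parameters in these lines (prob, min_positive, max_positive, min_abs, max_abs) describe constraints on per-channel activation statistics: the fraction of positive values and the mean absolute value, enforced stochastically with probability prob. The sketch below states those constraints as a differentiable penalty; the actual Balancer in scaling.py corrects gradients directly rather than adding a loss term, so this is a description of the constraint, not of the mechanism:

    import torch

    def balancer_penalty(x, min_positive=0.05, max_positive=0.95,
                         min_abs=0.2, max_abs=10.0):
        # x: (num_frames, num_channels). Penalize channels whose fraction of
        # positive entries leaves [min_positive, max_positive] or whose mean
        # |x| leaves [min_abs, max_abs]. Bounds mirror values in the log.
        pos_frac = torch.sigmoid(20.0 * x).mean(dim=0)   # soft sign per channel
        mean_abs = x.abs().mean(dim=0)
        penalty = (torch.relu(min_positive - pos_frac)
                   + torch.relu(pos_frac - max_positive)
                   + torch.relu(min_abs - mean_abs)
                   + torch.relu(mean_abs - max_abs))
        return penalty.sum()
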
limit=15.0 2023-10-14 02:49:25,168 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1569358.0, ans=10.0 2023-10-14 02:50:05,843 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1569451.3333333333, ans=0.0 2023-10-14 02:51:13,156 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=1569638.0, ans=0.5 2023-10-14 02:51:14,198 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1569638.0, ans=0.0 2023-10-14 02:51:45,073 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1569731.3333333333, ans=0.5 2023-10-14 02:51:53,399 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.376e+02 1.716e+02 1.941e+02 2.181e+02 3.660e+02, threshold=3.881e+02, percent-clipped=0.0 2023-10-14 02:52:14,799 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1569778.0, ans=0.2 2023-10-14 02:53:04,053 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1569871.3333333333, ans=0.2 2023-10-14 02:53:07,795 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.18 vs. limit=15.0 2023-10-14 02:54:46,743 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1570058.0, ans=0.125 2023-10-14 02:55:25,660 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1570151.3333333333, ans=0.0 2023-10-14 02:55:36,381 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1570151.3333333333, ans=0.0 2023-10-14 02:55:43,228 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1570151.3333333333, ans=0.2 2023-10-14 02:56:01,907 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1570198.0, ans=0.1 2023-10-14 02:56:02,020 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1570198.0, ans=0.125 2023-10-14 02:56:02,547 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.417e+02 1.704e+02 1.849e+02 2.088e+02 2.857e+02, threshold=3.698e+02, percent-clipped=0.0 2023-10-14 02:56:39,238 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1570291.3333333333, ans=0.2 2023-10-14 02:56:44,370 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1570291.3333333333, ans=0.125 2023-10-14 02:56:44,429 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=1570291.3333333333, ans=0.05 2023-10-14 02:57:05,353 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1570338.0, ans=0.2 2023-10-14 02:57:05,639 INFO [scaling.py:979] (3/4) Whitening: 
name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.98 vs. limit=15.0 2023-10-14 02:57:26,168 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1570338.0, ans=0.125 2023-10-14 02:57:49,820 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.16 vs. limit=15.0 2023-10-14 02:58:13,542 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1570478.0, ans=0.0 2023-10-14 02:58:16,454 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1570478.0, ans=0.125 2023-10-14 02:58:20,984 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1570478.0, ans=0.1 2023-10-14 02:58:34,879 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1570524.6666666667, ans=0.125 2023-10-14 02:58:45,383 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1570571.3333333333, ans=0.0 2023-10-14 02:59:00,751 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1570618.0, ans=0.125 2023-10-14 02:59:07,686 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1570664.6666666667, ans=0.125 2023-10-14 02:59:14,159 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.410e+02 1.725e+02 1.893e+02 2.096e+02 2.628e+02, threshold=3.785e+02, percent-clipped=0.0 2023-10-14 02:59:17,411 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten.whitening_limit, batch_count=1570664.6666666667, ans=15.0 2023-10-14 03:00:22,433 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1570944.6666666667, ans=0.07 2023-10-14 03:00:24,261 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1570944.6666666667, ans=10.0 2023-10-14 03:00:39,212 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.30 vs. limit=22.5 2023-10-14 03:00:40,043 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-14 03:00:51,356 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1571084.6666666667, ans=0.125 2023-10-14 03:00:52,707 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=6.25 vs. 
limit=15.0 2023-10-14 03:01:05,622 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.497e+02 1.804e+02 1.975e+02 2.129e+02 2.551e+02, threshold=3.951e+02, percent-clipped=0.0 2023-10-14 03:01:06,996 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1571131.3333333333, ans=0.0 2023-10-14 03:01:08,241 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.36 vs. limit=15.0 2023-10-14 03:01:16,587 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1571178.0, ans=0.04949747468305833 2023-10-14 03:01:32,699 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1571271.3333333333, ans=0.1 2023-10-14 03:01:48,126 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1571318.0, ans=0.125 2023-10-14 03:01:48,948 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1571318.0, ans=0.2 2023-10-14 03:01:54,982 INFO [train.py:1031] (3/4) Epoch 25, batch 9000, loss[loss=0.1896, simple_loss=0.293, pruned_loss=0.0431, over 16782.00 frames. ], tot_loss[loss=0.1863, simple_loss=0.2786, pruned_loss=0.04698, over 32438531.70 frames. ], batch size: 188, lr: 1.37e-03, grad_scale: 8.0 2023-10-14 03:02:15,503 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1571411.3333333333, ans=0.125 2023-10-14 03:02:25,680 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 03:02:29,442 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1571458.0, ans=0.125 2023-10-14 03:02:35,152 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=1571504.6666666667, ans=0.025 2023-10-14 03:02:48,940 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1571551.3333333333, ans=0.07 2023-10-14 03:02:56,251 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1571598.0, ans=0.0 2023-10-14 03:02:58,742 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.457e+02 1.804e+02 2.003e+02 2.238e+02 3.238e+02, threshold=4.007e+02, percent-clipped=0.0 2023-10-14 03:03:01,630 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1571598.0, ans=0.125 2023-10-14 03:03:07,445 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=1571644.6666666667, ans=0.025 2023-10-14 03:03:14,823 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1571691.3333333333, ans=0.125 2023-10-14 03:03:20,314 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.39 vs. 
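
Note how grad_scale moves between the epoch summaries above: it sits at 32.0 through batches 7500-8500 but reads 8.0 by batch 9000, and earlier it dipped to 16.0 at batch 7000. With use_fp16 enabled this is the dynamic loss scale of mixed-precision training, halved when non-finite gradients are detected and grown back after a run of clean steps. The mechanism is standard torch.cuda.amp; a minimal step has this shape (model and batch handling are simplified):

    import torch

    scaler = torch.cuda.amp.GradScaler(enabled=True)

    def train_step(model, optimizer, batch):
        # Scale the loss before backward; step() unscales first; update()
        # halves the scale on overflow and slowly grows it otherwise.
        optimizer.zero_grad()
        with torch.cuda.amp.autocast(enabled=True):
            loss = model(batch)   # assume the model returns a scalar loss
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        return loss.detach(), scaler.get_scale()   # get_scale() == grad_scale
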
limit=10.0 2023-10-14 03:03:42,059 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1571784.6666666667, ans=0.125 2023-10-14 03:03:42,093 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1571784.6666666667, ans=0.0 2023-10-14 03:03:50,123 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1571831.3333333333, ans=0.125 2023-10-14 03:03:53,775 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1571831.3333333333, ans=0.1 2023-10-14 03:04:01,166 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1571878.0, ans=0.2 2023-10-14 03:04:09,571 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1571924.6666666667, ans=0.125 2023-10-14 03:04:41,767 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1572064.6666666667, ans=0.2 2023-10-14 03:04:48,496 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.503e+02 1.751e+02 1.880e+02 2.112e+02 2.668e+02, threshold=3.761e+02, percent-clipped=0.0 2023-10-14 03:04:52,544 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1572111.3333333333, ans=0.1 2023-10-14 03:05:35,040 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1572298.0, ans=0.0 2023-10-14 03:05:35,803 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1572298.0, ans=0.125 2023-10-14 03:05:37,252 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.83 vs. limit=12.0 2023-10-14 03:05:42,492 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1572298.0, ans=0.125 2023-10-14 03:05:51,374 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1572344.6666666667, ans=0.0 2023-10-14 03:05:53,082 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1572344.6666666667, ans=0.2 2023-10-14 03:05:56,967 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1572391.3333333333, ans=0.2 2023-10-14 03:06:01,438 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.66 vs. limit=15.0 2023-10-14 03:06:02,368 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.66 vs. 
limit=22.5 2023-10-14 03:06:11,136 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1572438.0, ans=0.125 2023-10-14 03:06:36,165 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.562e+02 1.842e+02 2.044e+02 2.450e+02 3.527e+02, threshold=4.088e+02, percent-clipped=0.0 2023-10-14 03:06:45,947 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1572578.0, ans=0.125 2023-10-14 03:07:11,219 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1572671.3333333333, ans=0.125 2023-10-14 03:07:35,579 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff2.min_abs, batch_count=1572811.3333333333, ans=0.1 2023-10-14 03:07:39,064 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1572811.3333333333, ans=0.125 2023-10-14 03:07:40,733 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1572811.3333333333, ans=0.1 2023-10-14 03:07:49,259 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.55 vs. limit=15.0 2023-10-14 03:08:14,111 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1572951.3333333333, ans=0.0 2023-10-14 03:08:31,997 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.491e+02 1.853e+02 2.051e+02 2.272e+02 2.849e+02, threshold=4.102e+02, percent-clipped=0.0 2023-10-14 03:08:44,337 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.75 vs. limit=6.0 2023-10-14 03:08:53,583 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.33 vs. limit=22.5 2023-10-14 03:08:54,594 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1573091.3333333333, ans=0.125 2023-10-14 03:08:59,010 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=1573091.3333333333, ans=0.5 2023-10-14 03:08:59,896 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1573091.3333333333, ans=0.1 2023-10-14 03:09:16,911 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1573184.6666666667, ans=0.125 2023-10-14 03:09:22,187 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.48 vs. limit=15.0 2023-10-14 03:09:26,402 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1573184.6666666667, ans=0.0 2023-10-14 03:09:44,576 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.26 vs. 
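The ubiquitous [scaling.py:199] ScheduledFloat entries record per-module hyper-parameters (dropout rates, balancer probabilities, skip rates, bypass scale floors) that are annealed as a function of batch_count; the dotted name is the module's path in the encoder tree and ans is the value in effect at that batch count. A minimal sketch of a piecewise-linear, batch-count-keyed schedule of this kind follows; it is a simplified stand-in, not icefall's ScheduledFloat class.

```python
class PiecewiseSchedule:
    """Float hyper-parameter interpolated piecewise-linearly against
    batch_count, held constant outside the given breakpoints."""
    def __init__(self, *points):
        self.points = sorted(points)            # (batch_count, value) pairs

    def __call__(self, batch_count: float) -> float:
        pts = self.points
        if batch_count <= pts[0][0]:
            return pts[0][1]
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if batch_count <= x1:
                t = (batch_count - x0) / (x1 - x0)
                return y0 + t * (y1 - y0)
        return pts[-1][1]

# e.g. a dropout rate decaying from 0.3 to 0.1 over the first 20k batches,
# then pinned at 0.1 for the rest of training:
dropout_p = PiecewiseSchedule((0, 0.3), (20_000, 0.1))
assert abs(dropout_p(10_000) - 0.2) < 1e-9
assert dropout_p(1_570_478) == 0.1              # far past the last breakpoint
```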
limit=12.0 2023-10-14 03:09:55,369 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1573324.6666666667, ans=0.1 2023-10-14 03:10:01,337 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1573371.3333333333, ans=0.04949747468305833 2023-10-14 03:10:04,753 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.78 vs. limit=15.0 2023-10-14 03:10:09,079 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 03:10:20,870 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.91 vs. limit=6.0 2023-10-14 03:10:25,785 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1573464.6666666667, ans=0.2 2023-10-14 03:10:27,578 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1573464.6666666667, ans=0.125 2023-10-14 03:10:33,288 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.498e+02 1.832e+02 2.006e+02 2.217e+02 3.200e+02, threshold=4.013e+02, percent-clipped=0.0 2023-10-14 03:11:09,425 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=1573604.6666666667, ans=0.07 2023-10-14 03:11:27,344 INFO [train.py:1031] (3/4) Epoch 25, batch 9500, loss[loss=0.181, simple_loss=0.2751, pruned_loss=0.04345, over 15343.00 frames. ], tot_loss[loss=0.187, simple_loss=0.2794, pruned_loss=0.04726, over 32525776.59 frames. ], batch size: 35, lr: 1.37e-03, grad_scale: 16.0 2023-10-14 03:11:57,167 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1573791.3333333333, ans=0.125 2023-10-14 03:12:12,271 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1573884.6666666667, ans=0.0 2023-10-14 03:12:18,313 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 03:12:29,849 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1573931.3333333333, ans=0.125 2023-10-14 03:12:30,506 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.499e+02 1.841e+02 1.997e+02 2.236e+02 2.993e+02, threshold=3.995e+02, percent-clipped=0.0 2023-10-14 03:12:37,964 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1573978.0, ans=0.0 2023-10-14 03:12:47,603 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.57 vs. limit=15.0 2023-10-14 03:13:04,584 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=15.56 vs. 
limit=22.5 2023-10-14 03:13:09,582 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1574118.0, ans=0.1 2023-10-14 03:13:13,898 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1574118.0, ans=0.1 2023-10-14 03:13:41,299 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys.whitening_limit, batch_count=1574211.3333333333, ans=6.0 2023-10-14 03:13:48,905 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.47 vs. limit=22.5 2023-10-14 03:13:53,894 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-14 03:13:56,565 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1574304.6666666667, ans=0.125 2023-10-14 03:14:05,514 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1574351.3333333333, ans=0.0 2023-10-14 03:14:24,072 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.506e+02 1.795e+02 1.927e+02 2.122e+02 2.768e+02, threshold=3.853e+02, percent-clipped=0.0 2023-10-14 03:14:39,914 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1574491.3333333333, ans=0.1 2023-10-14 03:15:09,004 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1574584.6666666667, ans=0.0 2023-10-14 03:15:10,080 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1574584.6666666667, ans=0.125 2023-10-14 03:15:12,852 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1574631.3333333333, ans=0.125 2023-10-14 03:15:15,593 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1574631.3333333333, ans=0.1 2023-10-14 03:15:27,106 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1574678.0, ans=0.0 2023-10-14 03:15:28,600 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1574678.0, ans=0.125 2023-10-14 03:15:36,724 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1574724.6666666667, ans=0.1 2023-10-14 03:16:02,909 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1574818.0, ans=0.125 2023-10-14 03:16:07,920 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1574818.0, ans=0.1 2023-10-14 03:16:11,979 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1574864.6666666667, ans=0.125 2023-10-14 03:16:17,350 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.532e+02 1.782e+02 1.972e+02 2.170e+02 3.443e+02, threshold=3.944e+02, percent-clipped=0.0 2023-10-14 03:17:24,630 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1575144.6666666667, ans=0.2 2023-10-14 03:17:27,656 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.71 vs. limit=6.0 2023-10-14 03:17:37,991 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1575191.3333333333, ans=0.95 2023-10-14 03:18:03,164 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1575331.3333333333, ans=0.2 2023-10-14 03:18:10,448 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.433e+02 1.733e+02 1.922e+02 2.137e+02 3.004e+02, threshold=3.844e+02, percent-clipped=0.0 2023-10-14 03:18:22,109 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.66 vs. limit=6.0 2023-10-14 03:18:27,914 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1575424.6666666667, ans=0.1 2023-10-14 03:18:48,699 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.51 vs. limit=22.5 2023-10-14 03:18:48,779 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.54 vs. limit=6.0 2023-10-14 03:19:00,643 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1575564.6666666667, ans=10.0 2023-10-14 03:19:01,662 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1575564.6666666667, ans=0.0 2023-10-14 03:19:06,086 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_na.min_abs, batch_count=1575564.6666666667, ans=0.02 2023-10-14 03:19:23,298 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1575658.0, ans=0.2 2023-10-14 03:19:27,600 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1575658.0, ans=0.125 2023-10-14 03:19:29,058 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1575658.0, ans=0.1 2023-10-14 03:19:54,958 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1575798.0, ans=0.2 2023-10-14 03:19:57,857 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1575798.0, ans=0.1 2023-10-14 03:20:00,115 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.534e+02 1.788e+02 1.939e+02 2.124e+02 2.768e+02, threshold=3.878e+02, percent-clipped=0.0 2023-10-14 03:20:00,405 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1575798.0, ans=0.04949747468305833 2023-10-14 03:20:23,671 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1575938.0, ans=0.125 2023-10-14 03:20:32,501 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1575938.0, ans=0.0 2023-10-14 03:20:42,814 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1575984.6666666667, ans=0.1 2023-10-14 03:20:44,441 INFO [train.py:1031] (3/4) Epoch 25, batch 10000, loss[loss=0.1885, simple_loss=0.278, pruned_loss=0.0495, over 16392.00 frames. ], tot_loss[loss=0.1863, simple_loss=0.2786, pruned_loss=0.04706, over 32576414.35 frames. ], batch size: 50, lr: 1.37e-03, grad_scale: 32.0 2023-10-14 03:20:48,704 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.42 vs. limit=6.0 2023-10-14 03:20:50,863 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer_ff2.min_abs, batch_count=1576031.3333333333, ans=0.1 2023-10-14 03:20:55,474 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1576078.0, ans=0.125 2023-10-14 03:21:06,194 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1576124.6666666667, ans=0.125 2023-10-14 03:21:20,340 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1576171.3333333333, ans=0.0 2023-10-14 03:21:29,695 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1576218.0, ans=0.1 2023-10-14 03:21:42,410 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1576264.6666666667, ans=0.125 2023-10-14 03:21:44,023 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.593e+02 1.836e+02 2.001e+02 2.218e+02 3.537e+02, threshold=4.003e+02, percent-clipped=0.0 2023-10-14 03:21:44,567 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.34 vs. limit=15.0 2023-10-14 03:21:56,474 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 03:22:03,728 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1576358.0, ans=0.035 2023-10-14 03:22:21,029 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.37 vs. limit=22.5 2023-10-14 03:22:25,436 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.27 vs. limit=22.5 2023-10-14 03:22:48,544 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=10.87 vs. limit=22.5 2023-10-14 03:23:01,402 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1576591.3333333333, ans=0.125 2023-10-14 03:23:01,544 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.13 vs. 
limit=15.0 2023-10-14 03:23:10,889 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1576638.0, ans=10.0 2023-10-14 03:23:25,801 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.28 vs. limit=12.0 2023-10-14 03:23:28,890 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1576731.3333333333, ans=0.125 2023-10-14 03:23:37,250 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.434e+02 1.807e+02 1.989e+02 2.260e+02 3.052e+02, threshold=3.977e+02, percent-clipped=0.0 2023-10-14 03:23:37,632 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1576731.3333333333, ans=0.0 2023-10-14 03:23:45,670 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-14 03:24:06,783 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1576871.3333333333, ans=0.125 2023-10-14 03:24:34,264 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.92 vs. limit=15.0 2023-10-14 03:24:40,888 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1577011.3333333333, ans=0.125 2023-10-14 03:24:42,207 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1577011.3333333333, ans=0.125 2023-10-14 03:24:43,382 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1577011.3333333333, ans=0.2 2023-10-14 03:24:46,277 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1577011.3333333333, ans=0.125 2023-10-14 03:25:28,340 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.562e+02 1.828e+02 2.014e+02 2.193e+02 3.297e+02, threshold=4.028e+02, percent-clipped=0.0 2023-10-14 03:25:28,640 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1577198.0, ans=0.0 2023-10-14 03:25:32,953 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1577244.6666666667, ans=0.125 2023-10-14 03:25:36,742 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.12 vs. limit=15.0 2023-10-14 03:25:58,642 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1577338.0, ans=0.0 2023-10-14 03:26:01,199 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1577338.0, ans=0.0 2023-10-14 03:26:01,319 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1577338.0, ans=0.125 2023-10-14 03:26:01,435 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.41 vs. 
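Each [scaling.py:979] Whitening entry compares a module's current whitening metric against its (possibly scheduled) limit, and the hook only applies its corrective gradient when metric exceeds limit, so a line like metric=3.92 vs. limit=15.0 records an inactive constraint. Under the assumption that the metric measures how far the feature covariance is from isotropic, one plausible formulation (mean squared eigenvalue over squared mean eigenvalue, which is 1 for perfectly white features) is sketched below; the exact definition lives in icefall's scaling.py and may differ.

```python
import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    """E[lam^2] / (E[lam])^2 over covariance eigenvalues, per channel group:
    1.0 for an isotropic covariance, larger as few directions dominate."""
    n, c = x.shape
    assert c % num_groups == 0
    vals = []
    for g in x.reshape(n, num_groups, c // num_groups).unbind(dim=1):
        cov = (g.T @ g) / n                     # per-group covariance
        lam = torch.linalg.eigvalsh(cov)        # real, non-negative spectrum
        vals.append((lam ** 2).mean() / lam.mean() ** 2)
    return float(torch.stack(vals).mean())

x = torch.randn(4096, 384)                          # roughly white features
print(whitening_metric(x))                          # ~1 plus sampling noise
print(whitening_metric(x * torch.linspace(0.2, 3.0, 384)))  # clearly larger
```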
limit=15.0 2023-10-14 03:26:04,810 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.63 vs. limit=8.0 2023-10-14 03:26:11,113 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.02 vs. limit=22.5 2023-10-14 03:26:12,828 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.53 vs. limit=15.0 2023-10-14 03:26:37,798 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1577478.0, ans=0.0 2023-10-14 03:26:47,246 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1577524.6666666667, ans=0.125 2023-10-14 03:26:51,202 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1577571.3333333333, ans=0.1 2023-10-14 03:26:55,557 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1577571.3333333333, ans=0.0 2023-10-14 03:26:57,797 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.09 vs. limit=22.5 2023-10-14 03:27:12,988 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1577664.6666666667, ans=0.125 2023-10-14 03:27:23,634 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.504e+02 1.846e+02 1.968e+02 2.166e+02 2.797e+02, threshold=3.935e+02, percent-clipped=0.0 2023-10-14 03:27:32,407 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1577711.3333333333, ans=0.125 2023-10-14 03:27:41,931 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.33 vs. limit=15.0 2023-10-14 03:27:53,133 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1577804.6666666667, ans=0.0 2023-10-14 03:28:15,011 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.24 vs. 
limit=6.0 2023-10-14 03:28:33,163 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1577991.3333333333, ans=0.0 2023-10-14 03:28:36,583 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1577991.3333333333, ans=0.125 2023-10-14 03:28:39,989 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1577991.3333333333, ans=0.0 2023-10-14 03:29:13,817 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1578131.3333333333, ans=0.1 2023-10-14 03:29:15,482 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.450e+02 1.747e+02 1.890e+02 2.109e+02 2.629e+02, threshold=3.781e+02, percent-clipped=0.0 2023-10-14 03:29:25,943 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1578178.0, ans=0.1 2023-10-14 03:29:27,376 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1578178.0, ans=0.125 2023-10-14 03:29:51,695 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1578318.0, ans=0.0 2023-10-14 03:29:55,768 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1578318.0, ans=0.05 2023-10-14 03:30:01,294 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1578318.0, ans=0.0 2023-10-14 03:30:02,939 INFO [train.py:1031] (3/4) Epoch 25, batch 10500, loss[loss=0.1881, simple_loss=0.2818, pruned_loss=0.04721, over 16703.00 frames. ], tot_loss[loss=0.1868, simple_loss=0.2791, pruned_loss=0.04724, over 32628066.09 frames. ], batch size: 202, lr: 1.37e-03, grad_scale: 16.0 2023-10-14 03:30:13,346 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.77 vs. 
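grad_scale in the progress entries is the dynamic loss-scaling factor of mixed-precision training: it doubles while optimizer steps stay finite (8.0 at batch 9000, 16.0 at 9500, 32.0 at 10000) and is halved when an overflow forces a skipped step, which is the likely reason it is back at 16.0 in the batch 10500 entry above. The standard PyTorch pattern producing this behavior is sketched below, with placeholder model, loss, and loader names.

```python
import torch

scaler = torch.cuda.amp.GradScaler()     # scaler.get_scale() is the grad_scale
                                         # value printed in the progress lines
for batch in train_loader:               # placeholder dataloader
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():      # fp16 forward pass
        loss = compute_loss(model, batch)    # placeholder loss function
    scaler.scale(loss).backward()        # backward on the scaled loss
    scaler.step(optimizer)               # skipped if any grad is inf/nan
    scaler.update()                      # grows the scale after a streak of
                                         # finite steps; halves it on overflow
```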
limit=15.0 2023-10-14 03:30:25,894 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1578458.0, ans=0.125 2023-10-14 03:30:34,723 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1578504.6666666667, ans=0.125 2023-10-14 03:30:48,518 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1578551.3333333333, ans=0.1 2023-10-14 03:30:53,550 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1578551.3333333333, ans=0.2 2023-10-14 03:31:04,578 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.529e+02 1.947e+02 2.191e+02 2.473e+02 3.920e+02, threshold=4.382e+02, percent-clipped=1.0 2023-10-14 03:31:07,586 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1578644.6666666667, ans=0.0 2023-10-14 03:31:09,491 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1578644.6666666667, ans=0.2 2023-10-14 03:31:16,738 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 03:31:18,287 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1578691.3333333333, ans=0.125 2023-10-14 03:31:22,385 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 03:31:38,476 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1578738.0, ans=0.2 2023-10-14 03:31:44,116 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=3.87 vs. limit=10.0 2023-10-14 03:31:46,400 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1578784.6666666667, ans=0.0 2023-10-14 03:32:06,789 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1578831.3333333333, ans=0.0 2023-10-14 03:32:41,250 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.81 vs. 
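The [optim.py:471] entries track the optimizer's norm-based gradient clipping: the five values are quantiles (min / 25% / median / 75% / max) of recent gradient norms, the threshold follows Clipping_scale times the median (in the entry above, 2.0 x 2.191e+02 = 4.382e+02), and percent-clipped is the percentage of recent updates whose norm exceeded the threshold in force at the time, mostly 0.0 in this stretch with an occasional 1.0. A hedged sketch of that bookkeeping; the window size and reset-per-report behavior are assumptions.

```python
from collections import deque
import torch

class ClipStats:
    """Median-relative gradient clipping with quantile reporting."""
    def __init__(self, clipping_scale: float = 2.0, window: int = 400):
        self.scale = clipping_scale
        self.norms = deque(maxlen=window)       # recent raw gradient norms
        self.clipped = self.seen = 0

    def observe(self, grad_norm: float) -> float:
        """Record one step; return the factor (<= 1.0) to scale its gradient by."""
        self.norms.append(grad_norm)
        median = sorted(self.norms)[len(self.norms) // 2]
        threshold = self.scale * median
        self.seen += 1
        if grad_norm > threshold:
            self.clipped += 1
            return threshold / grad_norm
        return 1.0

    def report(self) -> str:
        t = torch.tensor(list(self.norms), dtype=torch.float32)
        q = torch.quantile(t, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        pct = 100.0 * self.clipped / max(self.seen, 1)
        self.clipped = self.seen = 0            # fresh stats per report interval
        quart = " ".join(f"{v:.3e}" for v in q.tolist())
        return (f"grad-norm quartiles {quart}, "
                f"threshold={self.scale * q[2].item():.3e}, percent-clipped={pct}")
```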
limit=10.0 2023-10-14 03:32:47,222 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1579018.0, ans=0.0 2023-10-14 03:32:53,898 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1579064.6666666667, ans=0.04949747468305833 2023-10-14 03:32:58,323 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten.whitening_limit, batch_count=1579064.6666666667, ans=15.0 2023-10-14 03:33:03,696 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.536e+02 1.861e+02 1.992e+02 2.125e+02 2.838e+02, threshold=3.984e+02, percent-clipped=0.0 2023-10-14 03:33:12,758 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1579111.3333333333, ans=0.125 2023-10-14 03:33:17,419 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1579158.0, ans=0.125 2023-10-14 03:33:33,434 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.41 vs. limit=15.0 2023-10-14 03:34:00,602 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.73 vs. limit=15.0 2023-10-14 03:34:09,767 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1579344.6666666667, ans=0.125 2023-10-14 03:34:19,492 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.26 vs. limit=15.0 2023-10-14 03:34:24,525 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1579438.0, ans=0.0 2023-10-14 03:34:27,801 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1579438.0, ans=0.1 2023-10-14 03:34:27,924 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1579438.0, ans=0.125 2023-10-14 03:34:27,927 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1579438.0, ans=0.2 2023-10-14 03:34:52,907 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1579531.3333333333, ans=0.0 2023-10-14 03:34:53,760 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1579531.3333333333, ans=0.125 2023-10-14 03:34:54,516 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.476e+02 1.813e+02 1.997e+02 2.193e+02 3.314e+02, threshold=3.994e+02, percent-clipped=0.0 2023-10-14 03:34:58,088 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.89 vs. 
limit=15.0 2023-10-14 03:35:11,152 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1579624.6666666667, ans=0.0 2023-10-14 03:35:14,983 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1579624.6666666667, ans=0.125 2023-10-14 03:35:17,616 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1579624.6666666667, ans=0.125 2023-10-14 03:35:33,819 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1579718.0, ans=0.04949747468305833 2023-10-14 03:35:42,958 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1579764.6666666667, ans=0.1 2023-10-14 03:35:46,649 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.76 vs. limit=15.0 2023-10-14 03:35:50,241 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1579811.3333333333, ans=0.2 2023-10-14 03:36:06,994 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1579858.0, ans=0.125 2023-10-14 03:36:07,788 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1579858.0, ans=0.5 2023-10-14 03:36:07,817 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1579858.0, ans=0.5 2023-10-14 03:36:40,945 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1579998.0, ans=0.1 2023-10-14 03:36:41,799 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.547e+02 1.897e+02 2.045e+02 2.310e+02 3.292e+02, threshold=4.089e+02, percent-clipped=0.0 2023-10-14 03:36:42,222 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.49 vs. limit=12.0 2023-10-14 03:37:02,720 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1580091.3333333333, ans=0.2 2023-10-14 03:37:10,260 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1580138.0, ans=0.0 2023-10-14 03:37:28,427 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1580184.6666666667, ans=0.125 2023-10-14 03:37:44,768 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.97 vs. 
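Many of the ScheduledFloat names above parameterize Balancer modules: min_positive / max_positive bound the fraction of positive values a channel may take, min_abs / max_abs bound its mean absolute value, and prob is the probability with which the correction is applied on a given step. As an illustration of the constraint those numbers describe, here is a hedged penalty-style formulation; icefall enforces the band by direct gradient manipulation rather than an explicit loss term, and the default bounds below are assumptions.

```python
import torch

def balancer_penalty(x: torch.Tensor,
                     min_positive: float = 0.05, max_positive: float = 0.95,
                     min_abs: float = 0.2, max_abs: float = 10.0) -> torch.Tensor:
    """Zero while per-channel statistics of x (shape [N, C]) stay in band:
    positive fraction within [min_positive, max_positive] and mean |x|
    within [min_abs, max_abs]; grows quadratically outside the band."""
    pos_frac = torch.sigmoid(x / 0.1).mean(dim=0)   # soft positive fraction
    mean_abs = x.abs().mean(dim=0)
    pen = ((min_positive - pos_frac).clamp(min=0.0) ** 2
           + (pos_frac - max_positive).clamp(min=0.0) ** 2
           + (min_abs - mean_abs).clamp(min=0.0) ** 2
           + (mean_abs - max_abs).clamp(min=0.0) ** 2)
    return pen.sum()
```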
limit=15.0 2023-10-14 03:37:47,440 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1580278.0, ans=0.1 2023-10-14 03:38:02,817 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1580371.3333333333, ans=0.07 2023-10-14 03:38:07,288 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1580371.3333333333, ans=0.0 2023-10-14 03:38:14,183 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1580418.0, ans=0.125 2023-10-14 03:38:19,799 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1580418.0, ans=0.0 2023-10-14 03:38:23,074 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=9.73 vs. limit=22.5 2023-10-14 03:38:24,607 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1580464.6666666667, ans=0.125 2023-10-14 03:38:27,957 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1580464.6666666667, ans=0.125 2023-10-14 03:38:35,903 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.722e+02 1.877e+02 2.049e+02 3.003e+02, threshold=3.753e+02, percent-clipped=0.0 2023-10-14 03:38:42,475 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1580511.3333333333, ans=0.05 2023-10-14 03:38:47,638 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1580558.0, ans=0.1 2023-10-14 03:38:51,456 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1580558.0, ans=10.0 2023-10-14 03:39:10,819 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1580651.3333333333, ans=0.125 2023-10-14 03:39:10,918 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 03:39:13,341 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.70 vs. limit=22.5 2023-10-14 03:39:18,135 INFO [train.py:1031] (3/4) Epoch 25, batch 11000, loss[loss=0.2323, simple_loss=0.3031, pruned_loss=0.08076, over 15670.00 frames. ], tot_loss[loss=0.1867, simple_loss=0.279, pruned_loss=0.0472, over 32651942.76 frames. ], batch size: 350, lr: 1.37e-03, grad_scale: 16.0 2023-10-14 03:39:19,852 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1580698.0, ans=0.125 2023-10-14 03:39:27,538 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.51 vs. 
limit=6.0 2023-10-14 03:39:42,977 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 03:39:50,726 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1580838.0, ans=0.1 2023-10-14 03:39:51,666 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1580838.0, ans=0.125 2023-10-14 03:39:54,449 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1580838.0, ans=0.125 2023-10-14 03:40:09,452 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1580884.6666666667, ans=0.04949747468305833 2023-10-14 03:40:21,063 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1580931.3333333333, ans=0.0 2023-10-14 03:40:25,031 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.598e+02 1.890e+02 2.048e+02 2.245e+02 3.372e+02, threshold=4.097e+02, percent-clipped=0.0 2023-10-14 03:40:31,449 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1580978.0, ans=0.125 2023-10-14 03:40:43,692 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=3.71 vs. limit=12.0 2023-10-14 03:41:31,717 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.41 vs. limit=15.0 2023-10-14 03:41:38,305 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.89 vs. limit=6.0 2023-10-14 03:41:50,035 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.33 vs. limit=15.0 2023-10-14 03:41:50,932 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1581304.6666666667, ans=0.0 2023-10-14 03:42:20,089 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1581398.0, ans=0.2 2023-10-14 03:42:23,244 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.99 vs. limit=22.5 2023-10-14 03:42:25,358 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1581398.0, ans=0.125 2023-10-14 03:42:28,027 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.423e+02 1.716e+02 1.873e+02 2.034e+02 2.622e+02, threshold=3.747e+02, percent-clipped=0.0 2023-10-14 03:42:29,360 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1581444.6666666667, ans=0.125 2023-10-14 03:42:37,180 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.47 vs. 
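The [scaling.py:1069] WithLoss entries report an auxiliary loss attached to self-attention-weight modules; loss-sum=0.000e+00 on every such line in this section means the penalty currently contributes nothing. The underlying mechanism, an identity in the forward pass that injects an extra penalty gradient in the backward pass, can be sketched as follows; the concrete penalty chosen here (an out-of-range activation cost) is an illustrative assumption, not icefall's definition.

```python
import torch

class IdentityWithPenalty(torch.autograd.Function):
    """Forward: identity. Backward: adds the gradient of
    weight * relu(|x| - limit)**2, nudging out-of-range activations
    back without changing the forward computation at all."""
    @staticmethod
    def forward(ctx, x, limit, weight):
        ctx.save_for_backward(x)
        ctx.limit, ctx.weight = limit, weight
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        excess = (x.abs() - ctx.limit).clamp(min=0.0)   # zero inside the band
        return grad_out + 2.0 * ctx.weight * excess * x.sign(), None, None

x = (5.0 * torch.randn(8, 16)).requires_grad_(True)
y = IdentityWithPenalty.apply(x, 3.0, 0.01)
y.sum().backward()            # x.grad now includes the penalty contribution
```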
limit=15.0 2023-10-14 03:42:41,269 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1581491.3333333333, ans=0.0 2023-10-14 03:42:42,697 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.83 vs. limit=12.0 2023-10-14 03:42:47,321 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1581491.3333333333, ans=0.1 2023-10-14 03:42:56,651 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1581538.0, ans=0.0 2023-10-14 03:43:04,875 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1581584.6666666667, ans=0.125 2023-10-14 03:43:16,610 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1581631.3333333333, ans=0.1 2023-10-14 03:43:26,579 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.83 vs. limit=15.0 2023-10-14 03:43:37,324 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1581724.6666666667, ans=0.125 2023-10-14 03:43:37,547 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.75 vs. limit=6.0 2023-10-14 03:43:54,182 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1581771.3333333333, ans=0.1 2023-10-14 03:43:59,412 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.80 vs. limit=15.0 2023-10-14 03:44:07,973 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1581864.6666666667, ans=0.125 2023-10-14 03:44:13,206 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1581864.6666666667, ans=0.125 2023-10-14 03:44:18,436 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1581911.3333333333, ans=0.125 2023-10-14 03:44:18,973 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.452e+02 1.804e+02 1.994e+02 2.206e+02 3.489e+02, threshold=3.988e+02, percent-clipped=0.0 2023-10-14 03:44:20,878 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1581911.3333333333, ans=0.0 2023-10-14 03:44:47,238 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1582004.6666666667, ans=0.1 2023-10-14 03:45:23,412 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1582144.6666666667, ans=0.125 2023-10-14 03:45:25,296 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.73 vs. 
limit=15.0 2023-10-14 03:45:50,979 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1582238.0, ans=0.125 2023-10-14 03:46:12,659 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1582331.3333333333, ans=0.125 2023-10-14 03:46:17,427 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=9.37 vs. limit=22.5 2023-10-14 03:46:17,756 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.479e+02 1.840e+02 2.018e+02 2.259e+02 3.158e+02, threshold=4.037e+02, percent-clipped=0.0 2023-10-14 03:46:19,425 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1582378.0, ans=0.035 2023-10-14 03:46:32,131 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1582424.6666666667, ans=0.0 2023-10-14 03:46:32,295 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.90 vs. limit=10.0 2023-10-14 03:46:40,955 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1582471.3333333333, ans=0.2 2023-10-14 03:46:51,023 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1582518.0, ans=0.1 2023-10-14 03:46:53,078 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1582518.0, ans=0.1 2023-10-14 03:47:09,219 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.58 vs. 
limit=15.0 2023-10-14 03:47:15,185 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1582611.3333333333, ans=0.2 2023-10-14 03:47:21,750 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1582611.3333333333, ans=0.125 2023-10-14 03:47:23,201 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1582658.0, ans=0.125 2023-10-14 03:47:28,437 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1582658.0, ans=0.125 2023-10-14 03:47:29,755 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1582658.0, ans=0.0 2023-10-14 03:47:41,938 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1582704.6666666667, ans=0.125 2023-10-14 03:47:44,700 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1582704.6666666667, ans=0.015 2023-10-14 03:47:53,057 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1582751.3333333333, ans=0.5 2023-10-14 03:47:54,672 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1582751.3333333333, ans=0.125 2023-10-14 03:48:02,994 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.19 vs. limit=10.0 2023-10-14 03:48:11,533 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.612e+02 1.891e+02 2.027e+02 2.213e+02 3.067e+02, threshold=4.055e+02, percent-clipped=0.0 2023-10-14 03:48:18,966 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1582844.6666666667, ans=0.0 2023-10-14 03:48:27,612 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1582891.3333333333, ans=0.0 2023-10-14 03:48:29,809 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.44 vs. limit=15.0 2023-10-14 03:48:34,516 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.76 vs. limit=22.5 2023-10-14 03:48:42,900 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.99 vs. limit=15.0 2023-10-14 03:48:56,740 INFO [train.py:1031] (3/4) Epoch 25, batch 11500, loss[loss=0.1994, simple_loss=0.2915, pruned_loss=0.05362, over 16902.00 frames. ], tot_loss[loss=0.1865, simple_loss=0.2788, pruned_loss=0.04712, over 32673093.16 frames. ], batch size: 82, lr: 1.37e-03, grad_scale: 32.0 2023-10-14 03:48:56,946 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1583031.3333333333, ans=0.125 2023-10-14 03:49:09,343 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.49 vs. 
limit=22.5 2023-10-14 03:49:10,012 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1583078.0, ans=0.125 2023-10-14 03:49:15,389 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1583078.0, ans=0.95 2023-10-14 03:49:15,441 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1583078.0, ans=0.1 2023-10-14 03:49:37,440 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1583171.3333333333, ans=0.04949747468305833 2023-10-14 03:50:03,536 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.54 vs. limit=15.0 2023-10-14 03:50:03,662 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=9.12 vs. limit=22.5 2023-10-14 03:50:04,879 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.473e+02 1.829e+02 2.014e+02 2.258e+02 2.902e+02, threshold=4.027e+02, percent-clipped=0.0 2023-10-14 03:50:19,097 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 03:50:20,678 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1583358.0, ans=0.125 2023-10-14 03:50:24,460 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1583358.0, ans=0.1 2023-10-14 03:51:03,977 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1583498.0, ans=0.125 2023-10-14 03:51:05,347 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.69 vs. limit=15.0 2023-10-14 03:51:07,610 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1583544.6666666667, ans=0.125 2023-10-14 03:51:25,642 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.00 vs. limit=15.0 2023-10-14 03:51:30,505 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1583638.0, ans=0.2 2023-10-14 03:51:42,496 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1583684.6666666667, ans=0.0 2023-10-14 03:51:45,486 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1583684.6666666667, ans=0.0 2023-10-14 03:51:47,247 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.20 vs. 
limit=22.5 2023-10-14 03:51:50,802 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1583731.3333333333, ans=0.0 2023-10-14 03:51:52,158 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1583731.3333333333, ans=0.125 2023-10-14 03:52:02,298 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.486e+02 1.784e+02 1.928e+02 2.155e+02 4.130e+02, threshold=3.856e+02, percent-clipped=1.0 2023-10-14 03:52:32,128 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.83 vs. limit=15.0 2023-10-14 03:52:40,485 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1583918.0, ans=0.0 2023-10-14 03:52:47,927 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1583964.6666666667, ans=0.125 2023-10-14 03:52:50,281 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.94 vs. limit=6.0 2023-10-14 03:52:52,948 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1583964.6666666667, ans=0.2 2023-10-14 03:53:12,650 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.40 vs. limit=6.0 2023-10-14 03:53:28,775 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1584151.3333333333, ans=0.035 2023-10-14 03:53:33,121 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1584151.3333333333, ans=0.125 2023-10-14 03:53:36,822 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1584151.3333333333, ans=0.125 2023-10-14 03:54:01,046 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.438e+02 1.811e+02 1.953e+02 2.098e+02 2.891e+02, threshold=3.906e+02, percent-clipped=0.0 2023-10-14 03:54:08,014 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1584244.6666666667, ans=0.125 2023-10-14 03:54:31,410 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1584338.0, ans=0.0 2023-10-14 03:54:43,644 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1584384.6666666667, ans=0.125 2023-10-14 03:54:44,416 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1584384.6666666667, ans=0.0 2023-10-14 03:54:50,190 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.63 vs. 
limit=6.0 2023-10-14 03:54:55,432 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 03:55:16,537 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1584524.6666666667, ans=0.0 2023-10-14 03:55:19,290 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1584524.6666666667, ans=0.2 2023-10-14 03:55:29,750 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1584571.3333333333, ans=0.125 2023-10-14 03:55:32,976 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1584571.3333333333, ans=0.0 2023-10-14 03:55:34,953 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1584571.3333333333, ans=0.125 2023-10-14 03:55:45,317 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1584618.0, ans=0.1 2023-10-14 03:55:51,309 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=13.44 vs. limit=15.0 2023-10-14 03:55:56,889 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=3.77 vs. limit=15.0 2023-10-14 03:55:57,564 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1584664.6666666667, ans=0.125 2023-10-14 03:56:00,906 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.524e+02 1.807e+02 1.991e+02 2.168e+02 2.837e+02, threshold=3.983e+02, percent-clipped=0.0 2023-10-14 03:56:02,606 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.15 vs. limit=15.0 2023-10-14 03:56:06,132 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.84 vs. limit=15.0 2023-10-14 03:56:06,839 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1584711.3333333333, ans=0.2 2023-10-14 03:56:07,797 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1584711.3333333333, ans=0.2 2023-10-14 03:56:11,538 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1584758.0, ans=0.0 2023-10-14 03:56:34,726 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1584804.6666666667, ans=0.0 2023-10-14 03:56:41,631 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.44 vs. limit=22.5 2023-10-14 03:57:09,143 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1584991.3333333333, ans=0.0 2023-10-14 03:57:22,443 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.14 vs. 
limit=10.0 2023-10-14 03:57:41,393 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1585084.6666666667, ans=0.0 2023-10-14 03:57:42,616 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1585084.6666666667, ans=0.2 2023-10-14 03:57:57,273 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.482e+02 1.764e+02 1.945e+02 2.100e+02 2.828e+02, threshold=3.889e+02, percent-clipped=0.0 2023-10-14 03:57:58,534 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 03:58:06,762 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1585178.0, ans=0.1 2023-10-14 03:58:41,679 INFO [train.py:1031] (3/4) Epoch 25, batch 12000, loss[loss=0.1865, simple_loss=0.2766, pruned_loss=0.04819, over 16565.00 frames. ], tot_loss[loss=0.1863, simple_loss=0.2788, pruned_loss=0.04691, over 32698870.39 frames. ], batch size: 66, lr: 1.36e-03, grad_scale: 32.0 2023-10-14 03:58:52,164 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1585364.6666666667, ans=0.0 2023-10-14 03:59:02,951 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.36 vs. limit=6.0 2023-10-14 03:59:04,522 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1585411.3333333333, ans=0.125 2023-10-14 03:59:14,418 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.33 vs. limit=15.0 2023-10-14 03:59:36,026 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.39 vs. 
limit=12.0 2023-10-14 03:59:38,027 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1585551.3333333333, ans=0.1 2023-10-14 03:59:44,662 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=1585598.0, ans=0.07 2023-10-14 03:59:52,737 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.456e+02 1.840e+02 2.019e+02 2.275e+02 3.419e+02, threshold=4.037e+02, percent-clipped=0.0 2023-10-14 04:00:15,413 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1585738.0, ans=0.2 2023-10-14 04:00:42,420 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1585831.3333333333, ans=0.125 2023-10-14 04:00:43,410 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 04:00:57,348 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=1585878.0, ans=0.5 2023-10-14 04:00:59,902 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1585924.6666666667, ans=0.125 2023-10-14 04:00:59,953 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1585924.6666666667, ans=0.125 2023-10-14 04:01:03,840 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1585924.6666666667, ans=0.125 2023-10-14 04:01:14,996 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1585971.3333333333, ans=0.0 2023-10-14 04:01:44,320 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.566e+02 1.768e+02 1.931e+02 2.169e+02 3.392e+02, threshold=3.863e+02, percent-clipped=0.0 2023-10-14 04:01:47,170 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1586111.3333333333, ans=0.0 2023-10-14 04:02:00,113 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1586158.0, ans=0.1 2023-10-14 04:02:13,547 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1586204.6666666667, ans=0.2 2023-10-14 04:02:18,212 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1586251.3333333333, ans=0.125 2023-10-14 04:02:33,376 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1586298.0, ans=0.0 2023-10-14 04:02:40,131 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 04:02:50,612 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1586391.3333333333, ans=0.125 2023-10-14 04:02:55,807 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.58 vs. 
limit=15.0 2023-10-14 04:03:30,966 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 04:03:32,858 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1586578.0, ans=0.0 2023-10-14 04:03:33,551 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.620e+02 1.854e+02 2.003e+02 2.173e+02 4.272e+02, threshold=4.005e+02, percent-clipped=1.0 2023-10-14 04:03:35,815 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1586578.0, ans=0.1 2023-10-14 04:03:36,787 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1586578.0, ans=0.125 2023-10-14 04:03:37,708 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1586578.0, ans=0.125 2023-10-14 04:04:05,746 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1586671.3333333333, ans=0.0 2023-10-14 04:04:05,756 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1586671.3333333333, ans=0.0 2023-10-14 04:04:23,473 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1586764.6666666667, ans=0.1 2023-10-14 04:04:38,496 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1586811.3333333333, ans=0.07 2023-10-14 04:04:38,508 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1586811.3333333333, ans=0.0 2023-10-14 04:04:41,102 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.28 vs. limit=15.0 2023-10-14 04:04:53,295 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1586904.6666666667, ans=0.0 2023-10-14 04:04:59,207 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1586904.6666666667, ans=0.0 2023-10-14 04:05:04,658 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.33 vs. limit=15.0 2023-10-14 04:05:06,621 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1586951.3333333333, ans=0.2 2023-10-14 04:05:09,633 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 04:05:09,656 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1586951.3333333333, ans=0.2 2023-10-14 04:05:11,490 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.51 vs. 
limit=12.0 2023-10-14 04:05:28,260 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.456e+02 1.787e+02 1.905e+02 2.121e+02 2.986e+02, threshold=3.810e+02, percent-clipped=0.0 2023-10-14 04:05:29,892 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1587044.6666666667, ans=0.2 2023-10-14 04:05:34,475 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.32 vs. limit=22.5 2023-10-14 04:06:26,378 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1587278.0, ans=0.125 2023-10-14 04:06:28,793 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1587278.0, ans=0.125 2023-10-14 04:06:30,733 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1587278.0, ans=0.0 2023-10-14 04:06:32,641 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1587278.0, ans=0.125 2023-10-14 04:06:36,835 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.36 vs. limit=15.0 2023-10-14 04:06:44,259 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.71 vs. limit=15.0 2023-10-14 04:06:49,111 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.54 vs. limit=15.0 2023-10-14 04:07:02,288 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1587418.0, ans=0.125 2023-10-14 04:07:03,699 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1587418.0, ans=0.125 2023-10-14 04:07:07,999 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1587418.0, ans=0.0 2023-10-14 04:07:24,633 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.461e+02 1.843e+02 1.993e+02 2.239e+02 3.740e+02, threshold=3.986e+02, percent-clipped=0.0 2023-10-14 04:07:25,056 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1587511.3333333333, ans=0.125 2023-10-14 04:07:30,414 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1587511.3333333333, ans=0.0 2023-10-14 04:07:38,676 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1587558.0, ans=0.1 2023-10-14 04:07:46,440 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1587604.6666666667, ans=0.125 2023-10-14 04:07:51,877 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1587604.6666666667, ans=0.1 2023-10-14 04:07:59,968 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.24 vs. 
limit=6.0 2023-10-14 04:08:09,075 INFO [train.py:1031] (3/4) Epoch 25, batch 12500, loss[loss=0.2012, simple_loss=0.2936, pruned_loss=0.05439, over 16904.00 frames. ], tot_loss[loss=0.1862, simple_loss=0.2786, pruned_loss=0.04692, over 32739408.14 frames. ], batch size: 130, lr: 1.36e-03, grad_scale: 32.0 2023-10-14 04:08:20,717 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1587744.6666666667, ans=0.0 2023-10-14 04:08:31,532 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.99 vs. limit=15.0 2023-10-14 04:08:52,852 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1587884.6666666667, ans=0.125 2023-10-14 04:08:58,116 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.30 vs. limit=6.0 2023-10-14 04:09:15,889 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.540e+02 1.752e+02 1.888e+02 2.132e+02 2.843e+02, threshold=3.776e+02, percent-clipped=0.0 2023-10-14 04:09:21,966 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1587978.0, ans=0.0 2023-10-14 04:09:23,782 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1588024.6666666667, ans=0.0 2023-10-14 04:09:51,152 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.23 vs. limit=10.0 2023-10-14 04:09:57,069 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1588118.0, ans=0.125 2023-10-14 04:09:59,081 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1588164.6666666667, ans=0.1 2023-10-14 04:10:08,412 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1588164.6666666667, ans=0.125 2023-10-14 04:10:17,676 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1588211.3333333333, ans=0.125 2023-10-14 04:10:18,731 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1588211.3333333333, ans=0.125 2023-10-14 04:10:35,059 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=12.72 vs. limit=22.5 2023-10-14 04:11:09,996 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.841e+02 2.060e+02 2.337e+02 3.341e+02, threshold=4.120e+02, percent-clipped=0.0 2023-10-14 04:11:42,429 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.09 vs. limit=15.0 2023-10-14 04:11:46,969 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1588584.6666666667, ans=0.125 2023-10-14 04:11:55,264 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.60 vs. 
limit=15.0 2023-10-14 04:12:01,400 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1588678.0, ans=0.125 2023-10-14 04:12:08,443 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.66 vs. limit=6.0 2023-10-14 04:12:12,598 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1588724.6666666667, ans=0.125 2023-10-14 04:12:27,962 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1588771.3333333333, ans=0.1 2023-10-14 04:12:38,891 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.32 vs. limit=15.0 2023-10-14 04:12:51,576 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.41 vs. limit=15.0 2023-10-14 04:12:53,012 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.33 vs. limit=15.0 2023-10-14 04:12:57,349 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.408e+02 1.829e+02 2.004e+02 2.201e+02 3.015e+02, threshold=4.008e+02, percent-clipped=0.0 2023-10-14 04:13:00,499 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.46 vs. limit=22.5 2023-10-14 04:13:33,538 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1589051.3333333333, ans=0.0 2023-10-14 04:13:33,557 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1589051.3333333333, ans=0.125 2023-10-14 04:13:56,409 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1589144.6666666667, ans=0.0 2023-10-14 04:14:19,078 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.89 vs. 
limit=15.0 2023-10-14 04:14:24,276 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1589284.6666666667, ans=0.125 2023-10-14 04:14:25,259 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1589284.6666666667, ans=0.125 2023-10-14 04:14:28,742 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1589284.6666666667, ans=0.2 2023-10-14 04:14:42,198 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 04:14:47,519 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1589378.0, ans=0.0 2023-10-14 04:14:48,244 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.495e+02 1.826e+02 1.979e+02 2.232e+02 2.883e+02, threshold=3.958e+02, percent-clipped=0.0 2023-10-14 04:15:02,387 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1589424.6666666667, ans=0.0 2023-10-14 04:15:13,057 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=1589471.3333333333, ans=10.0 2023-10-14 04:15:14,160 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1589471.3333333333, ans=0.125 2023-10-14 04:15:33,507 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1589564.6666666667, ans=0.125 2023-10-14 04:15:34,784 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.97 vs. 
limit=10.0 2023-10-14 04:15:39,935 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1589611.3333333333, ans=0.125 2023-10-14 04:15:44,958 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=1589611.3333333333, ans=10.0 2023-10-14 04:15:53,775 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1589658.0, ans=0.125 2023-10-14 04:16:12,726 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1589751.3333333333, ans=0.125 2023-10-14 04:16:38,397 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.461e+02 1.778e+02 1.891e+02 2.122e+02 2.931e+02, threshold=3.782e+02, percent-clipped=0.0 2023-10-14 04:16:42,247 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1589844.6666666667, ans=0.05 2023-10-14 04:16:43,246 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1589844.6666666667, ans=0.04949747468305833 2023-10-14 04:16:57,728 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1589938.0, ans=0.0 2023-10-14 04:17:07,143 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1589984.6666666667, ans=0.2 2023-10-14 04:17:08,001 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1589984.6666666667, ans=0.2 2023-10-14 04:17:10,993 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1589984.6666666667, ans=0.125 2023-10-14 04:17:12,759 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1589984.6666666667, ans=0.125 2023-10-14 04:17:15,475 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 04:17:16,728 INFO [train.py:1031] (3/4) Epoch 25, batch 13000, loss[loss=0.1755, simple_loss=0.2678, pruned_loss=0.04166, over 16889.00 frames. ], tot_loss[loss=0.1868, simple_loss=0.2792, pruned_loss=0.04714, over 32720935.47 frames. 
], batch size: 146, lr: 1.36e-03, grad_scale: 16.0 2023-10-14 04:17:17,926 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1590031.3333333333, ans=0.125 2023-10-14 04:17:21,505 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1590031.3333333333, ans=0.125 2023-10-14 04:17:27,034 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1590078.0, ans=0.0 2023-10-14 04:17:27,844 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1590078.0, ans=0.125 2023-10-14 04:17:53,340 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1590124.6666666667, ans=0.125 2023-10-14 04:17:58,893 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1590171.3333333333, ans=0.125 2023-10-14 04:18:05,901 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.33 vs. limit=10.0 2023-10-14 04:18:36,056 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.529e+02 1.848e+02 2.011e+02 2.291e+02 3.112e+02, threshold=4.022e+02, percent-clipped=0.0 2023-10-14 04:18:49,108 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1590358.0, ans=0.125 2023-10-14 04:19:09,433 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.28 vs. limit=15.0 2023-10-14 04:19:19,595 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1590451.3333333333, ans=0.125 2023-10-14 04:19:23,899 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.33 vs. limit=12.0 2023-10-14 04:19:26,829 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1590498.0, ans=0.125 2023-10-14 04:20:10,737 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.72 vs. limit=15.0 2023-10-14 04:20:15,933 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1590731.3333333333, ans=0.125 2023-10-14 04:20:23,219 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.63 vs. limit=6.0 2023-10-14 04:20:27,452 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.47 vs. 
limit=15.0 2023-10-14 04:20:32,188 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.457e+02 1.813e+02 2.024e+02 2.276e+02 3.438e+02, threshold=4.049e+02, percent-clipped=0.0 2023-10-14 04:20:38,147 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1590824.6666666667, ans=0.1 2023-10-14 04:20:53,033 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.00 vs. limit=15.0 2023-10-14 04:21:02,053 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1590871.3333333333, ans=0.125 2023-10-14 04:21:18,120 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1590964.6666666667, ans=0.2 2023-10-14 04:21:25,318 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1590964.6666666667, ans=0.1 2023-10-14 04:21:34,837 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1591011.3333333333, ans=0.125 2023-10-14 04:22:10,664 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1591198.0, ans=0.2 2023-10-14 04:22:14,152 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1591198.0, ans=0.5 2023-10-14 04:22:17,175 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1591198.0, ans=0.125 2023-10-14 04:22:23,073 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.459e+02 1.739e+02 1.959e+02 2.224e+02 3.224e+02, threshold=3.918e+02, percent-clipped=0.0 2023-10-14 04:22:45,650 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.66 vs. limit=15.0 2023-10-14 04:22:53,743 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1591384.6666666667, ans=0.125 2023-10-14 04:22:55,720 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1591384.6666666667, ans=0.0 2023-10-14 04:23:04,374 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1591384.6666666667, ans=0.125 2023-10-14 04:23:16,355 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.21 vs. limit=15.0 2023-10-14 04:23:25,151 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1591478.0, ans=0.125 2023-10-14 04:23:25,240 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff2.min_abs, batch_count=1591478.0, ans=0.1 2023-10-14 04:23:36,724 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1591524.6666666667, ans=0.0 2023-10-14 04:23:39,051 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.81 vs. 
limit=15.0 2023-10-14 04:23:46,910 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1591571.3333333333, ans=0.125 2023-10-14 04:24:13,833 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.470e+02 1.862e+02 2.005e+02 2.165e+02 2.827e+02, threshold=4.011e+02, percent-clipped=0.0 2023-10-14 04:24:19,859 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.49 vs. limit=15.0 2023-10-14 04:24:47,587 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1591851.3333333333, ans=0.125 2023-10-14 04:24:48,500 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1591851.3333333333, ans=0.035 2023-10-14 04:25:01,788 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1591898.0, ans=0.125 2023-10-14 04:25:19,053 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1591944.6666666667, ans=0.0 2023-10-14 04:25:20,314 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.33 vs. limit=15.0 2023-10-14 04:25:44,334 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.whiten.whitening_limit, batch_count=1592084.6666666667, ans=12.0 2023-10-14 04:25:46,189 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1592084.6666666667, ans=0.125 2023-10-14 04:25:51,496 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1592084.6666666667, ans=0.07 2023-10-14 04:25:56,738 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1592131.3333333333, ans=0.125 2023-10-14 04:26:04,565 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.45 vs. limit=10.0 2023-10-14 04:26:06,844 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1592178.0, ans=0.1 2023-10-14 04:26:08,292 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.542e+02 1.787e+02 2.004e+02 2.276e+02 3.156e+02, threshold=4.008e+02, percent-clipped=0.0 2023-10-14 04:26:19,877 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1592224.6666666667, ans=0.125 2023-10-14 04:26:29,036 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1592271.3333333333, ans=0.125 2023-10-14 04:26:38,112 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.26 vs. limit=15.0 2023-10-14 04:26:38,159 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=14.76 vs. limit=22.5 2023-10-14 04:26:47,115 INFO [train.py:1031] (3/4) Epoch 25, batch 13500, loss[loss=0.1617, simple_loss=0.263, pruned_loss=0.03024, over 16891.00 frames. 
], tot_loss[loss=0.1864, simple_loss=0.2788, pruned_loss=0.04701, over 32765230.10 frames. ], batch size: 104, lr: 1.36e-03, grad_scale: 16.0 2023-10-14 04:27:32,778 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1592551.3333333333, ans=0.125 2023-10-14 04:27:34,149 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.46 vs. limit=12.0 2023-10-14 04:27:36,264 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1592551.3333333333, ans=0.0 2023-10-14 04:27:40,000 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1592551.3333333333, ans=0.0 2023-10-14 04:27:45,445 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.23 vs. limit=15.0 2023-10-14 04:27:54,791 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.14 vs. limit=10.0 2023-10-14 04:27:57,615 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.504e+02 1.784e+02 1.917e+02 2.135e+02 2.793e+02, threshold=3.834e+02, percent-clipped=0.0 2023-10-14 04:27:57,890 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1592644.6666666667, ans=0.125 2023-10-14 04:27:58,792 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1592644.6666666667, ans=0.125 2023-10-14 04:28:08,860 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1592691.3333333333, ans=0.1 2023-10-14 04:28:12,426 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1592691.3333333333, ans=0.1 2023-10-14 04:28:12,441 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1592691.3333333333, ans=0.0 2023-10-14 04:28:15,188 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1592738.0, ans=0.125 2023-10-14 04:28:16,466 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.max_abs, batch_count=1592738.0, ans=10.0 2023-10-14 04:28:25,677 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1592784.6666666667, ans=0.125 2023-10-14 04:28:34,907 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1592784.6666666667, ans=0.0 2023-10-14 04:28:36,689 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1592831.3333333333, ans=0.125 2023-10-14 04:28:52,397 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1592878.0, ans=0.0 2023-10-14 04:28:59,585 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.54 vs. 
limit=10.0 2023-10-14 04:29:14,648 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1593018.0, ans=0.125 2023-10-14 04:29:19,915 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1593018.0, ans=0.125 2023-10-14 04:30:00,983 INFO [train.py:1031] (3/4) Epoch 26, batch 0, loss[loss=0.1766, simple_loss=0.2731, pruned_loss=0.04006, over 16389.00 frames. ], tot_loss[loss=0.1766, simple_loss=0.2731, pruned_loss=0.04006, over 16389.00 frames. ], batch size: 44, lr: 1.33e-03, grad_scale: 32.0 2023-10-14 04:30:00,984 INFO [train.py:1054] (3/4) Computing validation loss 2023-10-14 04:30:09,225 INFO [train.py:1063] (3/4) Epoch 26, validation: loss=0.2137, simple_loss=0.3003, pruned_loss=0.06359, over 1020973.00 frames. 2023-10-14 04:30:09,226 INFO [train.py:1064] (3/4) Maximum memory allocated so far is 16953MB 2023-10-14 04:30:12,391 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.97 vs. limit=15.0 2023-10-14 04:30:13,077 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1593088.0, ans=0.0 2023-10-14 04:30:18,614 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1593088.0, ans=0.125 2023-10-14 04:30:19,391 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.524e+02 1.775e+02 1.928e+02 2.228e+02 3.655e+02, threshold=3.856e+02, percent-clipped=0.0 2023-10-14 04:30:22,236 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1593134.6666666667, ans=0.125 2023-10-14 04:30:23,247 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1593134.6666666667, ans=0.1 2023-10-14 04:30:23,360 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1593134.6666666667, ans=0.2 2023-10-14 04:30:28,521 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.18 vs. limit=22.5 2023-10-14 04:30:36,504 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1593181.3333333333, ans=0.125 2023-10-14 04:31:14,765 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1593321.3333333333, ans=0.125 2023-10-14 04:31:17,014 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1593368.0, ans=0.1 2023-10-14 04:31:17,033 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1593368.0, ans=0.09899494936611666 2023-10-14 04:31:22,353 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1593368.0, ans=0.0 2023-10-14 04:31:37,162 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.86 vs. 
limit=15.0 2023-10-14 04:31:43,494 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1593461.3333333333, ans=0.2 2023-10-14 04:32:00,005 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1593508.0, ans=0.0 2023-10-14 04:32:04,372 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.33 vs. limit=15.0 2023-10-14 04:32:12,309 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.485e+02 1.719e+02 1.848e+02 2.028e+02 2.741e+02, threshold=3.697e+02, percent-clipped=0.0 2023-10-14 04:32:19,814 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1593601.3333333333, ans=0.125 2023-10-14 04:32:22,299 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 04:32:26,325 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.74 vs. limit=6.0 2023-10-14 04:32:28,655 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1593648.0, ans=0.125 2023-10-14 04:32:40,079 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1593694.6666666667, ans=0.2 2023-10-14 04:32:45,924 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1593741.3333333333, ans=0.125 2023-10-14 04:33:07,198 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.00 vs. 
limit=15.0 2023-10-14 04:33:20,908 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1593881.3333333333, ans=0.0 2023-10-14 04:33:27,430 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1593928.0, ans=0.2 2023-10-14 04:33:30,209 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1593928.0, ans=0.125 2023-10-14 04:33:34,734 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1593928.0, ans=0.125 2023-10-14 04:33:45,962 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1593974.6666666667, ans=0.1 2023-10-14 04:33:53,643 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1594021.3333333333, ans=0.0 2023-10-14 04:33:59,163 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.539e+02 1.796e+02 1.982e+02 2.152e+02 2.667e+02, threshold=3.963e+02, percent-clipped=0.0 2023-10-14 04:34:23,078 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1594161.3333333333, ans=0.1 2023-10-14 04:34:26,670 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 04:34:48,075 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1594254.6666666667, ans=0.125 2023-10-14 04:34:51,765 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1594254.6666666667, ans=0.125 2023-10-14 04:35:41,124 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.17 vs. 
limit=15.0 2023-10-14 04:35:45,424 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1594488.0, ans=0.125 2023-10-14 04:35:51,278 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.548e+02 1.741e+02 1.949e+02 2.192e+02 3.714e+02, threshold=3.899e+02, percent-clipped=0.0 2023-10-14 04:36:24,031 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1594674.6666666667, ans=0.125 2023-10-14 04:36:27,406 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1594674.6666666667, ans=0.125 2023-10-14 04:36:29,647 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1594674.6666666667, ans=0.1 2023-10-14 04:36:36,879 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1594721.3333333333, ans=0.125 2023-10-14 04:36:43,511 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1594721.3333333333, ans=0.0 2023-10-14 04:37:00,619 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1594814.6666666667, ans=0.125 2023-10-14 04:37:04,444 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1594814.6666666667, ans=0.05 2023-10-14 04:37:10,437 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.38 vs. limit=10.0 2023-10-14 04:37:41,104 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.535e+02 1.775e+02 1.950e+02 2.231e+02 3.484e+02, threshold=3.900e+02, percent-clipped=0.0 2023-10-14 04:37:48,563 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1595001.3333333333, ans=0.125 2023-10-14 04:38:00,127 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.80 vs. limit=15.0 2023-10-14 04:38:17,045 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1595141.3333333333, ans=0.125 2023-10-14 04:38:27,039 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1595188.0, ans=0.125 2023-10-14 04:38:29,057 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1595188.0, ans=0.0 2023-10-14 04:38:41,900 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1595234.6666666667, ans=0.125 2023-10-14 04:38:44,274 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.68 vs. limit=15.0 2023-10-14 04:39:04,181 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.02 vs. 
limit=10.0 2023-10-14 04:39:15,085 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1595374.6666666667, ans=0.0 2023-10-14 04:39:22,880 INFO [train.py:1031] (3/4) Epoch 26, batch 500, loss[loss=0.2173, simple_loss=0.2926, pruned_loss=0.07104, over 15646.00 frames. ], tot_loss[loss=0.1871, simple_loss=0.2786, pruned_loss=0.04779, over 7238840.98 frames. ], batch size: 350, lr: 1.33e-03, grad_scale: 16.0 2023-10-14 04:39:27,222 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.97 vs. limit=22.5 2023-10-14 04:39:34,251 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.571e+02 1.792e+02 1.991e+02 2.238e+02 3.146e+02, threshold=3.983e+02, percent-clipped=0.0 2023-10-14 04:39:39,154 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1595468.0, ans=0.125 2023-10-14 04:39:41,658 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.33 vs. limit=6.0 2023-10-14 04:39:46,017 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1595514.6666666667, ans=0.0 2023-10-14 04:39:47,985 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1595514.6666666667, ans=0.2 2023-10-14 04:40:23,723 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.14 vs. limit=15.0 2023-10-14 04:40:36,533 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.21 vs. limit=10.0 2023-10-14 04:41:05,331 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.23 vs. limit=6.0 2023-10-14 04:41:07,481 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.33 vs. 
limit=15.0 2023-10-14 04:41:14,278 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1595888.0, ans=0.0 2023-10-14 04:41:19,154 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1595888.0, ans=0.1 2023-10-14 04:41:24,894 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.547e+02 1.890e+02 2.140e+02 2.348e+02 3.064e+02, threshold=4.280e+02, percent-clipped=0.0 2023-10-14 04:41:32,962 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1595934.6666666667, ans=0.125 2023-10-14 04:41:38,339 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1595981.3333333333, ans=0.0 2023-10-14 04:41:59,231 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1596074.6666666667, ans=0.2 2023-10-14 04:42:19,120 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1596168.0, ans=0.07 2023-10-14 04:42:20,053 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1596168.0, ans=0.125 2023-10-14 04:42:28,683 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1596214.6666666667, ans=0.125 2023-10-14 04:42:50,139 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1596308.0, ans=0.5 2023-10-14 04:43:00,633 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1596354.6666666667, ans=0.125 2023-10-14 04:43:04,107 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1596354.6666666667, ans=0.125 2023-10-14 04:43:08,468 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1596354.6666666667, ans=0.0 2023-10-14 04:43:11,485 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1596401.3333333333, ans=0.125 2023-10-14 04:43:12,213 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.469e+02 1.819e+02 1.986e+02 2.249e+02 2.926e+02, threshold=3.973e+02, percent-clipped=0.0 2023-10-14 04:43:52,558 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.55 vs. limit=15.0 2023-10-14 04:43:54,529 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1596541.3333333333, ans=0.0 2023-10-14 04:44:03,779 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.63 vs. 
limit=15.0 2023-10-14 04:44:15,445 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1596634.6666666667, ans=0.125 2023-10-14 04:44:15,458 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1596634.6666666667, ans=0.95 2023-10-14 04:44:22,777 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1596681.3333333333, ans=0.125 2023-10-14 04:44:38,109 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1596728.0, ans=0.125 2023-10-14 04:44:47,840 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.91 vs. limit=15.0 2023-10-14 04:45:04,103 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.433e+02 1.762e+02 1.887e+02 2.105e+02 3.657e+02, threshold=3.774e+02, percent-clipped=0.0 2023-10-14 04:45:14,112 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1596914.6666666667, ans=0.2 2023-10-14 04:45:15,340 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1596914.6666666667, ans=0.0 2023-10-14 04:45:41,781 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1597008.0, ans=0.0 2023-10-14 04:45:46,540 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1597008.0, ans=0.125 2023-10-14 04:45:57,062 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1597054.6666666667, ans=0.125 2023-10-14 04:46:11,375 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1597101.3333333333, ans=0.125 2023-10-14 04:46:24,876 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=1597194.6666666667, ans=0.025 2023-10-14 04:46:29,298 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1597194.6666666667, ans=0.0 2023-10-14 04:46:58,957 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.571e+02 1.844e+02 1.985e+02 2.266e+02 3.163e+02, threshold=3.969e+02, percent-clipped=0.0 2023-10-14 04:47:04,550 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.92 vs. limit=22.5 2023-10-14 04:47:35,355 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1597474.6666666667, ans=0.125 2023-10-14 04:47:40,109 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1597474.6666666667, ans=0.0 2023-10-14 04:47:49,472 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1597521.3333333333, ans=0.0 2023-10-14 04:48:38,341 INFO [train.py:1031] (3/4) Epoch 26, batch 1000, loss[loss=0.1782, simple_loss=0.2764, pruned_loss=0.04001, over 16910.00 frames. 
], tot_loss[loss=0.1875, simple_loss=0.2793, pruned_loss=0.04788, over 12872944.67 frames. ], batch size: 82, lr: 1.33e-03, grad_scale: 32.0 2023-10-14 04:48:38,804 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.whiten.whitening_limit, batch_count=1597754.6666666667, ans=12.0 2023-10-14 04:48:46,718 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1597754.6666666667, ans=0.125 2023-10-14 04:48:50,363 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1597801.3333333333, ans=0.0 2023-10-14 04:48:51,035 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.459e+02 1.721e+02 1.930e+02 2.104e+02 3.145e+02, threshold=3.860e+02, percent-clipped=0.0 2023-10-14 04:49:08,728 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1597848.0, ans=0.05 2023-10-14 04:49:13,369 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1597894.6666666667, ans=0.125 2023-10-14 04:49:16,074 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1597894.6666666667, ans=0.1 2023-10-14 04:49:22,476 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1597941.3333333333, ans=0.035 2023-10-14 04:49:36,068 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.58 vs. limit=15.0 2023-10-14 04:49:40,731 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1598034.6666666667, ans=0.125 2023-10-14 04:49:42,491 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1598034.6666666667, ans=0.125 2023-10-14 04:49:44,360 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1598034.6666666667, ans=0.125 2023-10-14 04:49:44,381 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1598034.6666666667, ans=0.2 2023-10-14 04:50:02,403 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1598128.0, ans=0.0 2023-10-14 04:50:15,121 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1598174.6666666667, ans=0.125 2023-10-14 04:50:23,222 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1598221.3333333333, ans=0.125 2023-10-14 04:50:37,107 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.535e+02 1.797e+02 1.957e+02 2.116e+02 2.863e+02, threshold=3.915e+02, percent-clipped=0.0 2023-10-14 04:50:39,184 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.72 vs. 
limit=12.0 2023-10-14 04:50:58,116 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1598314.6666666667, ans=0.125 2023-10-14 04:51:00,906 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1598361.3333333333, ans=0.125 2023-10-14 04:51:11,408 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1598408.0, ans=0.125 2023-10-14 04:51:14,904 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1598408.0, ans=0.125 2023-10-14 04:51:15,937 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1598408.0, ans=0.0 2023-10-14 04:51:23,829 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1598454.6666666667, ans=0.125 2023-10-14 04:51:36,641 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1598501.3333333333, ans=0.125 2023-10-14 04:52:26,490 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1598688.0, ans=0.2 2023-10-14 04:52:34,784 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.529e+02 1.745e+02 1.887e+02 2.132e+02 3.249e+02, threshold=3.774e+02, percent-clipped=0.0 2023-10-14 04:52:35,041 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1598734.6666666667, ans=0.125 2023-10-14 04:52:41,005 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1598734.6666666667, ans=0.2 2023-10-14 04:52:49,373 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.55 vs. limit=22.5 2023-10-14 04:53:06,822 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1598874.6666666667, ans=0.0 2023-10-14 04:53:09,423 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1598874.6666666667, ans=0.125 2023-10-14 04:53:42,563 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1599014.6666666667, ans=0.0 2023-10-14 04:53:50,607 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1599061.3333333333, ans=0.125 2023-10-14 04:54:01,841 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1599108.0, ans=0.1 2023-10-14 04:54:10,127 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1599108.0, ans=0.07 2023-10-14 04:54:18,741 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.07 vs. 
limit=22.5 2023-10-14 04:54:25,057 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.484e+02 1.748e+02 1.911e+02 2.112e+02 3.133e+02, threshold=3.822e+02, percent-clipped=0.0 2023-10-14 04:54:47,423 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1599294.6666666667, ans=0.0 2023-10-14 04:54:54,215 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1599294.6666666667, ans=0.125 2023-10-14 04:54:55,057 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1599341.3333333333, ans=0.2 2023-10-14 04:55:01,938 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1599341.3333333333, ans=0.0 2023-10-14 04:55:05,202 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.02 vs. limit=15.0 2023-10-14 04:55:09,373 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.86 vs. limit=15.0 2023-10-14 04:55:16,106 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.36 vs. limit=12.0 2023-10-14 04:55:20,444 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=7.29 vs. limit=15.0 2023-10-14 04:55:33,330 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1599481.3333333333, ans=0.125 2023-10-14 04:55:34,665 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1599481.3333333333, ans=0.1 2023-10-14 04:55:50,886 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1599574.6666666667, ans=0.2 2023-10-14 04:55:58,007 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 04:56:00,504 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=1599621.3333333333, ans=0.05 2023-10-14 04:56:16,963 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.533e+02 1.724e+02 1.860e+02 2.086e+02 3.127e+02, threshold=3.719e+02, percent-clipped=0.0 2023-10-14 04:56:30,709 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.29 vs. limit=15.0 2023-10-14 04:56:30,774 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.03 vs. 
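Note on the scaling.py "Whitening" records (metric=X vs. limit=Y): they compare a whiteness statistic of a module's output covariance against a scheduled limit, and the corrective gradient activates only while the metric exceeds the limit, which is why most records here hover at or below it. The statistic below, E[lambda^2] / (E[lambda])^2 over covariance eigenvalues (exactly 1.0 for a perfectly white spectrum), is an illustrative proxy chosen for this note; the exact formula in scaling.py may differ.

    import torch

    def whiteness_metric(x: torch.Tensor, num_groups: int = 1) -> float:
        """x: (frames, channels). Anisotropy score >= 1.0, averaged over
        channel groups; larger means a more lopsided eigenvalue spectrum."""
        frames, channels = x.shape
        group = channels // num_groups
        scores = []
        for g in range(num_groups):
            xg = x[:, g * group:(g + 1) * group]
            cov = (xg.T @ xg) / frames
            eigs = torch.linalg.eigvalsh(cov).clamp(min=1e-10)
            scores.append(((eigs ** 2).mean() / eigs.mean() ** 2).item())
        return sum(scores) / len(scores)
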
limit=10.0 2023-10-14 04:56:35,853 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1599761.3333333333, ans=0.2 2023-10-14 04:56:55,217 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1599808.0, ans=0.0 2023-10-14 04:57:00,592 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1599854.6666666667, ans=0.125 2023-10-14 04:57:05,899 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1599854.6666666667, ans=0.2 2023-10-14 04:57:06,631 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1599854.6666666667, ans=0.125 2023-10-14 04:57:33,761 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.54 vs. limit=22.5 2023-10-14 04:57:40,526 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.06 vs. limit=15.0 2023-10-14 04:57:57,939 INFO [train.py:1031] (3/4) Epoch 26, batch 1500, loss[loss=0.1756, simple_loss=0.2667, pruned_loss=0.04229, over 16092.00 frames. ], tot_loss[loss=0.1861, simple_loss=0.278, pruned_loss=0.04716, over 17272410.82 frames. ], batch size: 43, lr: 1.33e-03, grad_scale: 16.0 2023-10-14 04:58:02,904 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.64 vs. limit=15.0 2023-10-14 04:58:07,851 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1600088.0, ans=0.1 2023-10-14 04:58:13,463 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.526e+02 1.839e+02 1.976e+02 2.250e+02 2.759e+02, threshold=3.952e+02, percent-clipped=0.0 2023-10-14 04:58:14,010 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.37 vs. limit=15.0 2023-10-14 04:58:18,619 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.88 vs. limit=12.0 2023-10-14 04:58:20,086 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1600181.3333333333, ans=0.07 2023-10-14 04:58:25,082 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1600181.3333333333, ans=0.125 2023-10-14 04:58:27,997 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1600181.3333333333, ans=0.0 2023-10-14 04:58:36,902 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1600228.0, ans=0.125 2023-10-14 04:58:40,652 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1600228.0, ans=0.2 2023-10-14 04:59:01,656 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.60 vs. 
limit=15.0 2023-10-14 04:59:20,762 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1600414.6666666667, ans=0.0 2023-10-14 04:59:33,971 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1600461.3333333333, ans=0.1 2023-10-14 04:59:36,092 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1600461.3333333333, ans=0.0 2023-10-14 04:59:40,774 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1600508.0, ans=0.125 2023-10-14 04:59:54,530 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=10.72 vs. limit=22.5 2023-10-14 05:00:04,953 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.438e+02 1.759e+02 1.874e+02 2.075e+02 2.983e+02, threshold=3.749e+02, percent-clipped=0.0 2023-10-14 05:00:09,579 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1600601.3333333333, ans=0.125 2023-10-14 05:00:16,622 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1600648.0, ans=0.0 2023-10-14 05:00:25,799 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1600694.6666666667, ans=0.125 2023-10-14 05:00:47,046 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 05:00:50,702 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1600741.3333333333, ans=0.09899494936611666 2023-10-14 05:01:11,867 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1600834.6666666667, ans=0.1 2023-10-14 05:01:52,241 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.60 vs. limit=15.0 2023-10-14 05:02:02,517 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.499e+02 1.805e+02 1.959e+02 2.213e+02 2.892e+02, threshold=3.918e+02, percent-clipped=0.0 2023-10-14 05:02:30,798 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1601208.0, ans=0.2 2023-10-14 05:02:42,355 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1601254.6666666667, ans=0.2 2023-10-14 05:02:43,414 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1601254.6666666667, ans=0.1 2023-10-14 05:03:01,695 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1601301.3333333333, ans=0.125 2023-10-14 05:03:05,626 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1601348.0, ans=0.2 2023-10-14 05:03:14,179 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.69 vs. 
limit=15.0 2023-10-14 05:03:20,752 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1601394.6666666667, ans=0.035 2023-10-14 05:03:34,937 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1601441.3333333333, ans=0.125 2023-10-14 05:03:35,250 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.64 vs. limit=12.0 2023-10-14 05:03:56,741 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1601534.6666666667, ans=0.1 2023-10-14 05:04:01,012 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.554e+02 1.809e+02 1.938e+02 2.187e+02 3.176e+02, threshold=3.876e+02, percent-clipped=0.0 2023-10-14 05:04:04,002 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1601534.6666666667, ans=0.125 2023-10-14 05:04:35,532 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1601674.6666666667, ans=0.0 2023-10-14 05:04:48,717 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1601721.3333333333, ans=0.125 2023-10-14 05:05:10,699 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1601814.6666666667, ans=0.1 2023-10-14 05:05:31,170 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.91 vs. limit=22.5 2023-10-14 05:05:34,641 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1601908.0, ans=0.125 2023-10-14 05:05:58,288 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.513e+02 1.829e+02 2.068e+02 2.266e+02 3.103e+02, threshold=4.136e+02, percent-clipped=0.0 2023-10-14 05:05:59,590 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1602001.3333333333, ans=0.0 2023-10-14 05:06:01,873 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1602001.3333333333, ans=0.125 2023-10-14 05:06:03,583 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1602001.3333333333, ans=0.1 2023-10-14 05:06:20,844 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1602094.6666666667, ans=0.0 2023-10-14 05:06:21,282 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.83 vs. 
limit=22.5 2023-10-14 05:06:28,515 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1602094.6666666667, ans=0.1 2023-10-14 05:06:48,483 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1602188.0, ans=10.0 2023-10-14 05:07:23,030 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1602281.3333333333, ans=0.0 2023-10-14 05:07:27,307 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1602328.0, ans=0.1 2023-10-14 05:07:40,655 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1602374.6666666667, ans=0.125 2023-10-14 05:07:42,689 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1602374.6666666667, ans=0.0 2023-10-14 05:07:54,759 INFO [train.py:1031] (3/4) Epoch 26, batch 2000, loss[loss=0.1795, simple_loss=0.2833, pruned_loss=0.03781, over 16657.00 frames. ], tot_loss[loss=0.1868, simple_loss=0.2787, pruned_loss=0.04748, over 20679059.01 frames. ], batch size: 202, lr: 1.33e-03, grad_scale: 32.0 2023-10-14 05:08:10,939 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.499e+02 1.760e+02 1.951e+02 2.176e+02 3.927e+02, threshold=3.901e+02, percent-clipped=0.0 2023-10-14 05:08:13,628 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1602468.0, ans=0.2 2023-10-14 05:08:20,691 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1602514.6666666667, ans=0.125 2023-10-14 05:08:51,383 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1602608.0, ans=0.125 2023-10-14 05:08:57,085 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1602608.0, ans=0.125 2023-10-14 05:09:00,394 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1602608.0, ans=0.1 2023-10-14 05:09:19,313 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1602701.3333333333, ans=0.0 2023-10-14 05:10:03,350 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1602841.3333333333, ans=0.125 2023-10-14 05:10:10,928 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1602888.0, ans=0.125 2023-10-14 05:10:34,342 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.498e+02 1.768e+02 1.956e+02 2.229e+02 3.228e+02, threshold=3.913e+02, percent-clipped=0.0 2023-10-14 05:10:41,427 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1602934.6666666667, ans=0.0 2023-10-14 05:10:53,576 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.70 vs. 
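Note on the ubiquitous "ScheduledFloat" records: the named quantities (conv_skip_rate, balancer prob/min_positive/max_abs, feed-forward dropout_p, bypass scale_min, whitening limits, and so on) are hyperparameters whose values are looked up from a schedule keyed on the global batch count, and the log prints the current value as ans=... . A piecewise-linear schedule such as the sketch below reproduces that behaviour; the class name and breakpoints are invented for illustration. By this point in training (batch_count around 1.6e6) most skip rates have decayed to their final values, hence the many ans=0.0 lines.

    class PiecewiseLinearFloat:
        """A float hyperparameter interpolated against the global batch count."""
        def __init__(self, *points):
            self.points = sorted(points)  # (batch_count, value) pairs

        def __call__(self, batch_count: float) -> float:
            pts = self.points
            if batch_count <= pts[0][0]:
                return pts[0][1]
            if batch_count >= pts[-1][0]:
                return pts[-1][1]
            for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
                if x0 <= batch_count <= x1:
                    t = (batch_count - x0) / (x1 - x0)
                    return y0 + t * (y1 - y0)

    conv_skip_rate = PiecewiseLinearFloat((0.0, 0.2), (20000.0, 0.05), (50000.0, 0.0))
    print(conv_skip_rate(1595888.0))  # -> 0.0, matching the "ans=0.0" records
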
limit=15.0 2023-10-14 05:12:20,693 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1603261.3333333333, ans=0.1 2023-10-14 05:12:35,671 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.19 vs. limit=22.5 2023-10-14 05:12:40,582 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1603354.6666666667, ans=0.125 2023-10-14 05:12:45,229 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1603354.6666666667, ans=0.0 2023-10-14 05:12:47,651 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1603401.3333333333, ans=0.1 2023-10-14 05:12:50,924 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.444e+02 1.848e+02 2.000e+02 2.217e+02 3.263e+02, threshold=4.000e+02, percent-clipped=0.0 2023-10-14 05:13:04,072 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1603448.0, ans=0.125 2023-10-14 05:13:09,814 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1603494.6666666667, ans=0.125 2023-10-14 05:13:30,083 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.73 vs. limit=10.0 2023-10-14 05:14:04,911 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.14 vs. 
limit=22.5 2023-10-14 05:14:14,546 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1603728.0, ans=0.125 2023-10-14 05:14:20,988 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1603774.6666666667, ans=0.2 2023-10-14 05:14:21,779 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1603774.6666666667, ans=0.05 2023-10-14 05:14:35,002 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1603821.3333333333, ans=0.0 2023-10-14 05:14:47,906 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.449e+02 1.852e+02 1.963e+02 2.145e+02 2.907e+02, threshold=3.926e+02, percent-clipped=0.0 2023-10-14 05:15:00,319 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1603914.6666666667, ans=0.0 2023-10-14 05:15:08,274 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1603961.3333333333, ans=0.04949747468305833 2023-10-14 05:15:14,844 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1603961.3333333333, ans=0.0 2023-10-14 05:15:17,398 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 05:15:28,472 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1604054.6666666667, ans=0.125 2023-10-14 05:15:29,476 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1604054.6666666667, ans=0.125 2023-10-14 05:15:38,692 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1604054.6666666667, ans=0.0 2023-10-14 05:15:44,998 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1604101.3333333333, ans=0.0 2023-10-14 05:15:47,521 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.87 vs. limit=15.0 2023-10-14 05:15:54,939 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.30 vs. limit=15.0 2023-10-14 05:15:59,473 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1604148.0, ans=0.07 2023-10-14 05:16:03,794 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.89 vs. limit=6.0 2023-10-14 05:16:04,825 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.89 vs. limit=22.5 2023-10-14 05:16:10,353 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.19 vs. limit=22.5 2023-10-14 05:16:30,535 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.47 vs. 
limit=15.0 2023-10-14 05:16:31,958 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1604288.0, ans=0.0 2023-10-14 05:16:40,181 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.493e+02 1.800e+02 1.946e+02 2.111e+02 3.235e+02, threshold=3.892e+02, percent-clipped=0.0 2023-10-14 05:16:47,632 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.96 vs. limit=10.0 2023-10-14 05:16:59,932 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1604428.0, ans=0.1 2023-10-14 05:17:09,153 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1604474.6666666667, ans=0.0 2023-10-14 05:17:29,058 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1604521.3333333333, ans=0.0 2023-10-14 05:17:31,910 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1604568.0, ans=0.125 2023-10-14 05:17:44,585 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1604614.6666666667, ans=0.035 2023-10-14 05:18:02,716 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1604661.3333333333, ans=0.0 2023-10-14 05:18:15,728 INFO [train.py:1031] (3/4) Epoch 26, batch 2500, loss[loss=0.179, simple_loss=0.2763, pruned_loss=0.04082, over 16927.00 frames. ], tot_loss[loss=0.1872, simple_loss=0.2791, pruned_loss=0.04761, over 23359774.38 frames. ], batch size: 123, lr: 1.33e-03, grad_scale: 32.0 2023-10-14 05:18:26,619 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1604801.3333333333, ans=0.125 2023-10-14 05:18:29,923 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.550e+02 1.779e+02 1.962e+02 2.164e+02 2.703e+02, threshold=3.923e+02, percent-clipped=0.0 2023-10-14 05:18:41,287 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.02 vs. limit=10.0 2023-10-14 05:18:43,770 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1604848.0, ans=0.125 2023-10-14 05:18:49,224 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1604894.6666666667, ans=0.09899494936611666 2023-10-14 05:18:55,570 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1604894.6666666667, ans=0.0 2023-10-14 05:18:56,465 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1604894.6666666667, ans=0.125 2023-10-14 05:19:02,051 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1604941.3333333333, ans=0.1 2023-10-14 05:19:10,339 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=4.51 vs. 
limit=15.0 2023-10-14 05:19:15,675 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1604988.0, ans=0.0 2023-10-14 05:19:45,104 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1605128.0, ans=0.2 2023-10-14 05:19:50,937 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1605128.0, ans=0.1 2023-10-14 05:19:55,794 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=7.64 vs. limit=15.0 2023-10-14 05:19:59,547 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1605174.6666666667, ans=0.125 2023-10-14 05:20:21,280 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.485e+02 1.857e+02 2.016e+02 2.266e+02 3.448e+02, threshold=4.031e+02, percent-clipped=0.0 2023-10-14 05:20:38,111 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1605314.6666666667, ans=0.1 2023-10-14 05:20:45,409 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1605361.3333333333, ans=0.125 2023-10-14 05:20:51,929 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1605361.3333333333, ans=0.1 2023-10-14 05:21:07,069 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1605454.6666666667, ans=0.0 2023-10-14 05:21:14,114 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.89 vs. limit=15.0 2023-10-14 05:21:14,149 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.31 vs. 
limit=15.0 2023-10-14 05:21:27,719 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1605548.0, ans=0.0 2023-10-14 05:21:41,296 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1605594.6666666667, ans=0.0 2023-10-14 05:21:52,514 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1605641.3333333333, ans=0.0 2023-10-14 05:21:56,537 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1605641.3333333333, ans=0.0 2023-10-14 05:22:16,140 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1605734.6666666667, ans=0.0 2023-10-14 05:22:22,540 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1605734.6666666667, ans=0.125 2023-10-14 05:22:22,892 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=1605734.6666666667, ans=15.0 2023-10-14 05:22:23,237 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.518e+02 1.803e+02 1.932e+02 2.177e+02 3.401e+02, threshold=3.865e+02, percent-clipped=0.0 2023-10-14 05:22:34,449 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1605781.3333333333, ans=0.2 2023-10-14 05:22:40,277 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1605828.0, ans=0.125 2023-10-14 05:22:41,289 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1605828.0, ans=0.1 2023-10-14 05:23:10,417 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1605921.3333333333, ans=0.0 2023-10-14 05:23:33,154 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1606014.6666666667, ans=0.125 2023-10-14 05:23:50,207 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.51 vs. limit=15.0 2023-10-14 05:23:53,147 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1606061.3333333333, ans=0.0 2023-10-14 05:24:01,188 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1606108.0, ans=0.0 2023-10-14 05:24:26,326 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.431e+02 1.762e+02 1.984e+02 2.143e+02 2.899e+02, threshold=3.968e+02, percent-clipped=0.0 2023-10-14 05:24:49,458 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.91 vs. 
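Note on the occasional "WithLoss: name=..., loss-sum=0.000e+00" records: they report a named auxiliary penalty attached to a tensor (here, per-layer attention weights) that is summed into the training objective and logged per interval; loss-sum=0.0 means the penalty never activated over that interval. The wrapper below is a minimal sketch of the pattern under invented names and an arbitrary penalty rule, not the scaling.py implementation.

    import torch

    class WithAuxLoss(torch.nn.Module):
        """Identity wrapper: passes its input through unchanged while
        computing a named auxiliary penalty for logging and for adding
        to the objective."""
        def __init__(self, name: str, log: dict):
            super().__init__()
            self.name, self.log = name, log
            self.last_penalty = torch.tensor(0.0)

        def forward(self, attn_weights: torch.Tensor) -> torch.Tensor:
            # Arbitrary illustrative rule: penalize weights with |w| > 1.
            penalty = (attn_weights.abs() - 1.0).clamp(min=0.0).sum()
            self.log[self.name] = self.log.get(self.name, 0.0) + float(penalty)
            self.last_penalty = penalty  # caller adds this to the loss
            return attn_weights
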
limit=12.0 2023-10-14 05:25:02,699 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1606341.3333333333, ans=0.2 2023-10-14 05:25:04,247 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1606341.3333333333, ans=0.125 2023-10-14 05:25:27,027 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1606434.6666666667, ans=0.0 2023-10-14 05:25:43,940 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 05:25:48,767 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1606481.3333333333, ans=0.0 2023-10-14 05:25:52,728 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1606528.0, ans=0.125 2023-10-14 05:26:17,631 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1606574.6666666667, ans=0.125 2023-10-14 05:26:35,216 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1606668.0, ans=0.5 2023-10-14 05:26:39,669 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.443e+02 1.826e+02 1.972e+02 2.150e+02 3.040e+02, threshold=3.944e+02, percent-clipped=0.0 2023-10-14 05:26:47,940 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.34 vs. limit=15.0 2023-10-14 05:26:53,293 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1606714.6666666667, ans=0.0 2023-10-14 05:27:17,668 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.23 vs. limit=12.0 2023-10-14 05:27:32,535 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.89 vs. limit=15.0 2023-10-14 05:27:41,947 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1606948.0, ans=0.035 2023-10-14 05:27:59,827 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1606994.6666666667, ans=0.1 2023-10-14 05:28:05,525 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.02 vs. limit=15.0 2023-10-14 05:28:13,057 INFO [train.py:1031] (3/4) Epoch 26, batch 3000, loss[loss=0.2213, simple_loss=0.2904, pruned_loss=0.07609, over 15667.00 frames. ], tot_loss[loss=0.1867, simple_loss=0.2785, pruned_loss=0.04749, over 25451085.33 frames. 
], batch size: 350, lr: 1.33e-03, grad_scale: 16.0 2023-10-14 05:28:18,724 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1607088.0, ans=0.0 2023-10-14 05:28:25,954 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1607134.6666666667, ans=0.125 2023-10-14 05:28:30,253 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.528e+02 1.809e+02 1.948e+02 2.228e+02 4.084e+02, threshold=3.896e+02, percent-clipped=1.0 2023-10-14 05:29:36,759 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1607414.6666666667, ans=0.05 2023-10-14 05:29:56,344 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1607508.0, ans=0.1 2023-10-14 05:30:02,222 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1607508.0, ans=0.1 2023-10-14 05:30:07,508 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.39 vs. limit=15.0 2023-10-14 05:30:26,357 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.592e+02 1.843e+02 1.966e+02 2.171e+02 2.891e+02, threshold=3.932e+02, percent-clipped=0.0 2023-10-14 05:30:29,302 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1607601.3333333333, ans=0.0 2023-10-14 05:30:31,239 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1607648.0, ans=0.0 2023-10-14 05:30:45,129 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1607694.6666666667, ans=0.0 2023-10-14 05:31:05,133 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1607788.0, ans=0.2 2023-10-14 05:31:23,708 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1607834.6666666667, ans=0.0 2023-10-14 05:31:34,681 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1607881.3333333333, ans=0.125 2023-10-14 05:31:36,753 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=11.96 vs. limit=22.5 2023-10-14 05:32:02,435 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1608021.3333333333, ans=0.1 2023-10-14 05:32:19,223 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.543e+02 1.806e+02 1.986e+02 2.217e+02 2.889e+02, threshold=3.971e+02, percent-clipped=0.0 2023-10-14 05:32:21,691 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.51 vs. 
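Note on the train.py batch records (loss / simple_loss / pruned_loss): in pruned-transducer training the printed loss is a weighted combination of a cheap "simple" joiner loss and the pruned full-joiner loss, and tot_loss[... over N frames] is a frame-weighted running average across recent batches. The helpers below sketch both pieces of bookkeeping; the weights and decay constant are placeholders, not the recipe's actual scales.

    def combine_losses(simple_loss: float, pruned_loss: float,
                       simple_scale: float = 0.5, pruned_scale: float = 1.0) -> float:
        """Weighted sum of the two transducer losses (placeholder weights)."""
        return simple_scale * simple_loss + pruned_scale * pruned_loss

    class FrameWeightedAverage:
        """Tracks tot_loss[... over N frames] as a decayed, frame-weighted mean."""
        def __init__(self, decay: float = 0.999):
            self.decay, self.weighted_sum, self.frames = decay, 0.0, 0.0

        def update(self, loss_value: float, num_frames: float) -> None:
            self.weighted_sum = self.decay * self.weighted_sum + loss_value * num_frames
            self.frames = self.decay * self.frames + num_frames

        @property
        def value(self) -> float:
            return self.weighted_sum / max(self.frames, 1.0)
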
limit=15.0 2023-10-14 05:32:43,947 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1608161.3333333333, ans=0.125 2023-10-14 05:32:46,204 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.99 vs. limit=15.0 2023-10-14 05:32:58,014 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 05:33:23,002 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1608301.3333333333, ans=0.1 2023-10-14 05:33:31,774 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1608348.0, ans=0.1 2023-10-14 05:33:33,362 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.44 vs. limit=15.0 2023-10-14 05:34:05,775 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.48 vs. limit=15.0 2023-10-14 05:34:06,336 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1608488.0, ans=0.125 2023-10-14 05:34:10,619 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1608488.0, ans=0.04949747468305833 2023-10-14 05:34:12,557 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1608488.0, ans=0.2 2023-10-14 05:34:21,063 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.476e+02 1.802e+02 1.912e+02 2.095e+02 3.783e+02, threshold=3.823e+02, percent-clipped=0.0 2023-10-14 05:34:24,940 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1608534.6666666667, ans=0.0 2023-10-14 05:34:45,448 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1608628.0, ans=0.2 2023-10-14 05:34:53,315 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.48 vs. 
limit=15.0 2023-10-14 05:35:31,922 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1608814.6666666667, ans=0.125 2023-10-14 05:35:32,964 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1608861.3333333333, ans=0.1 2023-10-14 05:35:53,263 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1608908.0, ans=0.125 2023-10-14 05:35:55,855 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1608908.0, ans=0.0 2023-10-14 05:36:15,962 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1609001.3333333333, ans=0.0 2023-10-14 05:36:16,023 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 05:36:16,549 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.491e+02 1.905e+02 2.066e+02 2.263e+02 3.148e+02, threshold=4.132e+02, percent-clipped=0.0 2023-10-14 05:36:20,217 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 05:36:28,799 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1609048.0, ans=0.95 2023-10-14 05:36:29,780 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1609048.0, ans=0.07 2023-10-14 05:36:36,031 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1609094.6666666667, ans=0.0 2023-10-14 05:36:36,058 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1609094.6666666667, ans=0.125 2023-10-14 05:37:17,714 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1609281.3333333333, ans=0.0 2023-10-14 05:37:28,809 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1609328.0, ans=0.125 2023-10-14 05:37:33,567 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1609328.0, ans=0.125 2023-10-14 05:37:36,135 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1609328.0, ans=0.0 2023-10-14 05:37:37,821 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1609374.6666666667, ans=0.125 2023-10-14 05:37:50,208 INFO [train.py:1031] (3/4) Epoch 26, batch 3500, loss[loss=0.2022, simple_loss=0.2997, pruned_loss=0.05237, over 16874.00 frames. ], tot_loss[loss=0.1869, simple_loss=0.2784, pruned_loss=0.04764, over 27046809.77 frames. 
], batch size: 188, lr: 1.33e-03, grad_scale: 16.0 2023-10-14 05:37:59,123 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=1609421.3333333333, ans=0.05 2023-10-14 05:38:08,241 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1609468.0, ans=0.0 2023-10-14 05:38:08,912 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.596e+02 1.802e+02 1.970e+02 2.146e+02 3.005e+02, threshold=3.940e+02, percent-clipped=0.0 2023-10-14 05:38:22,046 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.46 vs. limit=22.5 2023-10-14 05:38:24,942 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1609561.3333333333, ans=0.2 2023-10-14 05:38:33,004 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1609561.3333333333, ans=0.1 2023-10-14 05:38:35,127 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1609608.0, ans=0.0 2023-10-14 05:38:54,206 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1609654.6666666667, ans=0.125 2023-10-14 05:39:02,899 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1609701.3333333333, ans=0.2 2023-10-14 05:39:30,069 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1609794.6666666667, ans=0.0 2023-10-14 05:39:38,394 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1609841.3333333333, ans=0.0 2023-10-14 05:39:57,309 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.64 vs. 
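Note on the grad_scale field that alternates between values such as 16.0 and 32.0 across the batch records: fp16 training uses a dynamic loss scaler that halves the scale whenever a step overflows and doubles it again after a run of clean steps. The step below shows the standard PyTorch AMP pattern that produces exactly this behaviour; the model, optimizer, batch layout and loss_fn are placeholders.

    import torch

    scaler = torch.cuda.amp.GradScaler(init_scale=32.0, growth_factor=2.0,
                                       backoff_factor=0.5, growth_interval=2000)

    def train_step(model, optimizer, batch, loss_fn):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = loss_fn(model(batch["inputs"]), batch["targets"])
        scaler.scale(loss).backward()  # backward through the scaled loss
        scaler.step(optimizer)         # skips the update if grads overflowed
        scaler.update()                # halve on overflow, grow periodically
        return scaler.get_scale()      # the value logged as grad_scale
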
limit=15.0 2023-10-14 05:40:05,147 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1609934.6666666667, ans=0.1 2023-10-14 05:40:08,244 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.481e+02 1.826e+02 1.981e+02 2.215e+02 2.880e+02, threshold=3.962e+02, percent-clipped=0.0 2023-10-14 05:40:22,503 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1609981.3333333333, ans=0.125 2023-10-14 05:40:32,634 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1610028.0, ans=0.1 2023-10-14 05:40:41,115 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1610074.6666666667, ans=0.125 2023-10-14 05:40:46,542 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 05:40:47,540 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1610121.3333333333, ans=0.125 2023-10-14 05:40:49,150 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1610121.3333333333, ans=0.1 2023-10-14 05:40:54,120 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1610121.3333333333, ans=0.09899494936611666 2023-10-14 05:41:15,829 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1610214.6666666667, ans=0.125 2023-10-14 05:41:27,030 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1610261.3333333333, ans=0.125 2023-10-14 05:41:42,211 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1610354.6666666667, ans=0.125 2023-10-14 05:41:49,920 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1610354.6666666667, ans=0.125 2023-10-14 05:42:01,650 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.480e+02 1.756e+02 1.897e+02 2.113e+02 2.615e+02, threshold=3.794e+02, percent-clipped=0.0 2023-10-14 05:42:20,468 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1610494.6666666667, ans=0.0 2023-10-14 05:42:28,657 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1610494.6666666667, ans=0.125 2023-10-14 05:42:34,732 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1610541.3333333333, ans=0.125 2023-10-14 05:42:36,861 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1610541.3333333333, ans=0.125 2023-10-14 05:42:56,811 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1610634.6666666667, ans=0.1 2023-10-14 05:43:10,965 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1610681.3333333333, 
ans=0.0 2023-10-14 05:43:12,289 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1610681.3333333333, ans=0.0 2023-10-14 05:43:31,095 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.97 vs. limit=22.5 2023-10-14 05:43:40,028 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1610774.6666666667, ans=0.125 2023-10-14 05:43:48,327 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1610821.3333333333, ans=0.1 2023-10-14 05:43:48,342 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1610821.3333333333, ans=0.125 2023-10-14 05:43:48,421 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1610821.3333333333, ans=0.0 2023-10-14 05:43:52,226 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1610821.3333333333, ans=0.125 2023-10-14 05:44:00,604 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1610868.0, ans=0.035 2023-10-14 05:44:03,128 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.496e+02 1.794e+02 1.991e+02 2.183e+02 3.178e+02, threshold=3.981e+02, percent-clipped=0.0 2023-10-14 05:44:52,320 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1611101.3333333333, ans=0.2 2023-10-14 05:45:19,702 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1611194.6666666667, ans=0.125 2023-10-14 05:45:20,073 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.12 vs. limit=12.0 2023-10-14 05:45:34,029 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1611241.3333333333, ans=0.125 2023-10-14 05:45:38,093 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1611288.0, ans=0.04949747468305833 2023-10-14 05:45:45,960 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.05 vs. limit=15.0 2023-10-14 05:45:53,248 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.518e+02 1.692e+02 1.840e+02 2.064e+02 2.845e+02, threshold=3.680e+02, percent-clipped=0.0 2023-10-14 05:45:56,888 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1611381.3333333333, ans=0.125 2023-10-14 05:45:58,666 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.27 vs. limit=22.5 2023-10-14 05:46:00,435 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.68 vs. 
limit=15.0 2023-10-14 05:46:15,353 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1611428.0, ans=0.0 2023-10-14 05:46:17,228 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.99 vs. limit=15.0 2023-10-14 05:46:21,889 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.51 vs. limit=15.0 2023-10-14 05:46:23,475 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1611474.6666666667, ans=0.125 2023-10-14 05:46:28,894 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1611474.6666666667, ans=0.125 2023-10-14 05:46:35,466 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1611521.3333333333, ans=0.125 2023-10-14 05:47:01,532 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1611614.6666666667, ans=0.125 2023-10-14 05:47:19,843 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1611708.0, ans=0.125 2023-10-14 05:47:28,128 INFO [train.py:1031] (3/4) Epoch 26, batch 4000, loss[loss=0.1762, simple_loss=0.273, pruned_loss=0.03973, over 16824.00 frames. ], tot_loss[loss=0.1867, simple_loss=0.2782, pruned_loss=0.04757, over 28322180.69 frames. ], batch size: 175, lr: 1.33e-03, grad_scale: 32.0 2023-10-14 05:47:49,497 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.558e+02 1.814e+02 1.963e+02 2.108e+02 3.118e+02, threshold=3.927e+02, percent-clipped=0.0 2023-10-14 05:47:59,043 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.46 vs. limit=15.0 2023-10-14 05:47:59,049 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.54 vs. limit=15.0 2023-10-14 05:48:13,387 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.05 vs. limit=15.0 2023-10-14 05:48:13,437 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.92 vs. limit=15.0 2023-10-14 05:48:16,093 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.14 vs. 
limit=15.0 2023-10-14 05:48:18,704 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1611941.3333333333, ans=0.125 2023-10-14 05:48:23,960 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1611941.3333333333, ans=0.2 2023-10-14 05:48:27,381 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1611988.0, ans=0.0 2023-10-14 05:48:29,290 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1611988.0, ans=0.1 2023-10-14 05:48:33,049 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1611988.0, ans=0.125 2023-10-14 05:48:38,224 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1612034.6666666667, ans=0.2 2023-10-14 05:48:40,788 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1612034.6666666667, ans=0.125 2023-10-14 05:48:53,433 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1612081.3333333333, ans=0.125 2023-10-14 05:49:06,016 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1612128.0, ans=0.0 2023-10-14 05:49:16,577 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.76 vs. limit=6.0 2023-10-14 05:49:31,312 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=1612268.0, ans=0.05 2023-10-14 05:49:37,549 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.00 vs. limit=12.0 2023-10-14 05:49:39,593 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.542e+02 1.852e+02 1.947e+02 2.087e+02 2.679e+02, threshold=3.894e+02, percent-clipped=0.0 2023-10-14 05:49:45,897 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1612314.6666666667, ans=0.125 2023-10-14 05:50:02,685 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.60 vs. limit=12.0 2023-10-14 05:50:44,649 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1612548.0, ans=0.0 2023-10-14 05:51:14,430 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1612641.3333333333, ans=0.125 2023-10-14 05:51:47,518 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.431e+02 1.774e+02 1.918e+02 2.120e+02 2.945e+02, threshold=3.835e+02, percent-clipped=0.0 2023-10-14 05:51:49,711 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1612781.3333333333, ans=0.125 2023-10-14 05:51:50,013 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.00 vs. 
limit=12.0 2023-10-14 05:51:55,844 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1612781.3333333333, ans=0.1 2023-10-14 05:51:57,165 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1612781.3333333333, ans=0.0 2023-10-14 05:52:00,735 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1612828.0, ans=0.125 2023-10-14 05:52:00,879 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1612828.0, ans=0.125 2023-10-14 05:52:14,949 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 05:52:15,263 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.56 vs. limit=15.0 2023-10-14 05:52:36,109 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1612968.0, ans=0.0 2023-10-14 05:52:43,329 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1612968.0, ans=0.125 2023-10-14 05:53:05,077 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.10 vs. limit=15.0 2023-10-14 05:53:37,584 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1613201.3333333333, ans=0.0 2023-10-14 05:53:39,463 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1613201.3333333333, ans=0.2 2023-10-14 05:53:40,047 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.449e+02 1.764e+02 1.956e+02 2.128e+02 2.788e+02, threshold=3.912e+02, percent-clipped=0.0 2023-10-14 05:53:55,742 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1613294.6666666667, ans=0.125 2023-10-14 05:54:02,855 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=10.84 vs. limit=22.5 2023-10-14 05:54:07,240 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1613341.3333333333, ans=0.0 2023-10-14 05:54:15,042 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1613388.0, ans=0.05 2023-10-14 05:54:16,570 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1613388.0, ans=0.0 2023-10-14 05:54:40,080 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1613481.3333333333, ans=0.125 2023-10-14 05:54:54,059 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.05 vs. 
limit=15.0 2023-10-14 05:55:03,290 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1613574.6666666667, ans=0.125 2023-10-14 05:55:25,862 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1613668.0, ans=0.0 2023-10-14 05:55:27,951 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1613668.0, ans=0.1 2023-10-14 05:55:32,120 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.673e+02 2.022e+02 2.203e+02 2.357e+02 3.298e+02, threshold=4.406e+02, percent-clipped=0.0 2023-10-14 05:55:38,046 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1613714.6666666667, ans=0.125 2023-10-14 05:55:48,973 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1613761.3333333333, ans=0.1 2023-10-14 05:55:49,136 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.61 vs. limit=22.5 2023-10-14 05:56:02,722 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1613808.0, ans=0.125 2023-10-14 05:56:13,275 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1613808.0, ans=0.1 2023-10-14 05:56:15,432 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1613808.0, ans=0.2 2023-10-14 05:56:34,831 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1613901.3333333333, ans=0.1 2023-10-14 05:56:37,476 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 05:57:06,426 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1614041.3333333333, ans=0.07 2023-10-14 05:57:14,442 INFO [train.py:1031] (3/4) Epoch 26, batch 4500, loss[loss=0.1723, simple_loss=0.2684, pruned_loss=0.03814, over 16912.00 frames. ], tot_loss[loss=0.1863, simple_loss=0.2783, pruned_loss=0.04716, over 29331719.52 frames. ], batch size: 77, lr: 1.32e-03, grad_scale: 32.0 2023-10-14 05:57:14,771 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1614088.0, ans=0.0 2023-10-14 05:57:22,880 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=5.55 vs. 
limit=15.0 2023-10-14 05:57:32,095 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1614134.6666666667, ans=0.125 2023-10-14 05:57:34,150 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.453e+02 1.790e+02 1.968e+02 2.286e+02 3.128e+02, threshold=3.936e+02, percent-clipped=0.0 2023-10-14 05:58:12,359 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1614321.3333333333, ans=0.125 2023-10-14 05:58:55,690 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1614508.0, ans=0.0 2023-10-14 05:59:05,706 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.63 vs. limit=15.0 2023-10-14 05:59:21,763 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.591e+02 1.832e+02 1.946e+02 2.212e+02 3.184e+02, threshold=3.893e+02, percent-clipped=0.0 2023-10-14 05:59:25,837 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1614648.0, ans=0.2 2023-10-14 05:59:44,150 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=9.67 vs. limit=15.0 2023-10-14 05:59:48,676 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1614741.3333333333, ans=0.2 2023-10-14 05:59:52,833 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.88 vs. limit=10.0 2023-10-14 06:00:11,536 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.47 vs. limit=22.5 2023-10-14 06:00:17,110 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1614881.3333333333, ans=0.0 2023-10-14 06:00:27,082 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten.whitening_limit, batch_count=1614881.3333333333, ans=15.0 2023-10-14 06:00:28,219 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.23 vs. limit=15.0 2023-10-14 06:00:40,086 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1614974.6666666667, ans=0.07 2023-10-14 06:00:43,931 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1614974.6666666667, ans=0.0 2023-10-14 06:00:46,343 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1614974.6666666667, ans=0.125 2023-10-14 06:00:57,246 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1615021.3333333333, ans=0.125 2023-10-14 06:01:05,670 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.55 vs. 
limit=12.0 2023-10-14 06:01:10,720 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.564e+02 1.815e+02 1.977e+02 2.204e+02 3.336e+02, threshold=3.954e+02, percent-clipped=0.0 2023-10-14 06:01:34,730 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1615208.0, ans=0.0 2023-10-14 06:01:37,786 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1615208.0, ans=0.125 2023-10-14 06:01:38,881 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1615208.0, ans=0.125 2023-10-14 06:01:56,913 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1615301.3333333333, ans=0.0 2023-10-14 06:01:57,919 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1615301.3333333333, ans=0.125 2023-10-14 06:02:32,886 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1615441.3333333333, ans=0.125 2023-10-14 06:02:45,657 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1615488.0, ans=0.0 2023-10-14 06:02:47,102 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.59 vs. limit=6.0 2023-10-14 06:02:48,260 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1615488.0, ans=0.125 2023-10-14 06:03:05,693 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.620e+02 1.814e+02 2.000e+02 2.217e+02 2.949e+02, threshold=4.000e+02, percent-clipped=0.0 2023-10-14 06:03:27,981 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1615628.0, ans=0.125 2023-10-14 06:03:31,028 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1615674.6666666667, ans=0.0 2023-10-14 06:03:36,646 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1615674.6666666667, ans=0.125 2023-10-14 06:03:58,164 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 06:03:58,288 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1615768.0, ans=0.125 2023-10-14 06:04:19,913 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1615861.3333333333, ans=0.125 2023-10-14 06:04:24,518 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1615861.3333333333, ans=0.0 2023-10-14 06:04:31,878 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.59 vs. 
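
The "Whitening: name=..., metric=M vs. limit=L" records are diagnostics from scaling.py's whitening modules: a metric near 1.0 indicates the activations' per-group channel covariance is close to isotropic ("white"), and a penalty is applied only when the metric exceeds the limit. One plausible metric with exactly this behavior is the ratio of the mean squared covariance eigenvalue to the squared mean eigenvalue; this sketch is an assumption about the formula, not a transcription of scaling.py:

    import torch

    def whitening_metric(x: torch.Tensor) -> torch.Tensor:
        # x: (num_frames, num_channels) activations for one whitening group.
        x = x - x.mean(dim=0)
        cov = (x.t() @ x) / x.shape[0]
        eigs = torch.linalg.eigvalsh(cov)
        # Equals 1.0 iff all eigenvalues are equal (perfectly "white"),
        # and grows as the covariance becomes more anisotropic.
        return (eigs ** 2).mean() / (eigs.mean() ** 2)

Under this reading, a record like "metric=4.68 vs. limit=15.0" leaves the activations untouched, while a metric above the limit would trigger a whitening penalty on the backward pass.
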
limit=6.0 2023-10-14 06:04:59,914 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.574e+02 1.767e+02 1.917e+02 2.168e+02 3.169e+02, threshold=3.834e+02, percent-clipped=0.0 2023-10-14 06:05:04,169 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1616048.0, ans=0.1 2023-10-14 06:05:07,033 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.61 vs. limit=15.0 2023-10-14 06:05:14,918 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1616094.6666666667, ans=0.125 2023-10-14 06:05:41,638 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1616188.0, ans=0.125 2023-10-14 06:06:06,599 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1616281.3333333333, ans=0.125 2023-10-14 06:06:31,405 INFO [train.py:1031] (3/4) Epoch 26, batch 5000, loss[loss=0.1782, simple_loss=0.2689, pruned_loss=0.04375, over 15793.00 frames. ], tot_loss[loss=0.1865, simple_loss=0.2783, pruned_loss=0.04736, over 30100886.69 frames. ], batch size: 35, lr: 1.32e-03, grad_scale: 16.0 2023-10-14 06:06:37,459 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.66 vs. limit=22.5 2023-10-14 06:06:44,916 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.90 vs. limit=22.5 2023-10-14 06:06:49,212 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1616468.0, ans=0.0 2023-10-14 06:06:52,636 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1616468.0, ans=0.015 2023-10-14 06:06:55,878 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.643e+02 1.892e+02 2.053e+02 2.221e+02 2.952e+02, threshold=4.106e+02, percent-clipped=0.0 2023-10-14 06:07:20,945 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1616608.0, ans=0.125 2023-10-14 06:07:24,630 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1616608.0, ans=0.125 2023-10-14 06:07:26,209 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1616654.6666666667, ans=0.125 2023-10-14 06:07:27,842 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 06:07:43,934 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1616701.3333333333, ans=0.0 2023-10-14 06:07:56,555 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1616748.0, ans=0.0 2023-10-14 06:08:02,499 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.32 vs. 
limit=12.0 2023-10-14 06:08:03,147 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1616794.6666666667, ans=0.0 2023-10-14 06:08:34,121 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1616888.0, ans=0.125 2023-10-14 06:08:48,108 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1616981.3333333333, ans=0.09899494936611666 2023-10-14 06:08:49,597 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.531e+02 1.799e+02 1.949e+02 2.163e+02 2.934e+02, threshold=3.897e+02, percent-clipped=0.0 2023-10-14 06:09:21,017 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1617074.6666666667, ans=0.125 2023-10-14 06:09:24,707 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1617121.3333333333, ans=0.125 2023-10-14 06:09:30,541 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=1617121.3333333333, ans=0.025 2023-10-14 06:09:38,467 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1617168.0, ans=0.2 2023-10-14 06:09:44,432 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1617214.6666666667, ans=0.125 2023-10-14 06:09:45,588 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1617214.6666666667, ans=0.2 2023-10-14 06:10:10,250 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1617308.0, ans=0.125 2023-10-14 06:10:32,694 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.95 vs. limit=15.0 2023-10-14 06:10:38,492 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.469e+02 1.924e+02 2.155e+02 2.412e+02 3.044e+02, threshold=4.311e+02, percent-clipped=0.0 2023-10-14 06:10:50,790 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1617494.6666666667, ans=0.125 2023-10-14 06:11:15,435 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=3.57 vs. limit=12.0 2023-10-14 06:11:34,990 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.38 vs. 
limit=5.0 2023-10-14 06:11:39,042 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=1617634.6666666667, ans=0.07 2023-10-14 06:11:44,193 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1617681.3333333333, ans=0.0 2023-10-14 06:11:54,995 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1617728.0, ans=0.125 2023-10-14 06:12:04,645 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1617774.6666666667, ans=0.0 2023-10-14 06:12:40,135 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.435e+02 1.716e+02 1.866e+02 2.062e+02 3.100e+02, threshold=3.732e+02, percent-clipped=0.0 2023-10-14 06:12:59,318 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1617961.3333333333, ans=0.125 2023-10-14 06:13:11,602 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.85 vs. limit=15.0 2023-10-14 06:13:15,977 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1618054.6666666667, ans=0.125 2023-10-14 06:13:19,354 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.45 vs. limit=15.0 2023-10-14 06:13:30,908 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=3.89 vs. limit=10.0 2023-10-14 06:13:49,141 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1618148.0, ans=0.125 2023-10-14 06:13:56,937 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1618194.6666666667, ans=0.125 2023-10-14 06:14:14,195 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1618288.0, ans=0.125 2023-10-14 06:14:18,022 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-14 06:14:30,303 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1618334.6666666667, ans=0.125 2023-10-14 06:14:33,900 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.437e+02 1.697e+02 1.896e+02 2.092e+02 2.963e+02, threshold=3.792e+02, percent-clipped=0.0 2023-10-14 06:14:48,468 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1618428.0, ans=0.0 2023-10-14 06:14:51,716 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1618428.0, ans=0.1 2023-10-14 06:15:06,909 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.08 vs. 
limit=15.0 2023-10-14 06:15:15,337 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1618521.3333333333, ans=0.0 2023-10-14 06:15:16,775 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1618521.3333333333, ans=0.07 2023-10-14 06:15:32,571 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1618614.6666666667, ans=0.07 2023-10-14 06:15:37,685 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.42 vs. limit=15.0 2023-10-14 06:15:59,838 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1618754.6666666667, ans=0.2 2023-10-14 06:16:00,513 INFO [train.py:1031] (3/4) Epoch 26, batch 5500, loss[loss=0.2104, simple_loss=0.2922, pruned_loss=0.06435, over 16363.00 frames. ], tot_loss[loss=0.1862, simple_loss=0.2781, pruned_loss=0.04717, over 30710880.50 frames. ], batch size: 50, lr: 1.32e-03, grad_scale: 32.0 2023-10-14 06:16:20,859 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.514e+02 1.803e+02 1.924e+02 2.139e+02 2.974e+02, threshold=3.849e+02, percent-clipped=0.0 2023-10-14 06:16:27,908 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1618848.0, ans=0.2 2023-10-14 06:16:31,488 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.36 vs. limit=15.0 2023-10-14 06:16:35,340 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1618894.6666666667, ans=0.125 2023-10-14 06:16:41,949 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.37 vs. 
limit=15.0 2023-10-14 06:16:53,261 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1618988.0, ans=0.2 2023-10-14 06:16:56,400 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1618988.0, ans=0.0 2023-10-14 06:17:01,415 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1619034.6666666667, ans=0.125 2023-10-14 06:17:17,706 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1619081.3333333333, ans=0.1 2023-10-14 06:17:26,202 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1619128.0, ans=0.0 2023-10-14 06:17:48,614 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1619221.3333333333, ans=10.0 2023-10-14 06:17:54,718 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.min_positive, batch_count=1619221.3333333333, ans=0.05 2023-10-14 06:18:01,920 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1619268.0, ans=0.09899494936611666 2023-10-14 06:18:02,211 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.44 vs. limit=15.0 2023-10-14 06:18:08,741 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.599e+02 1.888e+02 2.132e+02 2.477e+02 4.389e+02, threshold=4.263e+02, percent-clipped=2.0 2023-10-14 06:18:20,369 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1619361.3333333333, ans=0.0 2023-10-14 06:18:24,599 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1619361.3333333333, ans=0.1 2023-10-14 06:18:32,838 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1619408.0, ans=0.2 2023-10-14 06:18:40,870 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1619454.6666666667, ans=0.125 2023-10-14 06:18:45,710 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.28 vs. limit=15.0 2023-10-14 06:18:50,730 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1619501.3333333333, ans=0.1 2023-10-14 06:19:12,658 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1619548.0, ans=0.0 2023-10-14 06:19:19,546 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1619594.6666666667, ans=0.125 2023-10-14 06:19:32,426 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1619641.3333333333, ans=0.125 2023-10-14 06:19:41,047 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.24 vs. 
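
The bulk of the records in this section are "ScheduledFloat: name=..., batch_count=..., ans=..." lines: scaling.py re-evaluates schedule-controlled hyperparameters (dropout probabilities, skip rates, balancer probs) as a function of the global batch count and logs the current value as ans. A piecewise-linear schedule keyed on batch count reproduces this behavior; whether scaling.py interpolates exactly this way is an assumption, and the breakpoints below are illustrative, not the ones used in this run:

    def scheduled_float(batch_count: float, schedule: list[tuple[float, float]]) -> float:
        # schedule: sorted (batch_count, value) breakpoints; the value is
        # held constant before the first and after the last breakpoint.
        if batch_count <= schedule[0][0]:
            return schedule[0][1]
        for (x0, y0), (x1, y1) in zip(schedule, schedule[1:]):
            if batch_count <= x1:
                return y0 + (batch_count - x0) * (y1 - y0) / (x1 - x0)
        return schedule[-1][1]

    # Illustrative: a dropout annealed from 0.3 to 0.1 over the first 20k
    # batches has long since flattened out by batch_count ~ 1.6e6, so the
    # log keeps printing the final value:
    # scheduled_float(1620248.0, [(0.0, 0.3), (20000.0, 0.1)])  ->  0.1
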
limit=15.0 2023-10-14 06:19:49,670 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.13 vs. limit=12.0 2023-10-14 06:19:55,053 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1619734.6666666667, ans=0.5 2023-10-14 06:20:06,538 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.597e+02 1.896e+02 2.039e+02 2.267e+02 3.096e+02, threshold=4.078e+02, percent-clipped=0.0 2023-10-14 06:20:27,002 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1619874.6666666667, ans=0.2 2023-10-14 06:20:30,406 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.58 vs. limit=12.0 2023-10-14 06:20:56,062 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1619968.0, ans=0.125 2023-10-14 06:20:58,738 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1619968.0, ans=0.125 2023-10-14 06:20:59,739 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1620014.6666666667, ans=0.1 2023-10-14 06:21:45,595 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1620154.6666666667, ans=0.0 2023-10-14 06:21:52,204 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1620201.3333333333, ans=0.125 2023-10-14 06:21:59,167 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1620248.0, ans=0.07 2023-10-14 06:22:01,671 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1620248.0, ans=0.125 2023-10-14 06:22:03,204 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.553e+02 1.794e+02 1.935e+02 2.097e+02 2.642e+02, threshold=3.870e+02, percent-clipped=0.0 2023-10-14 06:22:07,395 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1620248.0, ans=0.1 2023-10-14 06:22:07,500 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1620248.0, ans=0.1 2023-10-14 06:22:08,345 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1620248.0, ans=0.125 2023-10-14 06:22:23,161 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1620341.3333333333, ans=0.1 2023-10-14 06:22:26,153 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1620341.3333333333, ans=0.125 2023-10-14 06:22:47,644 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.89 vs. limit=15.0 2023-10-14 06:23:11,444 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=4.67 vs. 
limit=15.0 2023-10-14 06:23:31,981 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1620621.3333333333, ans=0.1 2023-10-14 06:23:34,480 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1620621.3333333333, ans=0.125 2023-10-14 06:23:39,295 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 06:23:40,652 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1620668.0, ans=0.1 2023-10-14 06:23:53,410 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.579e+02 1.862e+02 2.022e+02 2.235e+02 3.358e+02, threshold=4.044e+02, percent-clipped=0.0 2023-10-14 06:24:00,059 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1620714.6666666667, ans=0.125 2023-10-14 06:24:09,233 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1620761.3333333333, ans=0.2 2023-10-14 06:24:14,776 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1620808.0, ans=0.125 2023-10-14 06:24:25,429 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1620854.6666666667, ans=0.125 2023-10-14 06:24:58,866 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1620994.6666666667, ans=0.125 2023-10-14 06:24:59,201 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=5.48 vs. limit=15.0 2023-10-14 06:25:10,360 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.17 vs. limit=22.5 2023-10-14 06:25:15,917 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1621041.3333333333, ans=0.125 2023-10-14 06:25:17,819 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1621041.3333333333, ans=0.0 2023-10-14 06:25:21,783 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.25 vs. limit=10.0 2023-10-14 06:25:23,322 INFO [train.py:1031] (3/4) Epoch 26, batch 6000, loss[loss=0.1818, simple_loss=0.2816, pruned_loss=0.04095, over 16878.00 frames. ], tot_loss[loss=0.1869, simple_loss=0.2786, pruned_loss=0.04758, over 31184298.64 frames. 
], batch size: 98, lr: 1.32e-03, grad_scale: 32.0 2023-10-14 06:25:30,241 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1621088.0, ans=0.035 2023-10-14 06:25:38,240 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1621134.6666666667, ans=0.1 2023-10-14 06:25:43,883 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1621181.3333333333, ans=0.125 2023-10-14 06:25:46,011 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.441e+02 1.875e+02 2.056e+02 2.258e+02 3.497e+02, threshold=4.111e+02, percent-clipped=0.0 2023-10-14 06:26:25,657 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.15 vs. limit=12.0 2023-10-14 06:27:05,460 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1621508.0, ans=0.125 2023-10-14 06:27:24,138 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1621601.3333333333, ans=0.125 2023-10-14 06:27:25,086 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1621601.3333333333, ans=0.2 2023-10-14 06:27:36,984 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.556e+02 1.796e+02 1.924e+02 2.109e+02 2.790e+02, threshold=3.848e+02, percent-clipped=0.0 2023-10-14 06:28:14,490 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1621788.0, ans=0.07 2023-10-14 06:28:23,225 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1621834.6666666667, ans=0.2 2023-10-14 06:28:26,696 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1621834.6666666667, ans=0.125 2023-10-14 06:28:45,488 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1621928.0, ans=0.125 2023-10-14 06:29:18,228 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1622068.0, ans=0.125 2023-10-14 06:29:27,356 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1622068.0, ans=0.0 2023-10-14 06:29:31,014 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.435e+02 1.878e+02 2.080e+02 2.310e+02 3.226e+02, threshold=4.161e+02, percent-clipped=0.0 2023-10-14 06:30:00,357 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 06:30:07,726 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1622254.6666666667, ans=0.125 2023-10-14 06:30:16,350 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1622301.3333333333, ans=0.2 2023-10-14 06:30:24,956 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.47 vs. 
limit=5.0 2023-10-14 06:30:28,003 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.99 vs. limit=22.5 2023-10-14 06:30:30,803 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.53 vs. limit=6.0 2023-10-14 06:30:35,655 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1622348.0, ans=0.09899494936611666 2023-10-14 06:30:36,990 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.09 vs. limit=15.0 2023-10-14 06:30:43,065 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1622394.6666666667, ans=0.0 2023-10-14 06:30:46,924 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=1622394.6666666667, ans=6.0 2023-10-14 06:31:04,032 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1622488.0, ans=0.125 2023-10-14 06:31:10,818 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1622534.6666666667, ans=0.125 2023-10-14 06:31:25,700 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.557e+02 1.919e+02 2.102e+02 2.327e+02 2.844e+02, threshold=4.204e+02, percent-clipped=0.0 2023-10-14 06:31:27,935 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1622581.3333333333, ans=0.125 2023-10-14 06:32:29,594 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.18 vs. limit=15.0 2023-10-14 06:32:47,782 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=14.38 vs. limit=15.0 2023-10-14 06:33:00,377 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1622954.6666666667, ans=0.125 2023-10-14 06:33:00,433 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1622954.6666666667, ans=0.125 2023-10-14 06:33:03,741 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.66 vs. 
limit=22.5 2023-10-14 06:33:04,185 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1622954.6666666667, ans=0.125 2023-10-14 06:33:24,177 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1623048.0, ans=0.035 2023-10-14 06:33:24,922 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.503e+02 1.749e+02 1.932e+02 2.162e+02 3.367e+02, threshold=3.864e+02, percent-clipped=0.0 2023-10-14 06:33:26,272 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1623048.0, ans=0.0 2023-10-14 06:34:20,747 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1623281.3333333333, ans=0.125 2023-10-14 06:34:31,562 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1623328.0, ans=0.1 2023-10-14 06:34:38,048 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1623328.0, ans=0.0 2023-10-14 06:34:42,642 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.64 vs. limit=15.0 2023-10-14 06:34:45,929 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1623374.6666666667, ans=0.2 2023-10-14 06:34:52,063 INFO [train.py:1031] (3/4) Epoch 26, batch 6500, loss[loss=0.1811, simple_loss=0.2777, pruned_loss=0.04227, over 15222.00 frames. ], tot_loss[loss=0.187, simple_loss=0.279, pruned_loss=0.0475, over 31558680.77 frames. ], batch size: 35, lr: 1.32e-03, grad_scale: 16.0 2023-10-14 06:34:52,669 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1623421.3333333333, ans=0.0 2023-10-14 06:35:05,977 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1623468.0, ans=0.125 2023-10-14 06:35:15,073 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1623468.0, ans=0.0 2023-10-14 06:35:23,113 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.473e+02 1.852e+02 2.047e+02 2.249e+02 2.927e+02, threshold=4.094e+02, percent-clipped=0.0 2023-10-14 06:35:38,595 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1623561.3333333333, ans=0.125 2023-10-14 06:35:40,623 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1623561.3333333333, ans=0.125 2023-10-14 06:35:50,251 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.16 vs. 
limit=22.5 2023-10-14 06:35:51,938 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1623608.0, ans=0.125 2023-10-14 06:36:18,440 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1623701.3333333333, ans=0.125 2023-10-14 06:36:54,276 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1623888.0, ans=0.1 2023-10-14 06:36:54,316 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1623888.0, ans=0.125 2023-10-14 06:37:06,064 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1623934.6666666667, ans=0.0 2023-10-14 06:37:11,250 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1623934.6666666667, ans=0.0 2023-10-14 06:37:22,121 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.595e+02 1.833e+02 2.027e+02 2.266e+02 3.623e+02, threshold=4.054e+02, percent-clipped=0.0 2023-10-14 06:37:29,377 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.84 vs. limit=6.0 2023-10-14 06:37:39,728 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.12 vs. limit=15.0 2023-10-14 06:37:41,269 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1624074.6666666667, ans=0.0 2023-10-14 06:37:48,760 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1624121.3333333333, ans=0.0 2023-10-14 06:38:13,490 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=1624214.6666666667, ans=0.05 2023-10-14 06:38:23,851 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1624261.3333333333, ans=0.125 2023-10-14 06:38:37,623 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1624308.0, ans=0.125 2023-10-14 06:38:43,337 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1624354.6666666667, ans=0.125 2023-10-14 06:38:48,165 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1624354.6666666667, ans=0.125 2023-10-14 06:38:53,589 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1624401.3333333333, ans=0.125 2023-10-14 06:38:59,576 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1624401.3333333333, ans=0.1 2023-10-14 06:39:08,206 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.492e+02 1.801e+02 1.976e+02 2.163e+02 3.359e+02, threshold=3.953e+02, percent-clipped=0.0 2023-10-14 06:39:17,028 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1624494.6666666667, ans=0.07 2023-10-14 06:39:28,740 
INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1624541.3333333333, ans=0.0 2023-10-14 06:39:40,228 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1624588.0, ans=0.1 2023-10-14 06:39:43,954 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1624588.0, ans=0.1 2023-10-14 06:39:57,350 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.86 vs. limit=15.0 2023-10-14 06:40:03,327 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.38 vs. limit=12.0 2023-10-14 06:40:07,278 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1624681.3333333333, ans=0.0 2023-10-14 06:40:15,403 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1624728.0, ans=0.125 2023-10-14 06:40:20,397 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=1624728.0, ans=0.07 2023-10-14 06:40:33,574 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.74 vs. limit=6.0 2023-10-14 06:41:06,181 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1624868.0, ans=0.0 2023-10-14 06:41:19,401 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.361e+02 1.814e+02 1.967e+02 2.232e+02 3.554e+02, threshold=3.935e+02, percent-clipped=0.0 2023-10-14 06:41:23,156 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.23 vs. limit=6.0 2023-10-14 06:41:33,543 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1624961.3333333333, ans=0.1 2023-10-14 06:41:37,646 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1625008.0, ans=0.1 2023-10-14 06:41:38,549 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1625008.0, ans=0.0 2023-10-14 06:41:48,442 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1625054.6666666667, ans=0.0 2023-10-14 06:41:50,644 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1625054.6666666667, ans=0.125 2023-10-14 06:41:56,494 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.78 vs. 
limit=6.0 2023-10-14 06:42:15,697 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1625148.0, ans=0.1 2023-10-14 06:42:24,445 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1625194.6666666667, ans=0.2 2023-10-14 06:42:37,544 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1625241.3333333333, ans=0.2 2023-10-14 06:42:46,625 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1625288.0, ans=0.1 2023-10-14 06:43:05,636 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1625334.6666666667, ans=0.125 2023-10-14 06:43:10,120 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1625381.3333333333, ans=0.2 2023-10-14 06:43:13,667 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.471e+02 1.724e+02 1.850e+02 2.104e+02 2.895e+02, threshold=3.699e+02, percent-clipped=0.0 2023-10-14 06:43:24,733 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1625428.0, ans=0.1 2023-10-14 06:43:36,488 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1625474.6666666667, ans=0.1 2023-10-14 06:43:37,984 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1625474.6666666667, ans=0.0 2023-10-14 06:43:42,013 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1625521.3333333333, ans=0.1 2023-10-14 06:43:47,056 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.58 vs. limit=15.0 2023-10-14 06:43:47,599 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1625521.3333333333, ans=0.125 2023-10-14 06:43:56,217 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1625568.0, ans=0.125 2023-10-14 06:44:29,000 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1625708.0, ans=0.125 2023-10-14 06:44:33,756 INFO [train.py:1031] (3/4) Epoch 26, batch 7000, loss[loss=0.1929, simple_loss=0.2868, pruned_loss=0.04956, over 16851.00 frames. ], tot_loss[loss=0.1871, simple_loss=0.2794, pruned_loss=0.04734, over 31863406.11 frames. ], batch size: 155, lr: 1.32e-03, grad_scale: 32.0 2023-10-14 06:44:36,002 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.71 vs. 
2023-10-14 06:44:40,457 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1625754.6666666667, ans=0.125
2023-10-14 06:44:44,472 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1625754.6666666667, ans=0.0
2023-10-14 06:44:55,742 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1625801.3333333333, ans=0.95
2023-10-14 06:45:04,420 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.557e+02 1.846e+02 2.034e+02 2.154e+02 3.020e+02, threshold=4.069e+02, percent-clipped=0.0
2023-10-14 06:45:43,307 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.64 vs. limit=15.0
2023-10-14 06:46:14,358 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1626128.0, ans=0.1
2023-10-14 06:46:28,373 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1626221.3333333333, ans=0.125
2023-10-14 06:46:41,943 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1626268.0, ans=0.0
2023-10-14 06:46:56,225 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.427e+02 1.908e+02 2.092e+02 2.328e+02 3.399e+02, threshold=4.183e+02, percent-clipped=0.0
2023-10-14 06:47:01,655 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.20 vs. limit=15.0
2023-10-14 06:47:12,477 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1626361.3333333333, ans=0.1
2023-10-14 06:47:20,407 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.03 vs. limit=15.0
2023-10-14 06:47:23,103 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1.whitening_limit, batch_count=1626408.0, ans=10.0
2023-10-14 06:47:23,900 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1626408.0, ans=0.0
2023-10-14 06:47:25,818 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1626454.6666666667, ans=0.0
2023-10-14 06:47:38,525 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten.whitening_limit, batch_count=1626501.3333333333, ans=15.0
2023-10-14 06:47:40,296 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.36 vs. limit=12.0
2023-10-14 06:47:44,840 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1626501.3333333333, ans=0.0
2023-10-14 06:47:56,698 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.23 vs. limit=15.0
2023-10-14 06:48:03,846 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1626594.6666666667, ans=0.125
2023-10-14 06:48:57,678 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.490e+02 1.749e+02 1.846e+02 2.013e+02 2.849e+02, threshold=3.693e+02, percent-clipped=0.0
2023-10-14 06:49:19,830 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1626874.6666666667, ans=0.125
2023-10-14 06:49:20,223 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.36 vs. limit=6.0
2023-10-14 06:49:41,388 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1626968.0, ans=0.125
2023-10-14 06:49:59,134 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1627014.6666666667, ans=0.125
2023-10-14 06:50:07,643 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1627061.3333333333, ans=0.0
2023-10-14 06:50:19,765 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1627108.0, ans=0.0
2023-10-14 06:50:47,038 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1627201.3333333333, ans=0.125
2023-10-14 06:50:54,670 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.84 vs. limit=15.0
2023-10-14 06:50:58,699 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.512e+02 1.825e+02 1.930e+02 2.136e+02 3.235e+02, threshold=3.859e+02, percent-clipped=0.0
2023-10-14 06:50:59,169 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1627248.0, ans=0.125
2023-10-14 06:50:59,368 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.24 vs. limit=15.0
2023-10-14 06:51:12,164 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1627294.6666666667, ans=0.0
2023-10-14 06:51:23,948 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1627341.3333333333, ans=0.1
2023-10-14 06:51:28,093 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1627388.0, ans=0.125
2023-10-14 06:51:31,724 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1627388.0, ans=0.0
2023-10-14 06:51:44,922 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1627434.6666666667, ans=0.125
2023-10-14 06:51:52,432 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_na.min_abs, batch_count=1627481.3333333333, ans=0.02
2023-10-14 06:52:09,686 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1627528.0, ans=0.125
2023-10-14 06:52:22,345 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten.whitening_limit, batch_count=1627621.3333333333, ans=15.0
2023-10-14 06:52:35,972 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1627668.0, ans=0.0
2023-10-14 06:52:48,122 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.514e+02 1.873e+02 2.049e+02 2.304e+02 3.204e+02, threshold=4.098e+02, percent-clipped=0.0
2023-10-14 06:52:54,760 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.35 vs. limit=22.5
2023-10-14 06:53:06,704 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.53 vs. limit=15.0
2023-10-14 06:53:09,168 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1627808.0, ans=0.0
2023-10-14 06:53:16,732 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1627854.6666666667, ans=0.2
2023-10-14 06:53:25,344 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1627901.3333333333, ans=0.125
2023-10-14 06:53:34,286 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1627948.0, ans=0.1
2023-10-14 06:53:38,372 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-14 06:53:39,313 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1627948.0, ans=0.125
2023-10-14 06:53:48,900 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.40 vs. limit=10.0
2023-10-14 06:54:06,518 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1628088.0, ans=0.09899494936611666
2023-10-14 06:54:07,126 INFO [train.py:1031] (3/4) Epoch 26, batch 7500, loss[loss=0.187, simple_loss=0.2853, pruned_loss=0.04436, over 16911.00 frames. ], tot_loss[loss=0.187, simple_loss=0.2793, pruned_loss=0.0474, over 32050778.32 frames. ], batch size: 165, lr: 1.32e-03, grad_scale: 16.0
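
Each train.py:1031 summary pairs the current batch's loss (loss[...]) with a running aggregate (tot_loss[...]) over the frames counted so far; the fractional frame totals (e.g. 32050778.32) suggest some decaying weighting rather than a plain sum. A minimal sketch of the simplest frame-weighted variant, illustrative only and not the actual train.py bookkeeping:

    class FrameWeightedLoss:
        """Running loss average weighted by frame count, mirroring the
        'tot_loss[..., over N frames.]' fields in its simplest form."""

        def __init__(self):
            self.loss_sum = 0.0
            self.frames = 0.0

        def update(self, loss: float, num_frames: float) -> float:
            self.loss_sum += loss * num_frames
            self.frames += num_frames
            return self.loss_sum / self.frames   # current aggregate loss

    tot = FrameWeightedLoss()
    tot.update(0.187, 16911.0)   # batch 7500's contribution from the line above
    print(tot.frames)            # grows toward the logged 'over ... frames' total
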
2023-10-14 06:54:21,011 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1628134.6666666667, ans=0.125
2023-10-14 06:54:34,205 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.480e+02 1.837e+02 1.980e+02 2.200e+02 4.370e+02, threshold=3.961e+02, percent-clipped=1.0
2023-10-14 06:54:52,120 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.65 vs. limit=22.5
2023-10-14 06:54:55,888 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1628274.6666666667, ans=0.0
2023-10-14 06:54:59,498 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.72 vs. limit=15.0
2023-10-14 06:55:10,808 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.31 vs. limit=22.5
2023-10-14 06:55:43,211 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1628461.3333333333, ans=0.125
2023-10-14 06:55:50,122 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1628508.0, ans=0.125
2023-10-14 06:56:03,238 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1628554.6666666667, ans=0.0
2023-10-14 06:56:07,133 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.22 vs. limit=6.0
2023-10-14 06:56:12,549 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1628601.3333333333, ans=0.0
2023-10-14 06:56:12,994 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.35 vs. limit=15.0
2023-10-14 06:56:32,231 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.436e+02 1.883e+02 2.051e+02 2.276e+02 3.072e+02, threshold=4.103e+02, percent-clipped=0.0
2023-10-14 06:56:37,965 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1628648.0, ans=0.125
2023-10-14 06:57:01,360 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1628741.3333333333, ans=0.1
2023-10-14 06:57:09,074 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1628788.0, ans=0.125
2023-10-14 06:57:18,774 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.79 vs. limit=10.0
2023-10-14 06:57:37,228 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1628881.3333333333, ans=0.125
2023-10-14 06:57:51,667 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1628928.0, ans=0.0
2023-10-14 06:58:23,486 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.45 vs. limit=15.0
2023-10-14 06:58:34,982 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=1629114.6666666667, ans=0.5
2023-10-14 06:58:35,621 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.469e+02 1.783e+02 1.973e+02 2.220e+02 3.119e+02, threshold=3.945e+02, percent-clipped=0.0
2023-10-14 06:58:49,700 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.73 vs. limit=6.0
2023-10-14 06:58:52,061 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1629208.0, ans=0.0
2023-10-14 06:58:52,095 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1629208.0, ans=0.1
2023-10-14 06:59:05,117 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1629254.6666666667, ans=0.2
2023-10-14 06:59:11,789 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1629301.3333333333, ans=0.125
2023-10-14 06:59:15,360 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1629301.3333333333, ans=0.1
2023-10-14 06:59:43,062 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1629394.6666666667, ans=0.0
2023-10-14 06:59:49,348 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1629441.3333333333, ans=0.125
2023-10-14 07:00:22,693 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1629581.3333333333, ans=0.0
2023-10-14 07:00:24,155 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.18 vs. limit=15.0
2023-10-14 07:00:28,891 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.491e+02 1.838e+02 2.031e+02 2.252e+02 3.390e+02, threshold=4.061e+02, percent-clipped=0.0
2023-10-14 07:00:31,572 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1629581.3333333333, ans=0.125
2023-10-14 07:00:42,114 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.24 vs. limit=15.0
2023-10-14 07:00:50,967 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.95 vs. limit=15.0
2023-10-14 07:00:59,349 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1629721.3333333333, ans=0.1
2023-10-14 07:01:09,005 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1629768.0, ans=0.0
2023-10-14 07:01:12,364 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1629768.0, ans=0.1
2023-10-14 07:01:18,937 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.82 vs. limit=15.0
2023-10-14 07:01:21,110 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_na.min_abs, batch_count=1629814.6666666667, ans=0.02
2023-10-14 07:01:24,849 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1629814.6666666667, ans=0.125
2023-10-14 07:01:47,669 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1629908.0, ans=0.2
2023-10-14 07:02:03,049 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1629954.6666666667, ans=0.2
2023-10-14 07:02:19,506 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=1630048.0, ans=0.025
2023-10-14 07:02:27,726 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.429e+02 1.824e+02 1.978e+02 2.209e+02 2.929e+02, threshold=3.956e+02, percent-clipped=0.0
2023-10-14 07:02:36,688 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.53 vs. limit=15.0
2023-10-14 07:02:39,102 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1630094.6666666667, ans=0.0
2023-10-14 07:02:41,826 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1630094.6666666667, ans=0.0
2023-10-14 07:02:48,568 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-14 07:03:00,552 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.53 vs. limit=6.0
2023-10-14 07:03:07,893 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1630234.6666666667, ans=0.0
2023-10-14 07:03:09,751 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.56 vs. limit=15.0
2023-10-14 07:03:23,098 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.76 vs. limit=5.0
2023-10-14 07:03:25,165 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1630281.3333333333, ans=0.0
2023-10-14 07:03:27,743 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1630281.3333333333, ans=0.0
2023-10-14 07:03:27,872 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.44 vs. limit=22.5
2023-10-14 07:03:41,148 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1630328.0, ans=0.2
2023-10-14 07:03:44,079 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1630374.6666666667, ans=0.0
2023-10-14 07:03:44,849 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1630374.6666666667, ans=0.125
2023-10-14 07:03:45,251 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.86 vs. limit=15.0
2023-10-14 07:03:54,392 INFO [train.py:1031] (3/4) Epoch 26, batch 8000, loss[loss=0.147, simple_loss=0.2456, pruned_loss=0.02416, over 16868.00 frames. ], tot_loss[loss=0.1862, simple_loss=0.2787, pruned_loss=0.0469, over 32202744.19 frames. ], batch size: 98, lr: 1.32e-03, grad_scale: 32.0
2023-10-14 07:03:58,649 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1630421.3333333333, ans=0.0
2023-10-14 07:04:08,605 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1630468.0, ans=0.125
2023-10-14 07:04:08,678 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1630468.0, ans=0.1
2023-10-14 07:04:19,362 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1630514.6666666667, ans=0.2
2023-10-14 07:04:22,401 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.439e+02 1.677e+02 1.841e+02 2.094e+02 3.064e+02, threshold=3.681e+02, percent-clipped=0.0
2023-10-14 07:04:41,408 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1630608.0, ans=0.0
2023-10-14 07:04:43,407 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1630608.0, ans=0.04949747468305833
2023-10-14 07:04:44,285 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1630608.0, ans=0.125
2023-10-14 07:04:47,135 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1630654.6666666667, ans=0.0
2023-10-14 07:04:53,182 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1630654.6666666667, ans=0.0
2023-10-14 07:04:59,009 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1630701.3333333333, ans=0.125
2023-10-14 07:05:27,427 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1630794.6666666667, ans=0.015
2023-10-14 07:05:30,056 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1630794.6666666667, ans=0.0
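
In the optim.py:471 records, the five grad-norm numbers read as order statistics over a window of recent gradient norms (plausibly min, 25th, 50th, 75th percentile, and max), and the reported threshold tracks Clipping_scale times the median: 2.0 x 1.841e+02 ~= 3.681e+02 in the line just above. percent-clipped is then the share of recent norms exceeding that threshold. A sketch of that bookkeeping, an illustrative reconstruction rather than the actual icefall optimizer code:

    import torch

    def grad_norm_summary(recent_norms: torch.Tensor, clipping_scale: float = 2.0):
        """Order-statistic summary of recent gradient norms plus a clipping
        threshold of clipping_scale * median, as the log lines report."""
        q = torch.quantile(recent_norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = clipping_scale * q[2]                        # 2.0 x median
        percent_clipped = 100.0 * (recent_norms > threshold).float().mean()
        return q.tolist(), float(threshold), float(percent_clipped)

    norms = torch.tensor([143.9, 167.7, 184.1, 209.4, 306.4])    # hypothetical window
    print(grad_norm_summary(norms))
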
2023-10-14 07:06:11,070 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.545e+02 1.803e+02 1.945e+02 2.257e+02 2.887e+02, threshold=3.891e+02, percent-clipped=0.0
2023-10-14 07:06:50,549 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1631121.3333333333, ans=0.125
2023-10-14 07:07:03,330 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.11 vs. limit=15.0
2023-10-14 07:07:04,545 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1631168.0, ans=0.2
2023-10-14 07:07:17,205 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1631168.0, ans=0.2
2023-10-14 07:07:24,357 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1631214.6666666667, ans=0.125
2023-10-14 07:07:48,484 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1631308.0, ans=0.2
2023-10-14 07:08:08,023 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1631401.3333333333, ans=0.125
2023-10-14 07:08:08,124 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1631401.3333333333, ans=0.125
2023-10-14 07:08:23,752 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.536e+02 1.825e+02 2.035e+02 2.254e+02 3.659e+02, threshold=4.070e+02, percent-clipped=0.0
2023-10-14 07:08:26,145 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1631448.0, ans=0.125
2023-10-14 07:08:43,257 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-14 07:08:47,060 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1631541.3333333333, ans=0.1
2023-10-14 07:08:50,693 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1631588.0, ans=0.1
2023-10-14 07:08:59,842 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1631588.0, ans=0.125
2023-10-14 07:09:01,048 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1631634.6666666667, ans=0.1
2023-10-14 07:09:07,301 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.40 vs. limit=6.0
2023-10-14 07:09:27,509 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1631728.0, ans=0.04949747468305833
2023-10-14 07:10:20,011 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.438e+02 1.822e+02 1.986e+02 2.191e+02 2.957e+02, threshold=3.972e+02, percent-clipped=0.0
2023-10-14 07:10:23,293 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1631961.3333333333, ans=0.125
2023-10-14 07:10:41,084 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1632008.0, ans=0.125
2023-10-14 07:10:55,446 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1632101.3333333333, ans=0.0
2023-10-14 07:11:10,491 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.52 vs. limit=15.0
2023-10-14 07:11:30,234 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1632241.3333333333, ans=0.2
2023-10-14 07:11:49,709 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1632288.0, ans=0.1
2023-10-14 07:11:51,143 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1632288.0, ans=0.125
2023-10-14 07:11:53,156 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1632334.6666666667, ans=0.2
2023-10-14 07:12:13,928 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.605e+02 1.852e+02 1.974e+02 2.134e+02 3.639e+02, threshold=3.949e+02, percent-clipped=0.0
2023-10-14 07:12:21,503 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.59 vs. limit=6.0
2023-10-14 07:12:29,939 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1632474.6666666667, ans=0.2
2023-10-14 07:12:33,744 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1632474.6666666667, ans=0.0
2023-10-14 07:12:48,838 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1632521.3333333333, ans=0.0
2023-10-14 07:12:50,664 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1632568.0, ans=0.0
2023-10-14 07:12:54,705 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1632568.0, ans=0.0
2023-10-14 07:12:57,239 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1632568.0, ans=0.0
2023-10-14 07:13:04,411 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1632614.6666666667, ans=0.0
2023-10-14 07:13:34,988 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1632708.0, ans=0.2
2023-10-14 07:13:36,402 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.22 vs. limit=10.0
2023-10-14 07:13:37,939 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1632708.0, ans=0.125
2023-10-14 07:13:41,152 INFO [train.py:1031] (3/4) Epoch 26, batch 8500, loss[loss=0.1722, simple_loss=0.2755, pruned_loss=0.03445, over 16870.00 frames. ], tot_loss[loss=0.1862, simple_loss=0.2789, pruned_loss=0.04679, over 32334832.87 frames. ], batch size: 104, lr: 1.32e-03, grad_scale: 32.0
2023-10-14 07:13:52,775 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1632801.3333333333, ans=0.125
2023-10-14 07:14:12,165 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.503e+02 1.924e+02 2.121e+02 2.299e+02 3.263e+02, threshold=4.243e+02, percent-clipped=0.0
2023-10-14 07:14:15,538 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1632894.6666666667, ans=0.0
2023-10-14 07:14:33,924 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.14 vs. limit=22.5
2023-10-14 07:14:45,704 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1632988.0, ans=0.09899494936611666
2023-10-14 07:14:49,020 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1633034.6666666667, ans=0.0
2023-10-14 07:15:19,783 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1633128.0, ans=0.0
2023-10-14 07:15:33,625 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1633174.6666666667, ans=0.0
2023-10-14 07:15:36,255 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1633174.6666666667, ans=0.1
2023-10-14 07:15:42,007 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1633221.3333333333, ans=0.125
2023-10-14 07:15:42,275 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.39 vs. limit=15.0
2023-10-14 07:15:53,427 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1633268.0, ans=0.125
2023-10-14 07:15:58,634 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1633268.0, ans=0.05
2023-10-14 07:16:07,672 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1633314.6666666667, ans=0.2
2023-10-14 07:16:13,030 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1633314.6666666667, ans=0.0
2023-10-14 07:16:13,512 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.413e+02 1.783e+02 1.985e+02 2.199e+02 3.135e+02, threshold=3.970e+02, percent-clipped=0.0
2023-10-14 07:16:22,890 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1633361.3333333333, ans=0.2
2023-10-14 07:16:24,200 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.44 vs. limit=15.0
2023-10-14 07:16:30,737 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1633408.0, ans=0.0
2023-10-14 07:16:48,619 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1633454.6666666667, ans=0.125
2023-10-14 07:16:55,553 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.53 vs. limit=15.0
2023-10-14 07:16:56,339 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1633501.3333333333, ans=0.125
2023-10-14 07:17:19,874 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.52 vs. limit=15.0
2023-10-14 07:17:26,006 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1633594.6666666667, ans=0.125
2023-10-14 07:18:07,790 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1633781.3333333333, ans=0.0
2023-10-14 07:18:09,213 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.57 vs. limit=15.0
2023-10-14 07:18:12,530 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.461e+02 1.720e+02 1.918e+02 2.109e+02 2.890e+02, threshold=3.836e+02, percent-clipped=0.0
2023-10-14 07:18:20,156 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=1633828.0, ans=0.5
2023-10-14 07:18:29,995 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1633874.6666666667, ans=0.125
2023-10-14 07:18:34,816 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1633874.6666666667, ans=0.125
2023-10-14 07:18:47,592 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1633921.3333333333, ans=0.125
2023-10-14 07:18:48,831 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1633921.3333333333, ans=0.09899494936611666
2023-10-14 07:18:50,249 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1633921.3333333333, ans=0.125
2023-10-14 07:19:05,841 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=1633968.0, ans=10.0
2023-10-14 07:19:19,172 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1634061.3333333333, ans=0.125
2023-10-14 07:19:20,823 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1634061.3333333333, ans=0.125
2023-10-14 07:19:29,107 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1634061.3333333333, ans=0.0
2023-10-14 07:19:36,916 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1634108.0, ans=0.2
2023-10-14 07:19:36,920 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1634108.0, ans=0.07
2023-10-14 07:19:44,213 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.86 vs. limit=15.0
2023-10-14 07:20:05,329 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.02 vs. limit=15.0
2023-10-14 07:20:10,493 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=10.41 vs. limit=22.5
2023-10-14 07:20:11,996 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.411e+02 1.785e+02 1.870e+02 2.236e+02 3.649e+02, threshold=3.739e+02, percent-clipped=0.0
2023-10-14 07:20:15,770 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.17 vs. limit=12.0
2023-10-14 07:20:56,024 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1634481.3333333333, ans=0.1
2023-10-14 07:20:56,084 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1634481.3333333333, ans=0.0
2023-10-14 07:20:58,398 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1634481.3333333333, ans=0.1
2023-10-14 07:21:18,951 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1634574.6666666667, ans=0.125
2023-10-14 07:21:33,879 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1634621.3333333333, ans=0.125
2023-10-14 07:21:45,795 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.49 vs. limit=12.0
2023-10-14 07:21:47,738 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1634668.0, ans=0.125
2023-10-14 07:21:50,871 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1634714.6666666667, ans=0.0
2023-10-14 07:21:58,530 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1634714.6666666667, ans=0.0
2023-10-14 07:22:00,137 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.571e+02 1.838e+02 2.063e+02 2.261e+02 3.155e+02, threshold=4.125e+02, percent-clipped=0.0
2023-10-14 07:22:05,484 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1634761.3333333333, ans=0.1
2023-10-14 07:22:10,090 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1634808.0, ans=0.125
2023-10-14 07:22:44,224 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.43 vs. limit=15.0
2023-10-14 07:22:52,401 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1634948.0, ans=0.125
2023-10-14 07:23:17,750 INFO [train.py:1031] (3/4) Epoch 26, batch 9000, loss[loss=0.1767, simple_loss=0.2722, pruned_loss=0.04064, over 16600.00 frames. ], tot_loss[loss=0.1857, simple_loss=0.2783, pruned_loss=0.04655, over 32441680.50 frames. ], batch size: 56, lr: 1.32e-03, grad_scale: 16.0
2023-10-14 07:23:32,162 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.10 vs. limit=22.5
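
The Whitening records compare a per-module metric against a limit, where the metric grows as the channel covariance of that module's activations drifts away from a multiple of the identity (the whiten_keys variants group the channels first, e.g. num_groups=4, num_channels=128). The exact formula is not shown in the log; the sketch below uses one plausible proxy -- the ratio E[lambda^2] / (E[lambda])^2 over covariance eigenvalues, which is 1.0 for perfectly white features and larger otherwise -- as an assumed stand-in, not the actual scaling.py computation:

    import torch

    def whiteness_metric(x: torch.Tensor) -> float:
        """Illustrative non-whiteness score for features x of shape
        (frames, channels): eigenvalue spread of the channel covariance."""
        x = x - x.mean(dim=0, keepdim=True)
        cov = (x.T @ x) / x.shape[0]          # channel covariance
        lam = torch.linalg.eigvalsh(cov)      # real eigenvalues, ascending
        return float((lam ** 2).mean() / lam.mean() ** 2)

    feats = torch.randn(1000, 192) * torch.linspace(0.5, 2.0, 192)
    print(whiteness_metric(feats))            # > 1.0; flag if above the limit
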
2023-10-14 07:23:49,705 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.567e+02 1.838e+02 2.026e+02 2.317e+02 2.925e+02, threshold=4.051e+02, percent-clipped=0.0
2023-10-14 07:23:53,418 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1635228.0, ans=0.025
2023-10-14 07:23:54,274 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1635228.0, ans=0.2
2023-10-14 07:24:12,818 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1635321.3333333333, ans=0.125
2023-10-14 07:24:40,538 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.86 vs. limit=12.0
2023-10-14 07:24:41,447 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.45 vs. limit=15.0
2023-10-14 07:25:08,818 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1635554.6666666667, ans=0.0
2023-10-14 07:25:19,628 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1635601.3333333333, ans=0.0
2023-10-14 07:25:26,696 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.66 vs. limit=6.0
2023-10-14 07:25:33,261 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1635648.0, ans=0.015
2023-10-14 07:25:35,036 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.398e+02 1.773e+02 1.933e+02 2.293e+02 3.051e+02, threshold=3.866e+02, percent-clipped=0.0
2023-10-14 07:25:42,832 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1635694.6666666667, ans=0.125
2023-10-14 07:25:43,501 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1635694.6666666667, ans=0.125
2023-10-14 07:26:02,417 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1635788.0, ans=0.125
2023-10-14 07:26:02,466 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1635788.0, ans=0.125
2023-10-14 07:26:17,269 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1635881.3333333333, ans=0.0
2023-10-14 07:26:30,468 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1635928.0, ans=0.0
2023-10-14 07:26:44,316 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1635974.6666666667, ans=0.125
2023-10-14 07:26:55,005 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1636021.3333333333, ans=0.125
2023-10-14 07:26:58,869 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.17 vs. limit=15.0
2023-10-14 07:27:08,637 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1636114.6666666667, ans=0.125
2023-10-14 07:27:10,359 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1636114.6666666667, ans=0.035
2023-10-14 07:27:13,362 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1636114.6666666667, ans=0.0
2023-10-14 07:27:17,330 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.531e+02 1.854e+02 1.969e+02 2.199e+02 3.397e+02, threshold=3.937e+02, percent-clipped=0.0
2023-10-14 07:27:34,759 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1636208.0, ans=0.125
2023-10-14 07:27:47,613 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1636254.6666666667, ans=0.1
2023-10-14 07:27:58,865 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1636301.3333333333, ans=0.1
2023-10-14 07:27:58,978 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1636301.3333333333, ans=0.125
2023-10-14 07:27:59,853 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1636301.3333333333, ans=0.2
2023-10-14 07:28:00,834 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1636348.0, ans=0.1
2023-10-14 07:28:16,715 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1636394.6666666667, ans=0.1
2023-10-14 07:28:38,379 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff3.min_abs, batch_count=1636488.0, ans=0.2
2023-10-14 07:28:39,150 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1636488.0, ans=0.0
2023-10-14 07:28:47,079 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1636534.6666666667, ans=0.125
2023-10-14 07:28:49,830 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1636534.6666666667, ans=0.1
2023-10-14 07:29:02,237 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=1636581.3333333333, ans=15.0
2023-10-14 07:29:02,926 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1636581.3333333333, ans=0.125
2023-10-14 07:29:04,599 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.539e+02 1.909e+02 2.128e+02 2.349e+02 3.082e+02, threshold=4.255e+02, percent-clipped=0.0
2023-10-14 07:29:05,888 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1636628.0, ans=0.0
2023-10-14 07:29:34,695 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.min_positive, batch_count=1636721.3333333333, ans=0.025
2023-10-14 07:29:37,409 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1636721.3333333333, ans=0.0
2023-10-14 07:29:37,507 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1636721.3333333333, ans=0.1
2023-10-14 07:30:26,032 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1636908.0, ans=0.125
2023-10-14 07:30:48,337 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1637001.3333333333, ans=0.125
2023-10-14 07:31:02,711 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1637048.0, ans=0.0
2023-10-14 07:31:07,507 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.570e+02 1.775e+02 1.959e+02 2.152e+02 2.918e+02, threshold=3.918e+02, percent-clipped=0.0
2023-10-14 07:31:10,715 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1637094.6666666667, ans=0.0
2023-10-14 07:31:24,839 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1637141.3333333333, ans=0.125
2023-10-14 07:31:32,841 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1637188.0, ans=0.0
2023-10-14 07:31:33,857 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1637188.0, ans=0.125
2023-10-14 07:31:34,711 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1637188.0, ans=0.0
2023-10-14 07:31:40,532 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1637188.0, ans=0.125
2023-10-14 07:31:44,036 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1637234.6666666667, ans=0.125
2023-10-14 07:31:45,992 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1637234.6666666667, ans=0.125
2023-10-14 07:31:46,892 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1637234.6666666667, ans=0.125
2023-10-14 07:31:49,370 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1637234.6666666667, ans=0.0
2023-10-14 07:32:03,587 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1637281.3333333333, ans=0.125
2023-10-14 07:32:12,082 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1637328.0, ans=0.125
2023-10-14 07:32:16,534 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1637328.0, ans=0.125
2023-10-14 07:32:34,301 INFO [train.py:1031] (3/4) Epoch 26, batch 9500, loss[loss=0.179, simple_loss=0.2703, pruned_loss=0.04382, over 16440.00 frames. ], tot_loss[loss=0.1864, simple_loss=0.2791, pruned_loss=0.04687, over 32542300.40 frames. ], batch size: 50, lr: 1.32e-03, grad_scale: 32.0
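
The per-batch summaries make it straightforward to scrape a loss curve out of a log like this one. A small hypothetical helper follows; the regex mirrors the train.py:1031 lines above, but the script itself is ours, not part of icefall, and the log filename is a placeholder:

    import re

    # Matches: "Epoch 26, batch 9500, loss[...], tot_loss[loss=0.1864, ..."
    SUMMARY = re.compile(r"Epoch (\d+), batch (\d+), .*?tot_loss\[loss=([\d.]+)")

    def tot_loss_curve(log_path: str):
        """Yield (epoch, batch, tot_loss) from train.py batch summaries."""
        with open(log_path) as f:
            for line in f:
                m = SUMMARY.search(line)
                if m:
                    yield int(m.group(1)), int(m.group(2)), float(m.group(3))

    # Example usage (hypothetical filename):
    # for epoch, batch, loss in tot_loss_curve("train.log"):
    #     print(epoch, batch, loss)
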
], batch size: 50, lr: 1.32e-03, grad_scale: 32.0 2023-10-14 07:32:42,065 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.10 vs. limit=15.0 2023-10-14 07:32:48,107 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1637468.0, ans=0.2 2023-10-14 07:32:48,180 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.42 vs. limit=22.5 2023-10-14 07:33:04,749 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.528e+02 1.857e+02 2.010e+02 2.230e+02 3.118e+02, threshold=4.020e+02, percent-clipped=0.0 2023-10-14 07:33:10,703 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.17 vs. limit=22.5 2023-10-14 07:33:11,293 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1637561.3333333333, ans=0.1 2023-10-14 07:33:12,428 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1637561.3333333333, ans=0.125 2023-10-14 07:33:15,203 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1637561.3333333333, ans=0.0 2023-10-14 07:34:07,012 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1637794.6666666667, ans=0.0 2023-10-14 07:34:10,421 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1637794.6666666667, ans=0.0 2023-10-14 07:34:22,712 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1637888.0, ans=0.2 2023-10-14 07:34:36,082 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1637934.6666666667, ans=0.125 2023-10-14 07:34:37,207 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1637934.6666666667, ans=0.125 2023-10-14 07:34:39,253 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1637934.6666666667, ans=0.0 2023-10-14 07:34:42,319 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1637934.6666666667, ans=0.1 2023-10-14 07:34:47,154 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.00 vs. limit=22.5 2023-10-14 07:34:55,644 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.523e+02 1.830e+02 1.975e+02 2.227e+02 3.313e+02, threshold=3.951e+02, percent-clipped=0.0 2023-10-14 07:35:04,049 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.02 vs. 
limit=12.0 2023-10-14 07:35:09,470 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1638074.6666666667, ans=0.95 2023-10-14 07:35:10,170 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1638074.6666666667, ans=0.2 2023-10-14 07:35:11,841 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=1638074.6666666667, ans=0.05 2023-10-14 07:35:15,860 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.62 vs. limit=6.0 2023-10-14 07:35:44,323 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1638214.6666666667, ans=0.0 2023-10-14 07:35:49,184 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.28 vs. limit=10.0 2023-10-14 07:35:53,372 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1638214.6666666667, ans=0.125 2023-10-14 07:35:59,979 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.28 vs. limit=6.0 2023-10-14 07:36:10,801 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1638308.0, ans=0.0 2023-10-14 07:36:22,592 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1638354.6666666667, ans=0.04949747468305833 2023-10-14 07:36:28,185 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1638401.3333333333, ans=0.0 2023-10-14 07:36:34,406 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=5.59 vs. limit=15.0 2023-10-14 07:36:37,862 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1638448.0, ans=0.0 2023-10-14 07:36:47,009 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.525e+02 1.883e+02 2.041e+02 2.312e+02 4.063e+02, threshold=4.082e+02, percent-clipped=1.0 2023-10-14 07:37:04,184 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.21 vs. limit=22.5 2023-10-14 07:37:04,722 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1638541.3333333333, ans=0.2 2023-10-14 07:37:17,258 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1638588.0, ans=0.0 2023-10-14 07:37:39,268 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.06 vs. 
limit=15.0 2023-10-14 07:38:09,205 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1638821.3333333333, ans=0.125 2023-10-14 07:38:12,524 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1638821.3333333333, ans=0.2 2023-10-14 07:38:13,441 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1638821.3333333333, ans=0.1 2023-10-14 07:38:16,107 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1638821.3333333333, ans=0.125 2023-10-14 07:38:39,889 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.398e+02 1.796e+02 1.968e+02 2.217e+02 3.431e+02, threshold=3.936e+02, percent-clipped=0.0 2023-10-14 07:39:03,898 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1639054.6666666667, ans=0.125 2023-10-14 07:39:39,460 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1639194.6666666667, ans=0.0 2023-10-14 07:39:57,669 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1639288.0, ans=0.125 2023-10-14 07:40:07,493 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=1639288.0, ans=0.5 2023-10-14 07:40:09,928 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1639334.6666666667, ans=0.1 2023-10-14 07:40:11,650 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1639334.6666666667, ans=0.125 2023-10-14 07:40:11,825 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1639334.6666666667, ans=0.125 2023-10-14 07:40:15,054 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1639334.6666666667, ans=0.0 2023-10-14 07:40:28,328 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=10.95 vs. 
limit=22.5 2023-10-14 07:40:30,119 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.411e+02 1.890e+02 2.050e+02 2.341e+02 3.194e+02, threshold=4.100e+02, percent-clipped=0.0 2023-10-14 07:40:34,621 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1639428.0, ans=0.1 2023-10-14 07:40:47,470 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1639474.6666666667, ans=0.125 2023-10-14 07:41:00,326 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1639521.3333333333, ans=0.0 2023-10-14 07:41:05,008 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1639568.0, ans=0.125 2023-10-14 07:41:12,916 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1639614.6666666667, ans=0.125 2023-10-14 07:41:13,166 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1639614.6666666667, ans=10.0 2023-10-14 07:41:28,387 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1639661.3333333333, ans=0.125 2023-10-14 07:41:38,499 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1639708.0, ans=0.125 2023-10-14 07:41:38,548 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1639708.0, ans=0.0 2023-10-14 07:41:43,332 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1639708.0, ans=0.0 2023-10-14 07:41:44,969 INFO [train.py:1031] (3/4) Epoch 26, batch 10000, loss[loss=0.1908, simple_loss=0.285, pruned_loss=0.04825, over 16877.00 frames. ], tot_loss[loss=0.1858, simple_loss=0.2782, pruned_loss=0.04673, over 32575111.57 frames. 
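The `train.py:1031` record above reports two views of the objective: `loss[...]` for the current batch (here loss=0.1908 over 16877 frames) and `tot_loss[...]` as a running, frames-weighted aggregate (here over 32575111.57 frames). A minimal sketch of how such a running value could be maintained follows; the class name and the exponential-forgetting constant are assumptions for illustration, not the actual icefall bookkeeping.

```python
# Hypothetical sketch of the bookkeeping behind "tot_loss[...] over N frames":
# per-batch losses are accumulated together with the frames they cover, with
# exponential forgetting so old batches fade out, and the reported value is the
# frames-weighted average. An illustration only, not the icefall source.
class RunningLoss:
    def __init__(self, decay: float = 0.9995):  # forgetting factor: an assumption
        self.decay = decay
        self.loss_sum = 0.0  # decayed sum of (per-frame loss * frames)
        self.frames = 0.0    # decayed frame count, cf. "over N frames"

    def update(self, batch_loss: float, batch_frames: float) -> None:
        # Fade out history, then fold in the new batch.
        self.loss_sum = self.loss_sum * self.decay + batch_loss * batch_frames
        self.frames = self.frames * self.decay + batch_frames

    @property
    def value(self) -> float:
        return self.loss_sum / max(self.frames, 1.0)


tracker = RunningLoss()
tracker.update(batch_loss=0.1908, batch_frames=16877.0)  # batch-10000 figures above
print(f"tot_loss={tracker.value:.4f} over {tracker.frames:.2f} frames")
```

Decayed accumulation of this kind would also explain the fractional frame counts in the log; a plain sum of whole frames would stay integral.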
], batch size: 110, lr: 1.31e-03, grad_scale: 16.0 2023-10-14 07:41:58,805 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1639801.3333333333, ans=0.125 2023-10-14 07:42:01,765 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1639801.3333333333, ans=0.0 2023-10-14 07:42:18,179 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.535e+02 1.836e+02 1.952e+02 2.087e+02 2.619e+02, threshold=3.905e+02, percent-clipped=0.0 2023-10-14 07:42:30,021 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1639941.3333333333, ans=0.0 2023-10-14 07:42:45,688 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1639988.0, ans=0.125 2023-10-14 07:42:51,472 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1640034.6666666667, ans=0.0 2023-10-14 07:42:54,819 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1640034.6666666667, ans=0.125 2023-10-14 07:43:04,360 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.67 vs. limit=15.0 2023-10-14 07:43:05,490 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1640081.3333333333, ans=0.125 2023-10-14 07:43:34,103 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1640174.6666666667, ans=0.125 2023-10-14 07:43:52,973 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1640268.0, ans=0.125 2023-10-14 07:43:54,907 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1640268.0, ans=0.1 2023-10-14 07:44:00,486 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1640314.6666666667, ans=0.0 2023-10-14 07:44:10,653 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1640361.3333333333, ans=0.2 2023-10-14 07:44:13,812 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.538e+02 1.803e+02 1.982e+02 2.219e+02 2.772e+02, threshold=3.964e+02, percent-clipped=0.0 2023-10-14 07:44:21,560 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1640408.0, ans=0.125 2023-10-14 07:44:33,620 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.38 vs. limit=12.0 2023-10-14 07:44:46,580 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.19 vs. 
limit=12.0 2023-10-14 07:44:49,819 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1640501.3333333333, ans=0.0 2023-10-14 07:44:53,944 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1640548.0, ans=0.125 2023-10-14 07:45:00,446 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1640548.0, ans=0.1 2023-10-14 07:45:03,753 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.17 vs. limit=15.0 2023-10-14 07:45:30,693 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1640688.0, ans=0.125 2023-10-14 07:45:31,921 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1640688.0, ans=0.125 2023-10-14 07:45:55,972 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1640781.3333333333, ans=0.125 2023-10-14 07:46:05,955 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_na.min_abs, batch_count=1640828.0, ans=0.02 2023-10-14 07:46:09,148 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.563e+02 1.856e+02 2.039e+02 2.307e+02 3.070e+02, threshold=4.077e+02, percent-clipped=0.0 2023-10-14 07:46:32,376 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1640921.3333333333, ans=0.1 2023-10-14 07:46:33,297 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1640921.3333333333, ans=0.125 2023-10-14 07:46:36,189 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.65 vs. limit=15.0 2023-10-14 07:46:44,273 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.86 vs. limit=15.0 2023-10-14 07:46:52,904 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1641014.6666666667, ans=0.0 2023-10-14 07:47:04,946 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1641061.3333333333, ans=0.0 2023-10-14 07:47:15,075 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.82 vs. limit=22.5 2023-10-14 07:47:48,285 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.46 vs. 
limit=22.5 2023-10-14 07:48:02,035 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1641294.6666666667, ans=0.0 2023-10-14 07:48:05,111 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.475e+02 1.768e+02 1.967e+02 2.199e+02 3.248e+02, threshold=3.934e+02, percent-clipped=0.0 2023-10-14 07:48:49,171 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1641481.3333333333, ans=0.035 2023-10-14 07:48:49,537 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.38 vs. limit=15.0 2023-10-14 07:49:14,311 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1641574.6666666667, ans=0.125 2023-10-14 07:49:40,339 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1641668.0, ans=0.1 2023-10-14 07:49:59,644 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1641761.3333333333, ans=0.0 2023-10-14 07:50:01,910 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.582e+02 1.846e+02 2.087e+02 2.315e+02 3.353e+02, threshold=4.174e+02, percent-clipped=0.0 2023-10-14 07:50:09,354 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=3.78 vs. limit=10.0 2023-10-14 07:50:28,069 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1641854.6666666667, ans=0.0 2023-10-14 07:50:55,166 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1641994.6666666667, ans=0.1 2023-10-14 07:51:05,842 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1642041.3333333333, ans=0.125 2023-10-14 07:51:12,152 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1642041.3333333333, ans=0.125 2023-10-14 07:51:17,730 INFO [train.py:1031] (3/4) Epoch 26, batch 10500, loss[loss=0.1837, simple_loss=0.2826, pruned_loss=0.04238, over 16865.00 frames. ], tot_loss[loss=0.1862, simple_loss=0.2788, pruned_loss=0.0468, over 32637918.28 frames. 
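The periodic `optim.py:471` lines track the gradient-norm distribution used for clipping: the five values are the min, 25%, median, 75%, and max of recently observed norms, and in every record here the threshold equals `Clipping_scale` times the median (for instance 2.0 * 1.967e+02 = 3.934e+02 in the 07:48:05 record above), with `percent-clipped` appearing to be the share of recent steps whose norm exceeded it. The sketch below only illustrates that relationship; the function is hypothetical, not ScaledAdam's implementation.

```python
import torch

# Hypothetical illustration of the rule the optim.py records suggest:
# threshold = Clipping_scale * median(recent gradient norms).
def grad_clip_factor(grad_norm: float,
                     recent_norms: torch.Tensor,
                     clipping_scale: float = 2.0) -> float:
    # Same five quantiles the log prints: min, 25%, median, 75%, max.
    q = torch.quantile(recent_norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    threshold = clipping_scale * q[2].item()  # scale times the median
    # A factor below 1.0 means this step's gradients get scaled down (clipped).
    return min(1.0, threshold / (grad_norm + 1e-20))


norms = torch.tensor([147.5, 176.8, 196.7, 219.9, 324.8])  # quartiles from 07:48:05
print(grad_clip_factor(grad_norm=380.0, recent_norms=norms))  # under 393.4 -> 1.0
```

`percent-clipped=0.0` throughout this window is consistent with the observed maxima sitting below the reported thresholds.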
], batch size: 72, lr: 1.31e-03, grad_scale: 8.0 2023-10-14 07:51:17,997 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1642088.0, ans=0.125 2023-10-14 07:51:26,021 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 07:51:26,033 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1642088.0, ans=0.0 2023-10-14 07:51:29,832 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1642134.6666666667, ans=0.1 2023-10-14 07:51:46,546 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 07:51:51,494 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.458e+02 1.769e+02 1.926e+02 2.128e+02 3.242e+02, threshold=3.852e+02, percent-clipped=0.0 2023-10-14 07:51:54,680 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1642228.0, ans=10.0 2023-10-14 07:52:11,665 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1642321.3333333333, ans=0.0 2023-10-14 07:52:12,729 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1642321.3333333333, ans=0.125 2023-10-14 07:52:14,228 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1642321.3333333333, ans=0.0 2023-10-14 07:52:35,420 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1642414.6666666667, ans=0.0 2023-10-14 07:52:48,515 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1642414.6666666667, ans=0.0 2023-10-14 07:52:52,938 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.33 vs. limit=15.0 2023-10-14 07:52:54,980 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.61 vs. limit=22.5 2023-10-14 07:53:09,037 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.52 vs. limit=15.0 2023-10-14 07:53:11,746 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.43 vs. limit=15.0 2023-10-14 07:53:33,569 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1642601.3333333333, ans=0.2 2023-10-14 07:53:37,291 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.05 vs. 
limit=15.0 2023-10-14 07:53:39,234 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1642648.0, ans=0.0 2023-10-14 07:54:00,631 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.566e+02 1.832e+02 2.023e+02 2.207e+02 2.917e+02, threshold=4.047e+02, percent-clipped=0.0 2023-10-14 07:54:21,731 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1642788.0, ans=0.1 2023-10-14 07:54:29,959 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1642834.6666666667, ans=0.125 2023-10-14 07:54:37,965 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1642834.6666666667, ans=0.0 2023-10-14 07:54:58,679 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1642928.0, ans=0.125 2023-10-14 07:55:02,258 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1642928.0, ans=0.125 2023-10-14 07:55:10,785 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1642974.6666666667, ans=0.125 2023-10-14 07:55:12,786 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1642974.6666666667, ans=0.125 2023-10-14 07:55:15,509 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1643021.3333333333, ans=0.07 2023-10-14 07:55:25,430 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1643021.3333333333, ans=0.125 2023-10-14 07:55:37,146 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1643114.6666666667, ans=0.125 2023-10-14 07:55:37,348 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.89 vs. limit=15.0 2023-10-14 07:55:39,777 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1643114.6666666667, ans=0.0 2023-10-14 07:55:42,709 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=1643114.6666666667, ans=10.0 2023-10-14 07:55:46,571 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=13.38 vs. 
limit=15.0 2023-10-14 07:55:51,170 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1643161.3333333333, ans=0.0 2023-10-14 07:55:53,694 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.580e+02 1.821e+02 1.959e+02 2.139e+02 2.735e+02, threshold=3.918e+02, percent-clipped=0.0 2023-10-14 07:55:55,146 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1643161.3333333333, ans=0.125 2023-10-14 07:55:57,060 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1643161.3333333333, ans=0.1 2023-10-14 07:56:06,518 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1643208.0, ans=0.125 2023-10-14 07:56:24,739 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1643301.3333333333, ans=0.125 2023-10-14 07:56:25,870 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1643301.3333333333, ans=0.125 2023-10-14 07:56:31,282 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.10 vs. limit=15.0 2023-10-14 07:56:35,765 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1643348.0, ans=0.125 2023-10-14 07:56:51,568 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=1643394.6666666667, ans=15.0 2023-10-14 07:56:56,674 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1643394.6666666667, ans=0.1 2023-10-14 07:57:41,694 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1643628.0, ans=0.2 2023-10-14 07:57:44,087 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.690e+02 1.874e+02 2.044e+02 2.319e+02 3.357e+02, threshold=4.087e+02, percent-clipped=0.0 2023-10-14 07:57:50,080 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1643674.6666666667, ans=0.0 2023-10-14 07:58:05,726 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.62 vs. limit=15.0 2023-10-14 07:58:18,463 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1643768.0, ans=0.035 2023-10-14 07:58:21,637 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=5.80 vs. 
limit=15.0 2023-10-14 07:58:29,223 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1643814.6666666667, ans=0.0 2023-10-14 07:58:30,263 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1643814.6666666667, ans=0.125 2023-10-14 07:58:37,179 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1643861.3333333333, ans=0.2 2023-10-14 07:58:38,141 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1643861.3333333333, ans=0.1 2023-10-14 07:58:43,602 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1643861.3333333333, ans=0.125 2023-10-14 07:58:50,143 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1643908.0, ans=0.0 2023-10-14 07:58:51,833 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1643908.0, ans=0.0 2023-10-14 07:59:10,886 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1644001.3333333333, ans=0.125 2023-10-14 07:59:19,710 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1644048.0, ans=0.125 2023-10-14 07:59:31,309 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.481e+02 1.804e+02 1.970e+02 2.179e+02 3.025e+02, threshold=3.940e+02, percent-clipped=0.0 2023-10-14 07:59:38,100 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1644094.6666666667, ans=0.2 2023-10-14 07:59:40,364 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 07:59:42,543 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.48 vs. limit=15.0 2023-10-14 07:59:49,059 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.24 vs. limit=15.0 2023-10-14 07:59:52,873 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1644188.0, ans=0.07 2023-10-14 07:59:54,011 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1644188.0, ans=0.0 2023-10-14 08:00:06,621 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=1644234.6666666667, ans=0.025 2023-10-14 08:00:09,835 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1644234.6666666667, ans=0.125 2023-10-14 08:00:18,812 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.63 vs. 
limit=15.0 2023-10-14 08:00:32,410 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=5.074e-03 2023-10-14 08:00:36,075 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1644374.6666666667, ans=0.125 2023-10-14 08:00:38,963 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.53 vs. limit=22.5 2023-10-14 08:00:43,678 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.26 vs. limit=15.0 2023-10-14 08:00:44,125 INFO [train.py:1031] (3/4) Epoch 26, batch 11000, loss[loss=0.1788, simple_loss=0.2742, pruned_loss=0.0417, over 16628.00 frames. ], tot_loss[loss=0.1862, simple_loss=0.2787, pruned_loss=0.0468, over 32682892.88 frames. ], batch size: 66, lr: 1.31e-03, grad_scale: 16.0 2023-10-14 08:00:56,629 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1644468.0, ans=0.125 2023-10-14 08:01:20,118 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.595e+02 1.881e+02 2.042e+02 2.239e+02 3.231e+02, threshold=4.085e+02, percent-clipped=0.0 2023-10-14 08:01:54,069 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1644701.3333333333, ans=0.5 2023-10-14 08:02:10,452 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.69 vs. limit=15.0 2023-10-14 08:02:13,535 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1644748.0, ans=0.125 2023-10-14 08:02:20,894 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1644794.6666666667, ans=0.125 2023-10-14 08:02:25,731 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1644841.3333333333, ans=0.0 2023-10-14 08:02:45,132 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1644888.0, ans=0.07 2023-10-14 08:02:58,386 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.91 vs. limit=15.0 2023-10-14 08:03:04,267 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1644934.6666666667, ans=0.2 2023-10-14 08:03:04,579 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.16 vs. 
limit=15.0 2023-10-14 08:03:10,780 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1644981.3333333333, ans=0.125 2023-10-14 08:03:22,883 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.454e+02 1.812e+02 1.960e+02 2.191e+02 2.946e+02, threshold=3.920e+02, percent-clipped=0.0 2023-10-14 08:03:49,227 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1645121.3333333333, ans=0.125 2023-10-14 08:04:02,471 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1645168.0, ans=0.125 2023-10-14 08:04:03,437 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1645168.0, ans=0.1 2023-10-14 08:04:07,385 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1645214.6666666667, ans=0.1 2023-10-14 08:04:17,667 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1645261.3333333333, ans=0.1 2023-10-14 08:04:31,907 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1645308.0, ans=0.025 2023-10-14 08:04:38,530 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.29 vs. limit=15.0 2023-10-14 08:04:40,317 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.31 vs. limit=15.0 2023-10-14 08:04:43,136 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.55 vs. limit=15.0 2023-10-14 08:04:44,428 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1645354.6666666667, ans=0.125 2023-10-14 08:05:14,579 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.469e+02 1.744e+02 1.943e+02 2.090e+02 2.856e+02, threshold=3.886e+02, percent-clipped=0.0 2023-10-14 08:05:16,252 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1645494.6666666667, ans=0.0 2023-10-14 08:05:17,699 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=11.87 vs. limit=15.0 2023-10-14 08:05:23,195 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1645541.3333333333, ans=0.0 2023-10-14 08:05:29,615 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1645541.3333333333, ans=0.125 2023-10-14 08:05:32,617 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1645588.0, ans=0.1 2023-10-14 08:05:56,192 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.44 vs. 
limit=15.0 2023-10-14 08:06:35,700 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1645821.3333333333, ans=0.125 2023-10-14 08:06:55,658 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1645914.6666666667, ans=0.0 2023-10-14 08:06:59,211 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.78 vs. limit=10.0 2023-10-14 08:07:06,673 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.504e+02 1.788e+02 1.937e+02 2.194e+02 3.413e+02, threshold=3.875e+02, percent-clipped=0.0 2023-10-14 08:07:24,085 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1646008.0, ans=0.0 2023-10-14 08:07:29,252 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.72 vs. limit=15.0 2023-10-14 08:07:30,847 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1646054.6666666667, ans=0.125 2023-10-14 08:07:30,865 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1646054.6666666667, ans=0.125 2023-10-14 08:07:37,683 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.81 vs. limit=15.0 2023-10-14 08:07:49,132 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1646101.3333333333, ans=0.0 2023-10-14 08:07:57,662 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1646148.0, ans=0.125 2023-10-14 08:08:01,623 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 08:08:10,151 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1646194.6666666667, ans=0.125 2023-10-14 08:08:24,206 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.69 vs. limit=22.5 2023-10-14 08:08:37,244 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1646334.6666666667, ans=0.2 2023-10-14 08:08:47,629 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1646381.3333333333, ans=0.2 2023-10-14 08:08:59,583 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.492e+02 1.863e+02 2.076e+02 2.399e+02 3.621e+02, threshold=4.152e+02, percent-clipped=0.0 2023-10-14 08:09:19,793 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.34 vs. 
limit=12.0 2023-10-14 08:09:34,733 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys.whitening_limit, batch_count=1646568.0, ans=6.0 2023-10-14 08:09:54,244 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1646661.3333333333, ans=0.125 2023-10-14 08:10:05,858 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1646708.0, ans=0.125 2023-10-14 08:10:13,079 INFO [train.py:1031] (3/4) Epoch 26, batch 11500, loss[loss=0.1878, simple_loss=0.2794, pruned_loss=0.04804, over 16869.00 frames. ], tot_loss[loss=0.1858, simple_loss=0.2783, pruned_loss=0.04669, over 32693330.18 frames. ], batch size: 77, lr: 1.31e-03, grad_scale: 32.0 2023-10-14 08:10:17,530 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1646754.6666666667, ans=0.125 2023-10-14 08:10:23,721 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1646801.3333333333, ans=0.125 2023-10-14 08:10:26,381 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1646801.3333333333, ans=0.0 2023-10-14 08:10:30,161 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.77 vs. limit=22.5 2023-10-14 08:10:33,536 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1646848.0, ans=0.0 2023-10-14 08:10:44,563 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.43 vs. limit=22.5 2023-10-14 08:10:48,108 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.572e+02 1.928e+02 2.084e+02 2.340e+02 3.068e+02, threshold=4.168e+02, percent-clipped=0.0 2023-10-14 08:11:06,128 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.28 vs. limit=15.0 2023-10-14 08:11:08,962 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1646941.3333333333, ans=0.125 2023-10-14 08:11:16,853 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1646988.0, ans=0.0 2023-10-14 08:11:35,182 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1647081.3333333333, ans=0.1 2023-10-14 08:11:37,075 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1647081.3333333333, ans=0.125 2023-10-14 08:11:41,027 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1647081.3333333333, ans=0.1 2023-10-14 08:11:54,822 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.28 vs. limit=15.0 2023-10-14 08:12:12,355 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.87 vs. 
limit=22.5 2023-10-14 08:12:19,072 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1647221.3333333333, ans=0.2 2023-10-14 08:12:26,912 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.76 vs. limit=6.0 2023-10-14 08:12:39,046 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=1647314.6666666667, ans=0.025 2023-10-14 08:12:45,910 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.493e+02 1.725e+02 1.859e+02 2.083e+02 2.852e+02, threshold=3.718e+02, percent-clipped=0.0 2023-10-14 08:13:05,409 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1647454.6666666667, ans=0.0 2023-10-14 08:13:25,174 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1647548.0, ans=0.0 2023-10-14 08:13:32,015 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.57 vs. limit=15.0 2023-10-14 08:13:32,921 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=1647594.6666666667, ans=10.0 2023-10-14 08:13:34,899 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1647594.6666666667, ans=0.0 2023-10-14 08:13:54,068 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1647688.0, ans=0.0 2023-10-14 08:14:15,310 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.17 vs. limit=15.0 2023-10-14 08:14:19,083 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.54 vs. limit=15.0 2023-10-14 08:14:20,755 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1647781.3333333333, ans=0.1 2023-10-14 08:14:28,864 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1647828.0, ans=0.125 2023-10-14 08:14:30,856 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.516e+02 1.793e+02 1.928e+02 2.096e+02 2.709e+02, threshold=3.857e+02, percent-clipped=0.0 2023-10-14 08:14:58,236 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1647921.3333333333, ans=0.0 2023-10-14 08:15:55,632 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.04 vs. limit=15.0 2023-10-14 08:16:04,618 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1648154.6666666667, ans=0.125 2023-10-14 08:16:20,156 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1648201.3333333333, ans=0.0 2023-10-14 08:16:29,662 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.90 vs. 
limit=15.0 2023-10-14 08:16:39,268 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.405e+02 1.781e+02 1.932e+02 2.171e+02 3.419e+02, threshold=3.864e+02, percent-clipped=0.0 2023-10-14 08:16:39,666 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1648294.6666666667, ans=0.125 2023-10-14 08:16:40,443 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1648294.6666666667, ans=0.125 2023-10-14 08:16:45,808 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.18 vs. limit=15.0 2023-10-14 08:16:50,478 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1648341.3333333333, ans=0.125 2023-10-14 08:17:30,450 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1648481.3333333333, ans=0.125 2023-10-14 08:17:49,382 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1648574.6666666667, ans=0.0 2023-10-14 08:17:53,271 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 08:17:53,328 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1648574.6666666667, ans=0.0 2023-10-14 08:17:54,250 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1648574.6666666667, ans=0.125 2023-10-14 08:18:01,000 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1648621.3333333333, ans=0.0 2023-10-14 08:18:03,042 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1648621.3333333333, ans=0.0 2023-10-14 08:18:04,657 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1648621.3333333333, ans=0.0 2023-10-14 08:18:12,516 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1648668.0, ans=0.125 2023-10-14 08:18:34,746 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.468e+02 1.804e+02 1.962e+02 2.291e+02 3.258e+02, threshold=3.924e+02, percent-clipped=0.0 2023-10-14 08:18:40,905 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.61 vs. limit=15.0 2023-10-14 08:19:00,172 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.24 vs. 
limit=15.0 2023-10-14 08:19:01,462 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1648854.6666666667, ans=0.1 2023-10-14 08:19:20,561 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1648948.0, ans=0.2 2023-10-14 08:19:40,920 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1649041.3333333333, ans=0.125 2023-10-14 08:19:45,644 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.82 vs. limit=22.5 2023-10-14 08:19:52,692 INFO [train.py:1031] (3/4) Epoch 26, batch 12000, loss[loss=0.1932, simple_loss=0.2854, pruned_loss=0.05054, over 16850.00 frames. ], tot_loss[loss=0.1856, simple_loss=0.2782, pruned_loss=0.04647, over 32710152.22 frames. ], batch size: 188, lr: 1.31e-03, grad_scale: 32.0 2023-10-14 08:19:57,363 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.71 vs. limit=15.0 2023-10-14 08:19:59,452 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1649088.0, ans=0.0 2023-10-14 08:19:59,678 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.03 vs. limit=15.0 2023-10-14 08:20:11,066 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.19 vs. limit=12.0 2023-10-14 08:20:31,313 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.503e+02 1.829e+02 2.016e+02 2.256e+02 2.931e+02, threshold=4.032e+02, percent-clipped=0.0 2023-10-14 08:20:38,133 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1649274.6666666667, ans=0.1 2023-10-14 08:20:50,437 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.85 vs. limit=22.5 2023-10-14 08:21:05,126 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.51 vs. limit=10.0 2023-10-14 08:21:16,764 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=1649414.6666666667, ans=0.05 2023-10-14 08:21:20,171 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.48 vs. limit=12.0 2023-10-14 08:21:22,285 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1649414.6666666667, ans=0.2 2023-10-14 08:21:50,336 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.14 vs. 
limit=22.5 2023-10-14 08:22:01,217 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1649601.3333333333, ans=0.1 2023-10-14 08:22:06,042 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1649648.0, ans=0.0 2023-10-14 08:22:11,582 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1649648.0, ans=0.125 2023-10-14 08:22:15,746 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1649694.6666666667, ans=0.125 2023-10-14 08:22:19,520 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.528e+02 1.761e+02 1.886e+02 2.075e+02 3.034e+02, threshold=3.771e+02, percent-clipped=0.0 2023-10-14 08:22:42,935 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1649788.0, ans=0.125 2023-10-14 08:22:43,927 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1649788.0, ans=0.0 2023-10-14 08:22:46,836 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1649834.6666666667, ans=0.125 2023-10-14 08:22:52,217 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1649834.6666666667, ans=0.1 2023-10-14 08:23:03,826 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 08:24:07,764 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.560e+02 1.875e+02 2.070e+02 2.318e+02 3.312e+02, threshold=4.141e+02, percent-clipped=0.0 2023-10-14 08:24:13,871 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1650208.0, ans=0.1 2023-10-14 08:24:18,204 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1650208.0, ans=0.0 2023-10-14 08:24:23,709 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1650254.6666666667, ans=0.0 2023-10-14 08:24:25,329 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1650254.6666666667, ans=0.125 2023-10-14 08:24:38,994 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1650301.3333333333, ans=0.125 2023-10-14 08:25:14,047 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1650441.3333333333, ans=0.125 2023-10-14 08:25:37,784 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1650534.6666666667, ans=0.2 2023-10-14 08:25:41,704 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.06 vs. 
limit=22.5 2023-10-14 08:25:49,941 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1650581.3333333333, ans=0.04949747468305833 2023-10-14 08:25:55,656 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.509e+02 1.797e+02 1.989e+02 2.228e+02 3.382e+02, threshold=3.979e+02, percent-clipped=0.0 2023-10-14 08:26:14,508 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1650721.3333333333, ans=0.1 2023-10-14 08:26:18,289 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1650721.3333333333, ans=0.125 2023-10-14 08:27:15,029 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1650954.6666666667, ans=0.07 2023-10-14 08:27:24,558 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 08:27:31,742 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1651001.3333333333, ans=0.04949747468305833 2023-10-14 08:27:34,494 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1651048.0, ans=0.5 2023-10-14 08:27:46,285 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.91 vs. limit=15.0 2023-10-14 08:27:48,909 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.623e+02 1.882e+02 2.150e+02 2.420e+02 3.466e+02, threshold=4.301e+02, percent-clipped=0.0 2023-10-14 08:28:42,052 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1651328.0, ans=0.5 2023-10-14 08:28:44,859 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1651328.0, ans=0.2 2023-10-14 08:28:51,386 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 08:29:03,873 INFO [train.py:1031] (3/4) Epoch 26, batch 12500, loss[loss=0.1763, simple_loss=0.2737, pruned_loss=0.03945, over 16927.00 frames. ], tot_loss[loss=0.1855, simple_loss=0.278, pruned_loss=0.04648, over 32742066.02 frames. ], batch size: 77, lr: 1.31e-03, grad_scale: 32.0 2023-10-14 08:29:06,780 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1651421.3333333333, ans=0.125 2023-10-14 08:29:23,001 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1651468.0, ans=0.025 2023-10-14 08:29:25,072 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1651514.6666666667, ans=0.0 2023-10-14 08:29:25,856 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1651514.6666666667, ans=0.125 2023-10-14 08:29:31,448 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.37 vs. 
limit=6.0 2023-10-14 08:29:38,692 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1651561.3333333333, ans=0.2 2023-10-14 08:29:39,949 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.408e+02 1.814e+02 2.010e+02 2.210e+02 3.033e+02, threshold=4.019e+02, percent-clipped=0.0 2023-10-14 08:29:46,363 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.67 vs. limit=6.0 2023-10-14 08:29:57,527 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 08:29:58,386 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1651654.6666666667, ans=0.125 2023-10-14 08:30:04,079 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1651654.6666666667, ans=0.1 2023-10-14 08:30:08,980 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.56 vs. limit=10.0 2023-10-14 08:30:40,122 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1651841.3333333333, ans=0.125 2023-10-14 08:30:41,532 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1651841.3333333333, ans=0.0 2023-10-14 08:30:43,795 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1651841.3333333333, ans=0.125 2023-10-14 08:30:45,499 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1651841.3333333333, ans=0.125 2023-10-14 08:30:55,992 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1651888.0, ans=0.0 2023-10-14 08:30:57,293 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.52 vs. limit=22.5 2023-10-14 08:31:07,106 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.80 vs. 
limit=15.0 2023-10-14 08:31:27,902 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.567e+02 1.851e+02 2.021e+02 2.328e+02 3.098e+02, threshold=4.043e+02, percent-clipped=0.0 2023-10-14 08:31:40,106 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1652074.6666666667, ans=0.125 2023-10-14 08:31:55,028 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1652121.3333333333, ans=0.125 2023-10-14 08:31:58,711 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1652168.0, ans=0.2 2023-10-14 08:31:59,667 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1652168.0, ans=0.125 2023-10-14 08:32:00,903 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=1652168.0, ans=15.0 2023-10-14 08:32:29,028 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1652308.0, ans=0.125 2023-10-14 08:32:31,341 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1652308.0, ans=0.125 2023-10-14 08:32:32,435 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1652308.0, ans=0.125 2023-10-14 08:32:49,953 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1652401.3333333333, ans=0.125 2023-10-14 08:32:59,551 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1652448.0, ans=0.125 2023-10-14 08:33:05,334 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1652448.0, ans=0.125 2023-10-14 08:33:07,318 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1652448.0, ans=0.125 2023-10-14 08:33:13,458 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.510e+02 1.796e+02 1.910e+02 2.080e+02 3.328e+02, threshold=3.819e+02, percent-clipped=0.0 2023-10-14 08:33:13,954 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.39 vs. 
limit=15.0 2023-10-14 08:33:27,531 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1652541.3333333333, ans=0.0 2023-10-14 08:33:37,815 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1652588.0, ans=0.125 2023-10-14 08:33:48,304 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1652634.6666666667, ans=0.0 2023-10-14 08:33:53,266 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1652681.3333333333, ans=0.125 2023-10-14 08:34:23,456 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1652774.6666666667, ans=0.0 2023-10-14 08:34:39,239 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.22 vs. limit=15.0 2023-10-14 08:34:41,148 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.76 vs. limit=12.0 2023-10-14 08:34:52,773 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1652914.6666666667, ans=0.0 2023-10-14 08:35:01,611 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.539e+02 1.823e+02 1.949e+02 2.101e+02 3.083e+02, threshold=3.899e+02, percent-clipped=0.0 2023-10-14 08:35:02,682 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1652961.3333333333, ans=0.015 2023-10-14 08:35:14,128 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1653008.0, ans=0.1 2023-10-14 08:35:58,939 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1653194.6666666667, ans=0.05 2023-10-14 08:36:15,184 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff2.min_abs, batch_count=1653241.3333333333, ans=0.1 2023-10-14 08:36:17,131 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1653241.3333333333, ans=0.035 2023-10-14 08:36:42,033 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1653381.3333333333, ans=0.1 2023-10-14 08:36:53,571 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.610e+02 1.831e+02 2.006e+02 2.265e+02 3.278e+02, threshold=4.012e+02, percent-clipped=0.0 2023-10-14 08:37:02,299 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1653474.6666666667, ans=0.1 2023-10-14 08:37:28,256 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1653568.0, ans=0.2 2023-10-14 08:37:29,471 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=16.20 vs. 
limit=22.5 2023-10-14 08:38:01,609 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1653708.0, ans=0.0 2023-10-14 08:38:03,307 INFO [train.py:1031] (3/4) Epoch 26, batch 13000, loss[loss=0.1831, simple_loss=0.2791, pruned_loss=0.04354, over 16851.00 frames. ], tot_loss[loss=0.186, simple_loss=0.2786, pruned_loss=0.04667, over 32730718.99 frames. ], batch size: 155, lr: 1.31e-03, grad_scale: 32.0 2023-10-14 08:38:07,034 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1653754.6666666667, ans=0.125 2023-10-14 08:38:13,410 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1653801.3333333333, ans=0.0 2023-10-14 08:38:14,117 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1653801.3333333333, ans=0.125 2023-10-14 08:38:14,490 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=11.50 vs. limit=22.5 2023-10-14 08:38:16,997 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1653801.3333333333, ans=0.125 2023-10-14 08:38:45,412 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.571e+02 1.852e+02 2.021e+02 2.157e+02 2.791e+02, threshold=4.043e+02, percent-clipped=0.0 2023-10-14 08:39:12,552 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=3.85 vs. limit=12.0 2023-10-14 08:39:37,653 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1654081.3333333333, ans=0.1 2023-10-14 08:39:38,779 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1654081.3333333333, ans=0.125 2023-10-14 08:39:45,293 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1654128.0, ans=0.0 2023-10-14 08:39:48,488 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.53 vs. limit=15.0 2023-10-14 08:39:50,024 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1654174.6666666667, ans=0.1 2023-10-14 08:40:04,893 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.31 vs. limit=15.0 2023-10-14 08:40:05,670 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.71 vs. limit=12.0 2023-10-14 08:40:13,348 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1654268.0, ans=0.0 2023-10-14 08:40:24,554 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.70 vs. 
limit=6.0 2023-10-14 08:40:40,355 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.594e+02 1.836e+02 1.989e+02 2.242e+02 2.804e+02, threshold=3.978e+02, percent-clipped=0.0 2023-10-14 08:40:57,192 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1654454.6666666667, ans=0.0 2023-10-14 08:41:01,543 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1654454.6666666667, ans=0.125 2023-10-14 08:41:14,513 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1654501.3333333333, ans=0.125 2023-10-14 08:41:15,395 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1654501.3333333333, ans=0.125 2023-10-14 08:41:19,339 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 08:41:28,726 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1654594.6666666667, ans=0.0 2023-10-14 08:42:08,932 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1654734.6666666667, ans=0.025 2023-10-14 08:42:15,097 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1654734.6666666667, ans=0.1 2023-10-14 08:42:34,041 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.386e+02 1.793e+02 1.931e+02 2.135e+02 2.785e+02, threshold=3.863e+02, percent-clipped=0.0 2023-10-14 08:42:47,570 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1654874.6666666667, ans=0.0 2023-10-14 08:43:02,220 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1654968.0, ans=0.05 2023-10-14 08:43:08,110 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1654968.0, ans=0.2 2023-10-14 08:43:08,119 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1654968.0, ans=0.125 2023-10-14 08:43:08,218 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.25 vs. limit=22.5 2023-10-14 08:43:13,525 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.25 vs. limit=15.0 2023-10-14 08:43:27,194 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1655061.3333333333, ans=0.2 2023-10-14 08:43:42,058 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1655108.0, ans=0.125 2023-10-14 08:43:44,091 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1655108.0, ans=0.125 2023-10-14 08:43:53,441 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.11 vs. 
limit=15.0 2023-10-14 08:44:02,899 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1655201.3333333333, ans=0.0 2023-10-14 08:44:03,660 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1655201.3333333333, ans=0.0 2023-10-14 08:44:10,563 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1655248.0, ans=0.125 2023-10-14 08:44:20,348 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1655294.6666666667, ans=0.1 2023-10-14 08:44:24,432 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.499e+02 1.872e+02 2.006e+02 2.303e+02 3.317e+02, threshold=4.012e+02, percent-clipped=0.0 2023-10-14 08:44:31,642 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1655341.3333333333, ans=0.125 2023-10-14 08:44:35,191 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.62 vs. limit=12.0 2023-10-14 08:44:35,716 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1655341.3333333333, ans=0.0 2023-10-14 08:44:48,646 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1655434.6666666667, ans=0.125 2023-10-14 08:44:51,654 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.05 vs. limit=15.0 2023-10-14 08:45:26,027 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.35 vs. limit=15.0 2023-10-14 08:45:43,778 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1655668.0, ans=0.125 2023-10-14 08:45:50,404 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1655668.0, ans=0.125 2023-10-14 08:46:07,081 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.74 vs. 
limit=15.0 2023-10-14 08:46:12,560 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.520e+02 1.814e+02 1.981e+02 2.199e+02 3.897e+02, threshold=3.963e+02, percent-clipped=0.0 2023-10-14 08:46:24,055 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1655808.0, ans=0.125 2023-10-14 08:46:24,146 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1655808.0, ans=0.0 2023-10-14 08:46:26,069 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1655808.0, ans=0.125 2023-10-14 08:46:47,077 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1655901.3333333333, ans=0.125 2023-10-14 08:47:16,298 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1656041.3333333333, ans=0.125 2023-10-14 08:47:16,304 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1656041.3333333333, ans=0.0 2023-10-14 08:47:19,728 INFO [train.py:1031] (3/4) Epoch 26, batch 13500, loss[loss=0.2455, simple_loss=0.3091, pruned_loss=0.09098, over 15588.00 frames. ], tot_loss[loss=0.1855, simple_loss=0.2781, pruned_loss=0.04649, over 32762689.75 frames. ], batch size: 350, lr: 1.31e-03, grad_scale: 16.0 2023-10-14 08:47:23,682 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.29 vs. limit=15.0 2023-10-14 08:47:29,325 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.57 vs. limit=6.0 2023-10-14 08:47:40,243 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1656181.3333333333, ans=0.125 2023-10-14 08:47:48,278 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1656181.3333333333, ans=0.125 2023-10-14 08:47:57,369 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.05 vs. 
limit=15.0 2023-10-14 08:48:00,786 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1656228.0, ans=0.125 2023-10-14 08:48:02,246 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.474e+02 1.752e+02 1.893e+02 2.130e+02 2.733e+02, threshold=3.785e+02, percent-clipped=0.0 2023-10-14 08:48:25,062 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1656321.3333333333, ans=0.1 2023-10-14 08:48:26,893 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1656368.0, ans=0.1 2023-10-14 08:48:41,326 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1656414.6666666667, ans=0.0 2023-10-14 08:48:49,760 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=1656461.3333333333, ans=15.0 2023-10-14 08:49:12,687 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_positive, batch_count=1656554.6666666667, ans=0.05 2023-10-14 08:49:30,285 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.44 vs. limit=22.5 2023-10-14 08:49:40,099 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1656694.6666666667, ans=0.125 2023-10-14 08:49:40,836 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 08:49:45,773 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.508e+02 1.847e+02 1.971e+02 2.185e+02 3.851e+02, threshold=3.942e+02, percent-clipped=1.0 2023-10-14 08:49:46,379 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.28 vs. limit=22.5 2023-10-14 08:49:50,408 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.76 vs. limit=15.0 2023-10-14 08:50:28,542 INFO [train.py:1031] (3/4) Epoch 27, batch 0, loss[loss=0.161, simple_loss=0.2519, pruned_loss=0.03507, over 16956.00 frames. ], tot_loss[loss=0.161, simple_loss=0.2519, pruned_loss=0.03507, over 16956.00 frames. ], batch size: 82, lr: 1.28e-03, grad_scale: 32.0 2023-10-14 08:50:28,543 INFO [train.py:1054] (3/4) Computing validation loss 2023-10-14 08:50:36,529 INFO [train.py:1063] (3/4) Epoch 27, validation: loss=0.2135, simple_loss=0.2999, pruned_loss=0.06353, over 1020973.00 frames. 2023-10-14 08:50:36,529 INFO [train.py:1064] (3/4) Maximum memory allocated so far is 16953MB 2023-10-14 08:50:39,266 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1656811.3333333333, ans=0.1 2023-10-14 08:51:08,013 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.79 vs. 
limit=10.0 2023-10-14 08:51:08,624 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1656904.6666666667, ans=0.125 2023-10-14 08:51:09,466 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1656904.6666666667, ans=0.1 2023-10-14 08:51:12,703 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1656951.3333333333, ans=0.0 2023-10-14 08:51:17,392 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1656951.3333333333, ans=0.125 2023-10-14 08:51:38,939 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1657044.6666666667, ans=0.125 2023-10-14 08:51:40,247 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1657044.6666666667, ans=0.125 2023-10-14 08:51:41,584 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1657044.6666666667, ans=0.2 2023-10-14 08:51:45,491 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.29 vs. limit=15.0 2023-10-14 08:51:55,024 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.59 vs. limit=12.0 2023-10-14 08:52:00,982 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1657138.0, ans=0.04949747468305833 2023-10-14 08:52:07,180 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.448e+02 1.793e+02 1.961e+02 2.210e+02 4.315e+02, threshold=3.921e+02, percent-clipped=1.0 2023-10-14 08:52:32,352 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1657278.0, ans=0.125 2023-10-14 08:52:32,368 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1657278.0, ans=0.125 2023-10-14 08:52:34,090 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1657278.0, ans=0.2 2023-10-14 08:52:34,274 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=5.04 vs. limit=15.0 2023-10-14 08:53:02,263 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.94 vs. 
limit=15.0 2023-10-14 08:53:10,536 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1657464.6666666667, ans=0.125 2023-10-14 08:53:21,477 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1657511.3333333333, ans=0.125 2023-10-14 08:53:22,381 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1657511.3333333333, ans=0.125 2023-10-14 08:53:31,821 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1657558.0, ans=0.0 2023-10-14 08:53:32,871 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.50 vs. limit=10.0 2023-10-14 08:53:42,579 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 08:53:44,481 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1657604.6666666667, ans=0.125 2023-10-14 08:53:45,357 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1657604.6666666667, ans=0.125 2023-10-14 08:53:55,293 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.69 vs. limit=15.0 2023-10-14 08:53:57,073 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.524e+02 1.812e+02 1.998e+02 2.247e+02 3.365e+02, threshold=3.995e+02, percent-clipped=0.0 2023-10-14 08:54:00,902 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1657651.3333333333, ans=0.125 2023-10-14 08:54:02,860 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1657698.0, ans=0.0 2023-10-14 08:54:06,202 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1657698.0, ans=0.1 2023-10-14 08:54:35,334 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1657838.0, ans=0.125 2023-10-14 08:54:35,646 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.90 vs. limit=15.0 2023-10-14 08:54:41,282 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1657838.0, ans=0.1 2023-10-14 08:55:13,010 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.36 vs. limit=15.0 2023-10-14 08:55:19,169 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.75 vs. 
limit=15.0 2023-10-14 08:55:26,234 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1658024.6666666667, ans=0.1 2023-10-14 08:55:32,129 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1658024.6666666667, ans=0.1 2023-10-14 08:55:47,163 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.534e+02 1.818e+02 1.993e+02 2.271e+02 3.593e+02, threshold=3.987e+02, percent-clipped=0.0 2023-10-14 08:55:49,324 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1658118.0, ans=0.0 2023-10-14 08:55:56,655 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1658164.6666666667, ans=0.125 2023-10-14 08:55:59,881 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1658164.6666666667, ans=0.95 2023-10-14 08:56:28,154 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1658304.6666666667, ans=0.125 2023-10-14 08:56:50,638 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1658398.0, ans=0.125 2023-10-14 08:56:51,476 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_positive, batch_count=1658398.0, ans=0.05 2023-10-14 08:56:52,359 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1658398.0, ans=0.05 2023-10-14 08:56:54,342 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.98 vs. limit=15.0 2023-10-14 08:57:03,161 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=1658444.6666666667, ans=15.0 2023-10-14 08:57:20,336 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1658538.0, ans=0.0 2023-10-14 08:57:24,009 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1658538.0, ans=0.0 2023-10-14 08:57:24,861 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1658538.0, ans=0.125 2023-10-14 08:57:31,373 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1658584.6666666667, ans=0.2 2023-10-14 08:57:31,666 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.03 vs. 
limit=22.5 2023-10-14 08:57:33,140 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.474e+02 1.869e+02 2.061e+02 2.333e+02 3.241e+02, threshold=4.121e+02, percent-clipped=0.0 2023-10-14 08:57:48,445 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1658678.0, ans=0.2 2023-10-14 08:57:55,275 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1658678.0, ans=10.0 2023-10-14 08:58:05,823 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1658724.6666666667, ans=0.0 2023-10-14 08:58:06,518 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1658724.6666666667, ans=0.125 2023-10-14 08:58:25,484 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1658818.0, ans=0.0 2023-10-14 08:58:41,062 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=5.99 vs. limit=15.0 2023-10-14 08:58:55,539 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1658958.0, ans=0.125 2023-10-14 08:58:56,041 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.58 vs. limit=15.0 2023-10-14 08:59:13,197 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1659004.6666666667, ans=0.0 2023-10-14 08:59:22,443 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.637e+02 1.839e+02 2.011e+02 2.290e+02 3.044e+02, threshold=4.022e+02, percent-clipped=0.0 2023-10-14 08:59:35,767 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1659098.0, ans=0.125 2023-10-14 08:59:40,370 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.41 vs. limit=22.5 2023-10-14 08:59:40,868 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1659144.6666666667, ans=0.1 2023-10-14 08:59:41,467 INFO [train.py:1031] (3/4) Epoch 27, batch 500, loss[loss=0.171, simple_loss=0.2657, pruned_loss=0.0382, over 16951.00 frames. ], tot_loss[loss=0.1862, simple_loss=0.2787, pruned_loss=0.04686, over 7309791.52 frames. ], batch size: 77, lr: 1.28e-03, grad_scale: 32.0 2023-10-14 09:00:06,882 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1659238.0, ans=0.125 2023-10-14 09:00:13,885 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1659284.6666666667, ans=0.125 2023-10-14 09:00:16,645 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.17 vs. limit=22.5 2023-10-14 09:00:18,346 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1659284.6666666667, ans=0.0 2023-10-14 09:00:22,316 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.41 vs. 
limit=15.0 2023-10-14 09:00:32,323 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1659331.3333333333, ans=0.125 2023-10-14 09:00:34,703 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1659378.0, ans=0.1 2023-10-14 09:00:49,122 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.31 vs. limit=15.0 2023-10-14 09:00:59,448 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.69 vs. limit=15.0 2023-10-14 09:01:06,496 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.01 vs. limit=10.0 2023-10-14 09:01:14,253 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.61 vs. limit=15.0 2023-10-14 09:01:14,630 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.627e+02 1.857e+02 2.069e+02 2.294e+02 3.025e+02, threshold=4.138e+02, percent-clipped=0.0 2023-10-14 09:01:29,620 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1659564.6666666667, ans=0.125 2023-10-14 09:01:30,906 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.86 vs. limit=15.0 2023-10-14 09:01:37,444 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1659611.3333333333, ans=0.1 2023-10-14 09:02:00,307 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=10.00 vs. 
limit=22.5 2023-10-14 09:02:06,237 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1659751.3333333333, ans=0.125 2023-10-14 09:02:10,113 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1659751.3333333333, ans=0.2 2023-10-14 09:02:45,320 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1659891.3333333333, ans=0.1 2023-10-14 09:03:05,486 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.604e+02 1.823e+02 1.971e+02 2.185e+02 2.882e+02, threshold=3.943e+02, percent-clipped=0.0 2023-10-14 09:03:10,405 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1660031.3333333333, ans=0.1 2023-10-14 09:03:20,980 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1660078.0, ans=0.125 2023-10-14 09:03:35,171 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1660124.6666666667, ans=0.0 2023-10-14 09:04:20,253 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1660311.3333333333, ans=0.125 2023-10-14 09:04:25,943 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1660311.3333333333, ans=0.125 2023-10-14 09:04:46,370 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=10.20 vs. limit=10.0 2023-10-14 09:04:48,963 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.22 vs. limit=15.0 2023-10-14 09:04:57,305 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.641e+02 1.837e+02 2.036e+02 2.228e+02 3.264e+02, threshold=4.073e+02, percent-clipped=0.0 2023-10-14 09:04:57,681 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1660451.3333333333, ans=0.125 2023-10-14 09:05:03,303 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0 2023-10-14 09:05:03,347 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.78 vs. limit=15.0 2023-10-14 09:05:04,859 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1660498.0, ans=0.2 2023-10-14 09:05:29,015 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1660591.3333333333, ans=0.125 2023-10-14 09:05:32,226 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.77 vs. 
limit=15.0 2023-10-14 09:05:37,577 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1660638.0, ans=0.125 2023-10-14 09:05:52,528 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1660684.6666666667, ans=0.125 2023-10-14 09:05:59,398 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1660684.6666666667, ans=0.125 2023-10-14 09:06:02,483 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1660684.6666666667, ans=0.125 2023-10-14 09:06:22,840 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1660778.0, ans=0.035 2023-10-14 09:06:30,565 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1660824.6666666667, ans=0.0 2023-10-14 09:06:31,568 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1660824.6666666667, ans=0.04949747468305833 2023-10-14 09:06:33,489 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1660824.6666666667, ans=0.125 2023-10-14 09:06:36,078 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1660871.3333333333, ans=0.1 2023-10-14 09:06:40,837 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1660871.3333333333, ans=0.125 2023-10-14 09:06:52,021 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.565e+02 1.834e+02 1.973e+02 2.183e+02 3.067e+02, threshold=3.946e+02, percent-clipped=0.0 2023-10-14 09:06:56,348 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1660964.6666666667, ans=0.025 2023-10-14 09:07:31,594 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1661104.6666666667, ans=10.0 2023-10-14 09:07:43,073 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1661151.3333333333, ans=0.0 2023-10-14 09:07:53,445 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1661198.0, ans=0.0 2023-10-14 09:08:15,248 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1661291.3333333333, ans=0.1 2023-10-14 09:08:30,560 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.42 vs. 
limit=15.0 2023-10-14 09:08:30,989 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1661338.0, ans=0.125 2023-10-14 09:08:36,010 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1661338.0, ans=0.125 2023-10-14 09:08:43,659 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.560e+02 1.768e+02 1.895e+02 2.075e+02 2.674e+02, threshold=3.790e+02, percent-clipped=0.0 2023-10-14 09:08:49,654 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1661431.3333333333, ans=0.1 2023-10-14 09:08:57,240 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1661431.3333333333, ans=0.125 2023-10-14 09:08:59,888 INFO [train.py:1031] (3/4) Epoch 27, batch 1000, loss[loss=0.1807, simple_loss=0.2812, pruned_loss=0.04005, over 16858.00 frames. ], tot_loss[loss=0.1867, simple_loss=0.2792, pruned_loss=0.04711, over 12946400.86 frames. ], batch size: 175, lr: 1.28e-03, grad_scale: 16.0 2023-10-14 09:09:10,242 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1661524.6666666667, ans=0.0 2023-10-14 09:09:22,753 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1661571.3333333333, ans=0.0 2023-10-14 09:09:26,588 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1661571.3333333333, ans=0.04949747468305833 2023-10-14 09:09:39,210 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1661618.0, ans=0.125 2023-10-14 09:09:49,781 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1661664.6666666667, ans=0.2 2023-10-14 09:10:05,012 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1661758.0, ans=0.125 2023-10-14 09:10:12,300 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1661804.6666666667, ans=0.2 2023-10-14 09:10:28,079 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.568e+02 1.892e+02 2.081e+02 2.315e+02 3.226e+02, threshold=4.161e+02, percent-clipped=0.0 2023-10-14 09:10:45,996 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1661944.6666666667, ans=0.125 2023-10-14 09:10:47,302 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1661944.6666666667, ans=0.0 2023-10-14 09:10:47,351 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1661944.6666666667, ans=0.0 2023-10-14 09:10:48,620 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1661944.6666666667, ans=0.125 2023-10-14 09:11:18,862 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1662038.0, ans=0.0 2023-10-14 09:11:37,304 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1662131.3333333333, ans=0.125 2023-10-14 09:11:55,028 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.54 vs. limit=12.0 2023-10-14 09:12:30,513 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.493e+02 1.750e+02 1.885e+02 2.174e+02 3.228e+02, threshold=3.770e+02, percent-clipped=0.0 2023-10-14 09:12:36,821 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1662364.6666666667, ans=0.125 2023-10-14 09:12:46,266 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 09:12:57,416 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1662458.0, ans=0.1 2023-10-14 09:12:59,405 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1662458.0, ans=0.125 2023-10-14 09:12:59,416 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1662458.0, ans=0.1 2023-10-14 09:13:07,736 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1662504.6666666667, ans=0.125 2023-10-14 09:13:10,486 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1662504.6666666667, ans=0.2 2023-10-14 09:13:10,757 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.03 vs. limit=10.0 2023-10-14 09:13:11,561 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1662504.6666666667, ans=0.125 2023-10-14 09:13:34,720 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1662598.0, ans=0.125 2023-10-14 09:13:44,322 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1662644.6666666667, ans=0.125 2023-10-14 09:13:45,320 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1662644.6666666667, ans=0.1 2023-10-14 09:13:45,568 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.63 vs. 
limit=15.0 2023-10-14 09:13:54,552 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1662691.3333333333, ans=0.125 2023-10-14 09:14:06,613 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1662738.0, ans=0.0 2023-10-14 09:14:07,531 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1662738.0, ans=0.125 2023-10-14 09:14:17,305 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.479e+02 1.811e+02 2.049e+02 2.442e+02 3.152e+02, threshold=4.099e+02, percent-clipped=0.0 2023-10-14 09:14:21,350 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1662831.3333333333, ans=0.0 2023-10-14 09:14:22,516 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1662831.3333333333, ans=0.0 2023-10-14 09:14:29,426 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1662831.3333333333, ans=0.2 2023-10-14 09:14:33,507 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1662878.0, ans=0.2 2023-10-14 09:14:51,097 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.11 vs. limit=22.5 2023-10-14 09:15:08,488 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1663018.0, ans=0.0 2023-10-14 09:15:12,264 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=1663018.0, ans=0.05 2023-10-14 09:15:35,015 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.95 vs. limit=10.0 2023-10-14 09:15:39,797 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1663158.0, ans=0.0 2023-10-14 09:15:40,680 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.71 vs. limit=15.0 2023-10-14 09:15:42,162 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=1663158.0, ans=15.0 2023-10-14 09:15:43,147 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.10 vs. 
limit=15.0 2023-10-14 09:15:46,175 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1663204.6666666667, ans=0.1 2023-10-14 09:16:04,019 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.527e+02 1.799e+02 1.939e+02 2.088e+02 3.320e+02, threshold=3.879e+02, percent-clipped=0.0 2023-10-14 09:16:04,452 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1663251.3333333333, ans=0.0 2023-10-14 09:16:32,458 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1663344.6666666667, ans=0.0 2023-10-14 09:16:40,423 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.23 vs. limit=15.0 2023-10-14 09:17:00,124 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1663484.6666666667, ans=0.125 2023-10-14 09:17:24,704 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1663578.0, ans=0.0 2023-10-14 09:17:35,152 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.57 vs. limit=22.5 2023-10-14 09:17:38,860 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1663671.3333333333, ans=0.0 2023-10-14 09:17:53,962 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.466e+02 1.754e+02 1.959e+02 2.118e+02 3.199e+02, threshold=3.918e+02, percent-clipped=0.0 2023-10-14 09:18:11,069 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1663764.6666666667, ans=0.0 2023-10-14 09:18:13,321 INFO [train.py:1031] (3/4) Epoch 27, batch 1500, loss[loss=0.1634, simple_loss=0.2633, pruned_loss=0.03177, over 16877.00 frames. ], tot_loss[loss=0.1855, simple_loss=0.2779, pruned_loss=0.04662, over 17367580.59 frames. ], batch size: 104, lr: 1.28e-03, grad_scale: 32.0 2023-10-14 09:18:16,480 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.69 vs. limit=15.0 2023-10-14 09:18:39,336 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1663904.6666666667, ans=0.5 2023-10-14 09:19:28,862 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 09:19:45,062 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.544e+02 1.782e+02 1.908e+02 2.102e+02 2.972e+02, threshold=3.817e+02, percent-clipped=0.0 2023-10-14 09:19:50,488 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1664231.3333333333, ans=0.0 2023-10-14 09:20:02,902 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1664278.0, ans=0.0 2023-10-14 09:20:15,410 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.08 vs. 
limit=22.5 2023-10-14 09:20:19,380 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1664324.6666666667, ans=0.07 2023-10-14 09:20:22,098 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1664371.3333333333, ans=0.1 2023-10-14 09:20:31,523 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1664418.0, ans=0.125 2023-10-14 09:20:40,266 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.55 vs. limit=15.0 2023-10-14 09:20:46,046 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1664464.6666666667, ans=0.0 2023-10-14 09:21:28,168 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1664604.6666666667, ans=0.125 2023-10-14 09:21:38,419 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.514e+02 1.775e+02 1.904e+02 2.092e+02 2.505e+02, threshold=3.809e+02, percent-clipped=0.0 2023-10-14 09:21:43,590 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.43 vs. limit=15.0 2023-10-14 09:21:44,273 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1664698.0, ans=0.125 2023-10-14 09:21:51,499 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1664698.0, ans=0.0 2023-10-14 09:21:54,336 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1664744.6666666667, ans=0.125 2023-10-14 09:22:14,090 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.70 vs. 
limit=15.0 2023-10-14 09:22:43,558 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1664931.3333333333, ans=0.125 2023-10-14 09:22:46,288 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1664978.0, ans=0.125 2023-10-14 09:23:07,921 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1665071.3333333333, ans=0.125 2023-10-14 09:23:27,742 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.563e+02 1.840e+02 2.087e+02 2.357e+02 2.992e+02, threshold=4.173e+02, percent-clipped=0.0 2023-10-14 09:23:54,231 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1665211.3333333333, ans=0.125 2023-10-14 09:23:57,996 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1665258.0, ans=0.125 2023-10-14 09:23:58,812 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=1665258.0, ans=10.0 2023-10-14 09:23:58,818 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1665258.0, ans=0.0 2023-10-14 09:24:02,895 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1665258.0, ans=0.0 2023-10-14 09:24:11,473 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.21 vs. limit=22.5 2023-10-14 09:24:20,297 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1665351.3333333333, ans=0.125 2023-10-14 09:24:34,888 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1665398.0, ans=0.0 2023-10-14 09:24:49,098 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1665444.6666666667, ans=0.1 2023-10-14 09:24:53,149 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1665491.3333333333, ans=0.1 2023-10-14 09:24:55,834 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.75 vs. 
limit=22.5 2023-10-14 09:25:19,450 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.555e+02 1.835e+02 1.988e+02 2.209e+02 3.411e+02, threshold=3.977e+02, percent-clipped=0.0 2023-10-14 09:25:19,833 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1665584.6666666667, ans=0.125 2023-10-14 09:25:52,579 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1665724.6666666667, ans=0.0 2023-10-14 09:26:11,768 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1665818.0, ans=0.2 2023-10-14 09:26:19,146 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1665864.6666666667, ans=0.2 2023-10-14 09:26:23,931 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1665864.6666666667, ans=0.2 2023-10-14 09:26:41,643 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1665911.3333333333, ans=0.0 2023-10-14 09:27:02,123 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1666004.6666666667, ans=0.1 2023-10-14 09:27:21,737 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.487e+02 1.823e+02 1.929e+02 2.193e+02 4.546e+02, threshold=3.858e+02, percent-clipped=1.0 2023-10-14 09:27:25,060 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=1666051.3333333333, ans=15.0 2023-10-14 09:27:30,794 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1666098.0, ans=0.1 2023-10-14 09:27:37,779 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1666144.6666666667, ans=0.0 2023-10-14 09:27:37,938 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.23 vs. limit=15.0 2023-10-14 09:27:38,919 INFO [train.py:1031] (3/4) Epoch 27, batch 2000, loss[loss=0.1852, simple_loss=0.2879, pruned_loss=0.0413, over 16818.00 frames. ], tot_loss[loss=0.1858, simple_loss=0.2782, pruned_loss=0.04672, over 20744480.77 frames. 
], batch size: 87, lr: 1.28e-03, grad_scale: 32.0 2023-10-14 09:28:17,286 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1666238.0, ans=0.125 2023-10-14 09:28:27,900 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1666284.6666666667, ans=0.2 2023-10-14 09:28:52,003 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1666378.0, ans=0.1 2023-10-14 09:28:52,084 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1666378.0, ans=0.125 2023-10-14 09:28:55,232 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1666424.6666666667, ans=0.0 2023-10-14 09:29:00,872 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1666424.6666666667, ans=0.2 2023-10-14 09:29:02,983 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1666424.6666666667, ans=0.0 2023-10-14 09:29:26,981 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.544e+02 1.819e+02 2.043e+02 2.265e+02 3.390e+02, threshold=4.087e+02, percent-clipped=0.0 2023-10-14 09:29:33,315 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1666564.6666666667, ans=0.2 2023-10-14 09:29:35,072 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1666564.6666666667, ans=0.125 2023-10-14 09:30:09,699 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1666658.0, ans=0.125 2023-10-14 09:30:34,508 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1666704.6666666667, ans=0.1 2023-10-14 09:30:35,501 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1666704.6666666667, ans=0.125 2023-10-14 09:30:42,896 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1666751.3333333333, ans=0.125 2023-10-14 09:30:53,848 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1666798.0, ans=0.2 2023-10-14 09:30:58,311 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1666798.0, ans=0.125 2023-10-14 09:31:20,442 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.77 vs. limit=15.0 2023-10-14 09:31:31,160 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1666938.0, ans=0.0 2023-10-14 09:31:31,884 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1666938.0, ans=0.125 2023-10-14 09:31:43,234 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.13 vs. 
limit=15.0 2023-10-14 09:31:46,643 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1666984.6666666667, ans=0.125 2023-10-14 09:31:47,180 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.493e+02 1.834e+02 2.017e+02 2.244e+02 2.925e+02, threshold=4.034e+02, percent-clipped=0.0 2023-10-14 09:31:47,481 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1666984.6666666667, ans=0.0 2023-10-14 09:32:18,985 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1667124.6666666667, ans=0.125 2023-10-14 09:32:40,743 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1667218.0, ans=0.2 2023-10-14 09:33:03,899 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1667358.0, ans=0.125 2023-10-14 09:33:36,294 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.545e+02 1.867e+02 2.008e+02 2.224e+02 2.702e+02, threshold=4.015e+02, percent-clipped=0.0 2023-10-14 09:33:43,864 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1667498.0, ans=0.125 2023-10-14 09:33:51,902 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1667544.6666666667, ans=0.125 2023-10-14 09:33:52,760 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1667544.6666666667, ans=0.025 2023-10-14 09:34:10,489 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1667638.0, ans=0.0 2023-10-14 09:34:40,409 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.60 vs. limit=15.0 2023-10-14 09:34:48,434 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1667778.0, ans=0.125 2023-10-14 09:35:15,689 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1667871.3333333333, ans=0.2 2023-10-14 09:35:19,906 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1667918.0, ans=0.05 2023-10-14 09:35:24,865 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.27 vs. 
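The Whitening lines report a periodic diagnostic: for each named activation, the channel covariance (optionally split into num_groups groups) is summarized by a single whiteness metric and compared against a scheduled limit, e.g. metric=7.27 vs. limit=15.0 above. The log does not show how the metric is computed; the sketch below uses one simple choice, the ratio E[lambda^2] / (E[lambda])^2 over covariance eigenvalues, which equals 1.0 for a perfectly white (isotropic) covariance and grows as the spectrum skews, so treat the formula, function name, and shapes as assumptions rather than scaling.py internals.

import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    # x: (num_frames, num_channels); channels are split into num_groups groups.
    n, c = x.shape
    assert c % num_groups == 0
    x = x.reshape(n, num_groups, c // num_groups)
    worst = 0.0
    for g in range(num_groups):
        xg = x[:, g, :]
        xg = xg - xg.mean(dim=0, keepdim=True)
        cov = (xg.t() @ xg) / n                    # channel covariance
        eigs = torch.linalg.eigvalsh(cov)          # eigenvalues, ascending
        worst = max(worst, float((eigs ** 2).mean() / eigs.mean() ** 2))
    return worst

x = torch.randn(1000, 384)       # near-white activations
print(whitening_metric(x))       # ~1.4, comfortably under a limit like 15.0

Presumably a module whose metric exceeds its limit is nudged back toward whiteness; only the measurement half is sketched here.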
limit=15.0 2023-10-14 09:35:26,646 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.619e+02 1.936e+02 2.103e+02 2.370e+02 2.947e+02, threshold=4.207e+02, percent-clipped=0.0 2023-10-14 09:35:26,842 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1667918.0, ans=0.2 2023-10-14 09:35:27,899 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1667964.6666666667, ans=0.0 2023-10-14 09:36:11,627 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1668151.3333333333, ans=0.0 2023-10-14 09:36:27,430 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1668198.0, ans=0.2 2023-10-14 09:36:41,519 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.45 vs. limit=15.0 2023-10-14 09:36:41,569 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.02 vs. limit=22.5 2023-10-14 09:36:56,294 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.75 vs. limit=8.0 2023-10-14 09:36:56,889 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1668338.0, ans=0.1 2023-10-14 09:36:58,772 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1668338.0, ans=0.0 2023-10-14 09:37:12,097 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.599e+02 1.855e+02 2.058e+02 2.310e+02 2.908e+02, threshold=4.116e+02, percent-clipped=0.0 2023-10-14 09:37:13,456 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1668431.3333333333, ans=0.0 2023-10-14 09:37:24,140 INFO [train.py:1031] (3/4) Epoch 27, batch 2500, loss[loss=0.1813, simple_loss=0.274, pruned_loss=0.04426, over 16475.00 frames. ], tot_loss[loss=0.1861, simple_loss=0.2786, pruned_loss=0.04682, over 23436257.72 frames. ], batch size: 266, lr: 1.28e-03, grad_scale: 16.0 2023-10-14 09:37:37,549 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1668524.6666666667, ans=0.2 2023-10-14 09:37:37,572 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1668524.6666666667, ans=0.09899494936611666 2023-10-14 09:37:41,324 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1668524.6666666667, ans=0.04949747468305833 2023-10-14 09:37:46,145 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.51 vs. limit=15.0 2023-10-14 09:37:55,597 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.22 vs. 
limit=15.0 2023-10-14 09:38:33,678 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1668758.0, ans=0.0 2023-10-14 09:38:48,257 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1668851.3333333333, ans=0.125 2023-10-14 09:38:58,439 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.615e+02 1.897e+02 2.070e+02 2.304e+02 5.805e+02, threshold=4.141e+02, percent-clipped=1.0 2023-10-14 09:39:15,818 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1668944.6666666667, ans=0.0 2023-10-14 09:39:37,019 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten.whitening_limit, batch_count=1669038.0, ans=15.0 2023-10-14 09:39:38,449 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1669038.0, ans=0.2 2023-10-14 09:39:54,471 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1669131.3333333333, ans=0.125 2023-10-14 09:39:56,559 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.42 vs. limit=22.5 2023-10-14 09:40:08,911 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 09:40:31,174 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.47 vs. limit=15.0 2023-10-14 09:40:32,427 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1669271.3333333333, ans=0.125 2023-10-14 09:40:33,156 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1669271.3333333333, ans=0.0 2023-10-14 09:40:44,083 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1669318.0, ans=0.0 2023-10-14 09:40:44,666 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.549e+02 1.821e+02 1.993e+02 2.187e+02 3.390e+02, threshold=3.986e+02, percent-clipped=0.0 2023-10-14 09:40:46,503 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1669364.6666666667, ans=0.0 2023-10-14 09:40:51,140 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1669364.6666666667, ans=0.125 2023-10-14 09:40:58,890 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1669411.3333333333, ans=0.125 2023-10-14 09:41:01,405 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1669411.3333333333, ans=0.2 2023-10-14 09:41:04,123 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1669411.3333333333, ans=0.125 2023-10-14 09:41:04,504 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.99 vs. 
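The optim.py lines summarize adaptive gradient clipping over a recent window: five grad-norm order statistics (min, lower quartile, median, upper quartile, max), the clipping threshold, and the fraction of batches actually clipped. In every entry above the threshold matches Clipping_scale (2.0) times the logged median to within the displayed rounding, e.g. 2.0 * 1.988e+02 = 3.976e+02 against threshold=3.977e+02, so a median-based rule is assumed in the sketch below; the window size, class name, and use of torch.quantile are illustrative rather than the actual optim.py internals.

from collections import deque
import torch

class AdaptiveGradClipper:
    def __init__(self, clipping_scale: float = 2.0, window: int = 500):
        self.scale = clipping_scale
        self.norms = deque(maxlen=window)   # recent overall grad norms
        self.clipped = 0
        self.seen = 0

    def step(self, params) -> float:
        grads = [p.grad for p in params if p.grad is not None]
        norm = float(torch.norm(torch.stack([g.norm() for g in grads])))
        self.norms.append(norm)
        history = torch.tensor(list(self.norms))
        threshold = self.scale * float(history.quantile(0.5))
        self.seen += 1
        if norm > threshold:
            self.clipped += 1
            for g in grads:              # rescale so the norm equals threshold
                g.mul_(threshold / norm)
        return threshold

# percent-clipped as logged above would then be 100.0 * clipped / seen over
# the reporting interval, and the quartiles come from history.quantile(...).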
limit=22.5 2023-10-14 09:41:32,543 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1669551.3333333333, ans=0.1 2023-10-14 09:41:39,712 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1669551.3333333333, ans=0.1 2023-10-14 09:41:44,192 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1669598.0, ans=0.125 2023-10-14 09:42:14,270 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1669691.3333333333, ans=0.5 2023-10-14 09:42:26,140 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1669738.0, ans=0.125 2023-10-14 09:42:45,355 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.502e+02 1.786e+02 1.964e+02 2.192e+02 3.131e+02, threshold=3.928e+02, percent-clipped=0.0 2023-10-14 09:42:49,721 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1669831.3333333333, ans=0.1 2023-10-14 09:43:06,289 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.36 vs. limit=15.0 2023-10-14 09:43:19,823 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1669924.6666666667, ans=0.125 2023-10-14 09:43:35,569 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1670018.0, ans=0.1 2023-10-14 09:43:57,141 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1670064.6666666667, ans=0.125 2023-10-14 09:43:59,637 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1670111.3333333333, ans=0.125 2023-10-14 09:44:05,892 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1670111.3333333333, ans=0.0 2023-10-14 09:44:12,923 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1670158.0, ans=0.1 2023-10-14 09:44:21,964 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=1670158.0, ans=10.0 2023-10-14 09:44:34,687 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1670204.6666666667, ans=0.05 2023-10-14 09:44:37,683 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1670251.3333333333, ans=0.0 2023-10-14 09:44:46,379 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.545e+02 1.820e+02 2.018e+02 2.273e+02 2.903e+02, threshold=4.036e+02, percent-clipped=0.0 2023-10-14 09:44:49,889 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1670298.0, ans=0.125 2023-10-14 09:44:56,639 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1670298.0, 
ans=0.125 2023-10-14 09:45:24,419 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1670391.3333333333, ans=0.125 2023-10-14 09:45:26,385 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.76 vs. limit=6.0 2023-10-14 09:45:43,324 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1670484.6666666667, ans=0.125 2023-10-14 09:45:50,348 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1670531.3333333333, ans=0.125 2023-10-14 09:45:53,171 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1670531.3333333333, ans=0.125 2023-10-14 09:46:06,649 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1670578.0, ans=0.0 2023-10-14 09:46:41,092 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.576e+02 1.811e+02 1.990e+02 2.206e+02 2.848e+02, threshold=3.981e+02, percent-clipped=0.0 2023-10-14 09:46:43,087 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1670764.6666666667, ans=0.0 2023-10-14 09:46:51,450 INFO [train.py:1031] (3/4) Epoch 27, batch 3000, loss[loss=0.1828, simple_loss=0.2749, pruned_loss=0.04541, over 15947.00 frames. ], tot_loss[loss=0.1856, simple_loss=0.2779, pruned_loss=0.04669, over 25526224.73 frames. ], batch size: 43, lr: 1.28e-03, grad_scale: 8.0 2023-10-14 09:46:53,371 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1670811.3333333333, ans=0.125 2023-10-14 09:46:54,364 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.61 vs. 
limit=15.0 2023-10-14 09:47:01,782 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1670858.0, ans=0.125 2023-10-14 09:47:17,574 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1670904.6666666667, ans=0.0 2023-10-14 09:47:17,865 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys.whitening_limit, batch_count=1670904.6666666667, ans=6.0 2023-10-14 09:47:18,433 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1670904.6666666667, ans=0.2 2023-10-14 09:47:24,479 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1670951.3333333333, ans=0.0 2023-10-14 09:47:41,682 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1670998.0, ans=0.125 2023-10-14 09:48:02,296 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1671091.3333333333, ans=0.04949747468305833 2023-10-14 09:48:17,616 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1671138.0, ans=0.1 2023-10-14 09:48:20,365 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1671184.6666666667, ans=0.125 2023-10-14 09:48:21,610 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=3.72 vs. limit=12.0 2023-10-14 09:48:24,312 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1671184.6666666667, ans=0.2 2023-10-14 09:48:30,651 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.493e+02 1.789e+02 1.948e+02 2.134e+02 2.677e+02, threshold=3.896e+02, percent-clipped=0.0 2023-10-14 09:48:59,631 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1671324.6666666667, ans=0.07 2023-10-14 09:49:31,097 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1671418.0, ans=0.0 2023-10-14 09:49:41,226 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1671464.6666666667, ans=0.125 2023-10-14 09:50:08,892 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1671604.6666666667, ans=0.1 2023-10-14 09:50:20,162 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1671651.3333333333, ans=0.125 2023-10-14 09:50:21,480 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.34 vs. limit=12.0 2023-10-14 09:50:29,700 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.479e+02 1.787e+02 1.946e+02 2.089e+02 3.241e+02, threshold=3.891e+02, percent-clipped=0.0 2023-10-14 09:50:31,352 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.92 vs. 
limit=15.0 2023-10-14 09:50:51,023 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.01 vs. limit=15.0 2023-10-14 09:50:55,956 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.whiten.whitening_limit, batch_count=1671791.3333333333, ans=12.0 2023-10-14 09:51:13,447 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1671838.0, ans=0.125 2023-10-14 09:51:41,635 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1671931.3333333333, ans=0.125 2023-10-14 09:52:03,429 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1672024.6666666667, ans=0.125 2023-10-14 09:52:19,980 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1672071.3333333333, ans=0.0 2023-10-14 09:52:22,460 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1672071.3333333333, ans=0.0 2023-10-14 09:52:29,922 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1672118.0, ans=0.0 2023-10-14 09:52:31,059 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1672118.0, ans=0.125 2023-10-14 09:52:36,272 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1672164.6666666667, ans=0.0 2023-10-14 09:52:37,722 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.558e+02 1.815e+02 1.957e+02 2.178e+02 3.010e+02, threshold=3.915e+02, percent-clipped=0.0 2023-10-14 09:52:38,028 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=1672164.6666666667, ans=0.05 2023-10-14 09:52:57,734 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1672211.3333333333, ans=0.125 2023-10-14 09:53:11,921 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1672258.0, ans=0.0 2023-10-14 09:53:20,576 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1672304.6666666667, ans=0.1 2023-10-14 09:53:43,605 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1672398.0, ans=0.0 2023-10-14 09:53:49,782 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1672444.6666666667, ans=0.0 2023-10-14 09:54:10,453 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1672538.0, ans=0.1 2023-10-14 09:54:17,481 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1672538.0, ans=0.125 2023-10-14 09:54:41,635 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.506e+02 1.891e+02 2.068e+02 2.259e+02 3.172e+02, threshold=4.137e+02, percent-clipped=0.0 2023-10-14 09:54:47,051 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1672631.3333333333, ans=0.125 2023-10-14 09:55:04,848 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.89 vs. limit=15.0 2023-10-14 09:55:05,642 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1672724.6666666667, ans=0.125 2023-10-14 09:55:35,770 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.66 vs. limit=15.0 2023-10-14 09:55:44,446 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1672864.6666666667, ans=0.0 2023-10-14 09:55:51,933 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1672911.3333333333, ans=0.0 2023-10-14 09:55:52,269 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.34 vs. limit=15.0 2023-10-14 09:56:26,278 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.68 vs. limit=22.5 2023-10-14 09:56:32,865 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1673051.3333333333, ans=0.0 2023-10-14 09:56:39,448 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.662e+02 1.855e+02 2.002e+02 2.190e+02 2.978e+02, threshold=4.004e+02, percent-clipped=0.0 2023-10-14 09:56:53,390 INFO [train.py:1031] (3/4) Epoch 27, batch 3500, loss[loss=0.2015, simple_loss=0.296, pruned_loss=0.05355, over 16905.00 frames. ], tot_loss[loss=0.1854, simple_loss=0.2775, pruned_loss=0.04662, over 27109521.93 frames. 
], batch size: 138, lr: 1.28e-03, grad_scale: 16.0 2023-10-14 09:56:53,579 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1673144.6666666667, ans=0.125 2023-10-14 09:57:31,995 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1673284.6666666667, ans=0.125 2023-10-14 09:57:40,266 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1673284.6666666667, ans=0.0 2023-10-14 09:58:12,879 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1673424.6666666667, ans=0.125 2023-10-14 09:58:52,855 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.511e+02 1.966e+02 2.149e+02 2.459e+02 4.602e+02, threshold=4.298e+02, percent-clipped=1.0 2023-10-14 09:58:59,649 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1673564.6666666667, ans=0.125 2023-10-14 09:59:06,040 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1673611.3333333333, ans=0.125 2023-10-14 09:59:07,915 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1673611.3333333333, ans=0.0 2023-10-14 09:59:13,038 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.54 vs. limit=15.0 2023-10-14 09:59:18,654 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1673658.0, ans=0.125 2023-10-14 09:59:38,743 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1673704.6666666667, ans=0.0 2023-10-14 09:59:59,078 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1673751.3333333333, ans=0.125 2023-10-14 10:00:02,695 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1673798.0, ans=0.1 2023-10-14 10:00:14,109 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 10:00:16,568 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1673844.6666666667, ans=0.0 2023-10-14 10:00:23,886 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1673891.3333333333, ans=0.125 2023-10-14 10:00:47,059 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1673984.6666666667, ans=0.0 2023-10-14 10:00:55,050 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1673984.6666666667, ans=0.0 2023-10-14 10:00:56,896 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1673984.6666666667, ans=0.125 2023-10-14 10:01:00,974 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.520e+02 1.806e+02 1.973e+02 2.278e+02 3.650e+02, 
threshold=3.945e+02, percent-clipped=0.0 2023-10-14 10:01:17,231 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1674078.0, ans=0.0 2023-10-14 10:01:18,033 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1674078.0, ans=0.0 2023-10-14 10:01:21,864 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1674078.0, ans=0.125 2023-10-14 10:01:50,227 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1674171.3333333333, ans=0.0 2023-10-14 10:01:52,835 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1674218.0, ans=0.125 2023-10-14 10:01:54,912 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1674218.0, ans=0.125 2023-10-14 10:02:16,181 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1674311.3333333333, ans=0.125 2023-10-14 10:02:26,790 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1674311.3333333333, ans=0.0 2023-10-14 10:02:27,148 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.41 vs. limit=12.0 2023-10-14 10:02:31,081 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1674358.0, ans=0.125 2023-10-14 10:02:52,943 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff2.min_abs, batch_count=1674451.3333333333, ans=0.1 2023-10-14 10:03:06,645 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.532e+02 1.774e+02 1.998e+02 2.172e+02 2.795e+02, threshold=3.995e+02, percent-clipped=0.0 2023-10-14 10:03:12,059 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.58 vs. limit=15.0 2023-10-14 10:03:12,618 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=1674498.0, ans=0.05 2023-10-14 10:03:28,117 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1674544.6666666667, ans=0.0 2023-10-14 10:03:29,266 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1674591.3333333333, ans=0.0 2023-10-14 10:03:39,702 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1674591.3333333333, ans=0.125 2023-10-14 10:03:42,351 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1674591.3333333333, ans=0.025 2023-10-14 10:03:51,842 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.50 vs. limit=12.0 2023-10-14 10:04:17,085 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.70 vs. 
limit=10.0 2023-10-14 10:04:17,605 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1674731.3333333333, ans=0.125 2023-10-14 10:04:21,987 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1674778.0, ans=0.0 2023-10-14 10:04:39,201 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1674824.6666666667, ans=0.0 2023-10-14 10:04:58,945 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_abs, batch_count=1674871.3333333333, ans=0.5 2023-10-14 10:05:14,472 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.432e+02 1.746e+02 1.875e+02 2.230e+02 2.957e+02, threshold=3.750e+02, percent-clipped=0.0 2023-10-14 10:05:23,382 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1674964.6666666667, ans=0.2 2023-10-14 10:05:31,447 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1675011.3333333333, ans=0.0 2023-10-14 10:05:39,757 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 10:05:42,579 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1675058.0, ans=0.07 2023-10-14 10:05:43,432 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1675058.0, ans=0.125 2023-10-14 10:06:21,472 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1675198.0, ans=0.125 2023-10-14 10:06:27,973 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1675244.6666666667, ans=0.2 2023-10-14 10:06:42,514 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1675291.3333333333, ans=0.125 2023-10-14 10:06:59,908 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1675384.6666666667, ans=0.2 2023-10-14 10:07:07,387 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.10 vs. limit=12.0 2023-10-14 10:07:12,211 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.477e+02 1.762e+02 1.935e+02 2.171e+02 2.801e+02, threshold=3.871e+02, percent-clipped=0.0 2023-10-14 10:07:13,600 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1675431.3333333333, ans=0.125 2023-10-14 10:07:14,728 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1675431.3333333333, ans=0.125 2023-10-14 10:07:20,730 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1675431.3333333333, ans=0.125 2023-10-14 10:07:22,768 INFO [train.py:1031] (3/4) Epoch 27, batch 4000, loss[loss=0.1891, simple_loss=0.2843, pruned_loss=0.04702, over 16956.00 frames. ], tot_loss[loss=0.1853, simple_loss=0.2772, pruned_loss=0.04671, over 28378274.66 frames. 
], batch size: 123, lr: 1.27e-03, grad_scale: 32.0 2023-10-14 10:07:50,923 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.52 vs. limit=22.5 2023-10-14 10:07:52,635 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1675571.3333333333, ans=0.125 2023-10-14 10:08:13,718 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1675664.6666666667, ans=0.0 2023-10-14 10:08:15,803 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.90 vs. limit=15.0 2023-10-14 10:08:18,473 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1675664.6666666667, ans=0.2 2023-10-14 10:08:22,799 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1675664.6666666667, ans=0.2 2023-10-14 10:08:24,459 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1675664.6666666667, ans=0.0 2023-10-14 10:08:33,943 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1675711.3333333333, ans=0.0 2023-10-14 10:08:42,301 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.77 vs. limit=6.0 2023-10-14 10:09:17,293 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.544e+02 1.869e+02 2.046e+02 2.247e+02 3.139e+02, threshold=4.092e+02, percent-clipped=0.0 2023-10-14 10:09:17,629 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1675898.0, ans=0.1 2023-10-14 10:09:20,372 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 10:09:29,492 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1675944.6666666667, ans=0.0 2023-10-14 10:09:36,262 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.60 vs. limit=15.0 2023-10-14 10:09:53,113 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1676038.0, ans=0.0 2023-10-14 10:10:05,796 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1676084.6666666667, ans=0.1 2023-10-14 10:10:15,902 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1676131.3333333333, ans=0.125 2023-10-14 10:10:23,394 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 10:10:33,754 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1676178.0, ans=0.0 2023-10-14 10:10:33,996 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.35 vs. 
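Many ScheduledFloat entries above belong to balancers (hidden_balancer.prob, balancer1.prob, nonlin_attention.balancer.min_positive, conv_module balancer min_abs/max_abs): scheduled constraints on per-channel activation statistics, where prob appears to be the probability the correction runs on a given step and the other values bound the fraction of positive activations and the mean absolute value. The measurement half is sketched below; the correction that scaling.py applies when a channel drifts out of range is omitted, and the function name and default bounds are illustrative.

import torch

def balancer_violations(x: torch.Tensor, min_positive: float = 0.05,
                        min_abs: float = 0.2, max_abs: float = 10.0):
    # x: (num_frames, num_channels); returns channels outside each bound.
    pos_frac = (x > 0).float().mean(dim=0)   # fraction of positive activations
    mean_abs = x.abs().mean(dim=0)           # mean |activation| per channel
    return {
        "too_rarely_positive": (pos_frac < min_positive).nonzero().flatten(),
        "too_small": (mean_abs < min_abs).nonzero().flatten(),
        "too_large": (mean_abs > max_abs).nonzero().flatten(),
    }

violations = balancer_violations(torch.randn(1000, 256))
print({k: v.numel() for k, v in violations.items()})  # all ~0 for N(0,1) data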
limit=22.5 2023-10-14 10:10:46,061 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1676224.6666666667, ans=0.1 2023-10-14 10:10:46,611 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.30 vs. limit=10.0 2023-10-14 10:11:04,881 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1676271.3333333333, ans=0.1 2023-10-14 10:11:08,951 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.60 vs. limit=12.0 2023-10-14 10:11:10,669 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1676271.3333333333, ans=0.125 2023-10-14 10:11:15,356 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1676318.0, ans=0.125 2023-10-14 10:11:17,355 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1676318.0, ans=0.125 2023-10-14 10:11:29,949 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1676364.6666666667, ans=0.125 2023-10-14 10:11:31,343 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.564e+02 1.822e+02 1.949e+02 2.140e+02 3.283e+02, threshold=3.899e+02, percent-clipped=0.0 2023-10-14 10:11:32,924 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.69 vs. limit=15.0 2023-10-14 10:11:54,773 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.12 vs. limit=15.0 2023-10-14 10:12:15,108 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1676504.6666666667, ans=0.0 2023-10-14 10:12:17,709 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1676504.6666666667, ans=0.125 2023-10-14 10:12:30,736 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1676551.3333333333, ans=0.04949747468305833 2023-10-14 10:12:32,029 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=7.63 vs. limit=22.5 2023-10-14 10:12:39,503 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.33 vs. limit=15.0 2023-10-14 10:12:42,437 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1676598.0, ans=0.0 2023-10-14 10:12:51,431 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.61 vs. 
limit=15.0 2023-10-14 10:12:57,601 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1676691.3333333333, ans=0.125 2023-10-14 10:13:12,227 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1676738.0, ans=0.125 2023-10-14 10:13:14,935 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1676738.0, ans=0.1 2023-10-14 10:13:29,662 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 10:13:30,736 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1676784.6666666667, ans=0.1 2023-10-14 10:13:36,274 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.596e+02 1.827e+02 1.988e+02 2.160e+02 2.795e+02, threshold=3.976e+02, percent-clipped=0.0 2023-10-14 10:13:49,285 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1676878.0, ans=0.2 2023-10-14 10:14:38,449 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1677064.6666666667, ans=0.125 2023-10-14 10:14:43,331 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1677111.3333333333, ans=0.125 2023-10-14 10:14:53,062 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1677111.3333333333, ans=0.125 2023-10-14 10:14:54,951 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1677158.0, ans=0.0 2023-10-14 10:14:59,728 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1677158.0, ans=0.04949747468305833 2023-10-14 10:15:19,063 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1677204.6666666667, ans=0.125 2023-10-14 10:15:38,549 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.631e+02 1.862e+02 1.982e+02 2.124e+02 2.987e+02, threshold=3.964e+02, percent-clipped=0.0 2023-10-14 10:15:38,958 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1677298.0, ans=0.2 2023-10-14 10:15:52,583 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 10:16:11,971 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 10:16:50,982 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1677531.3333333333, ans=15.0 2023-10-14 10:16:54,521 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1677531.3333333333, ans=0.04949747468305833 2023-10-14 10:16:59,706 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1677578.0, ans=0.125 2023-10-14 10:16:59,781 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1677578.0, ans=0.1 2023-10-14 10:17:02,892 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1677578.0, ans=0.0 2023-10-14 10:17:03,971 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1677578.0, ans=0.125 2023-10-14 10:17:20,845 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1677624.6666666667, ans=0.1 2023-10-14 10:17:49,762 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.73 vs. limit=10.0 2023-10-14 10:17:54,159 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1677764.6666666667, ans=0.2 2023-10-14 10:17:54,715 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.504e+02 1.827e+02 2.002e+02 2.289e+02 4.093e+02, threshold=4.005e+02, percent-clipped=1.0 2023-10-14 10:18:04,806 INFO [train.py:1031] (3/4) Epoch 27, batch 4500, loss[loss=0.1698, simple_loss=0.2584, pruned_loss=0.04059, over 16109.00 frames. ], tot_loss[loss=0.1853, simple_loss=0.2777, pruned_loss=0.04648, over 29377677.25 frames. ], batch size: 43, lr: 1.27e-03, grad_scale: 32.0 2023-10-14 10:18:25,338 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.28 vs. limit=15.0 2023-10-14 10:18:27,852 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 10:18:44,397 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1677951.3333333333, ans=0.07 2023-10-14 10:18:54,948 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1677998.0, ans=0.0 2023-10-14 10:19:18,567 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1678091.3333333333, ans=0.0 2023-10-14 10:19:36,910 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1678184.6666666667, ans=0.0 2023-10-14 10:19:42,797 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-14 10:19:49,111 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.465e+02 1.770e+02 1.916e+02 2.101e+02 2.914e+02, threshold=3.832e+02, percent-clipped=0.0 2023-10-14 10:20:03,613 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1678278.0, ans=0.1 2023-10-14 10:20:28,753 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1678371.3333333333, ans=0.125 2023-10-14 10:20:36,908 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1678418.0, ans=0.0 2023-10-14 10:20:45,473 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1678464.6666666667, ans=0.0 2023-10-14 10:20:52,522 INFO [scaling.py:979] (3/4) Whitening: 
name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.98 vs. limit=15.0 2023-10-14 10:21:23,557 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1678604.6666666667, ans=0.2 2023-10-14 10:21:45,163 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.420e+02 1.886e+02 2.008e+02 2.244e+02 3.092e+02, threshold=4.015e+02, percent-clipped=0.0 2023-10-14 10:22:00,382 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1678744.6666666667, ans=0.125 2023-10-14 10:22:03,713 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.36 vs. limit=22.5 2023-10-14 10:22:13,067 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1678791.3333333333, ans=0.125 2023-10-14 10:22:24,661 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1678838.0, ans=0.125 2023-10-14 10:22:31,581 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.20 vs. limit=15.0 2023-10-14 10:22:42,841 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1678931.3333333333, ans=0.035 2023-10-14 10:23:32,162 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1679118.0, ans=0.125 2023-10-14 10:23:34,054 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1679118.0, ans=0.1 2023-10-14 10:23:34,584 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.95 vs. 
limit=6.0 2023-10-14 10:23:38,783 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1679118.0, ans=0.04949747468305833 2023-10-14 10:23:53,823 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.497e+02 1.779e+02 1.925e+02 2.109e+02 2.801e+02, threshold=3.849e+02, percent-clipped=0.0 2023-10-14 10:24:37,092 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1679304.6666666667, ans=0.125 2023-10-14 10:25:13,718 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1679444.6666666667, ans=0.0 2023-10-14 10:25:22,188 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1679491.3333333333, ans=0.125 2023-10-14 10:25:33,324 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1679538.0, ans=0.2 2023-10-14 10:25:57,547 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1679631.3333333333, ans=0.0 2023-10-14 10:26:00,326 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.440e+02 1.909e+02 2.031e+02 2.243e+02 3.076e+02, threshold=4.063e+02, percent-clipped=0.0 2023-10-14 10:26:41,721 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1679771.3333333333, ans=0.1 2023-10-14 10:26:46,672 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1679818.0, ans=0.0 2023-10-14 10:26:52,667 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1679818.0, ans=0.0 2023-10-14 10:27:13,357 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1679864.6666666667, ans=0.1 2023-10-14 10:27:18,229 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.48 vs. limit=6.0 2023-10-14 10:27:19,072 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1679911.3333333333, ans=0.125 2023-10-14 10:27:28,929 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.64 vs. limit=22.5 2023-10-14 10:28:00,746 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.20 vs. limit=12.0 2023-10-14 10:28:04,543 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1680051.3333333333, ans=0.0 2023-10-14 10:28:23,845 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.545e+02 1.833e+02 1.976e+02 2.157e+02 3.051e+02, threshold=3.952e+02, percent-clipped=0.0 2023-10-14 10:28:27,721 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1680098.0, ans=0.125 2023-10-14 10:28:32,622 INFO [train.py:1031] (3/4) Epoch 27, batch 5000, loss[loss=0.1723, simple_loss=0.2638, pruned_loss=0.04043, over 16828.00 frames. 
2023-10-14 10:28:53,222 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1680191.3333333333, ans=0.07 2023-10-14 10:29:17,206 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1680284.6666666667, ans=0.125 2023-10-14 10:29:20,859 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1680284.6666666667, ans=0.0 2023-10-14 10:29:21,651 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1680284.6666666667, ans=0.125 2023-10-14 10:29:21,663 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1680284.6666666667, ans=0.125 2023-10-14 10:29:38,792 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.95 vs. limit=10.0 2023-10-14 10:29:43,331 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1680378.0, ans=0.125 2023-10-14 10:29:43,570 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=9.77 vs. limit=22.5 2023-10-14 10:29:56,176 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1680424.6666666667, ans=0.125 2023-10-14 10:29:57,124 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1680424.6666666667, ans=0.1 2023-10-14 10:30:21,929 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1680518.0, ans=0.0 2023-10-14 10:30:30,195 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.491e+02 1.833e+02 1.992e+02 2.209e+02 2.946e+02, threshold=3.984e+02, percent-clipped=0.0 2023-10-14 10:30:55,013 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1680658.0, ans=0.2 2023-10-14 10:31:16,849 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1680704.6666666667, ans=0.125 2023-10-14 10:31:20,412 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1680751.3333333333, ans=0.0 2023-10-14 10:31:31,549 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1680798.0, ans=0.125 2023-10-14 10:31:33,776 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1680798.0, ans=0.1 2023-10-14 10:31:46,413 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1680844.6666666667, ans=0.125 2023-10-14 10:31:48,261 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.92 vs. limit=15.0
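Most entries in this section are ScheduledFloat updates from scaling.py:199: scalar hyperparameters such as dropout_p, skip_rate, scale_min and balancer prob values that are functions of the global batch_count rather than fixed constants. At batch_count around 1.68e6 nearly all of them have long since reached their final values, which is why the same ans values (0.125, 0.1, 0.2, ...) repeat. Below is a small sketch of such a schedule, assuming piecewise-linear interpolation between (batch_count, value) breakpoints; the real ScheduledFloat class in scaling.py carries more machinery than this:

class ScheduledFloat:
    """Float-valued hyperparameter interpolated against a batch counter.
    (Sketch under assumptions; not the actual scaling.py implementation.)"""

    def __init__(self, *points):
        self.points = sorted(points)   # (batch_count, value) breakpoints

    def value_at(self, batch_count: float) -> float:
        pts = self.points
        if batch_count <= pts[0][0]:
            return pts[0][1]
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if batch_count <= x1:      # linear interpolation inside [x0, x1]
                return y0 + (batch_count - x0) / (x1 - x0) * (y1 - y0)
        return pts[-1][1]              # flat after the last breakpoint

# e.g. a dropout decaying from 0.3 to 0.1 over the first 20k batches is
# long since flat at the batch counts logged here:
print(ScheduledFloat((0.0, 0.3), (20000.0, 0.1)).value_at(1680284.67))  # 0.1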
2023-10-14 10:31:57,501 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.52 vs. limit=22.5 2023-10-14 10:32:21,888 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1680938.0, ans=0.125 2023-10-14 10:32:31,273 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1680984.6666666667, ans=0.2 2023-10-14 10:32:35,913 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 10:32:49,796 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.627e+02 1.819e+02 2.030e+02 2.220e+02 3.325e+02, threshold=4.060e+02, percent-clipped=0.0 2023-10-14 10:32:54,128 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1681031.3333333333, ans=0.0 2023-10-14 10:32:54,136 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1681031.3333333333, ans=0.04949747468305833 2023-10-14 10:33:04,110 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1681078.0, ans=0.125 2023-10-14 10:33:09,270 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1681078.0, ans=0.125 2023-10-14 10:33:20,706 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1681124.6666666667, ans=0.125 2023-10-14 10:33:29,288 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1681171.3333333333, ans=0.125 2023-10-14 10:33:41,294 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1681218.0, ans=0.125 2023-10-14 10:33:46,595 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=1681218.0, ans=10.0 2023-10-14 10:33:55,738 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.65 vs. limit=15.0 2023-10-14 10:34:03,085 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1681264.6666666667, ans=0.125 2023-10-14 10:34:22,245 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1681358.0, ans=0.0 2023-10-14 10:34:28,490 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.13 vs. limit=15.0 2023-10-14 10:34:47,520 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.49 vs. limit=10.0
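The optim.py:471 lines summarize adaptive gradient clipping: with Clipping_scale=2.0 the clipping threshold tracks roughly twice the median of recent gradient norms, which the logged quartiles make easy to verify (just above, median 2.030e+02 against threshold=4.060e+02), and percent-clipped reports how often a batch actually hit the threshold. Below is a standalone sketch of that bookkeeping; AdaptiveGradClipper and its sliding window are illustrative assumptions, since the real optimizer folds this into its own update step:

import torch
from collections import deque

class AdaptiveGradClipper:
    """Clip gradients to clipping_scale * median of recently seen grad norms.
    (Hypothetical helper mirroring the logged statistics.)"""

    def __init__(self, clipping_scale: float = 2.0, window: int = 1000):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=window)

    def clip_(self, parameters):
        parameters = [p for p in parameters if p.grad is not None]
        # max_norm=inf: measure the total grad norm without clipping yet.
        norm = torch.nn.utils.clip_grad_norm_(parameters, max_norm=float("inf"))
        self.norms.append(norm.item())
        history = torch.tensor(list(self.norms))
        quartiles = torch.quantile(history, torch.linspace(0, 1, 5))  # min..max
        threshold = self.clipping_scale * quartiles[2].item()         # 2 * median
        if norm.item() > threshold:
            for p in parameters:
                p.grad.mul_(threshold / norm.item())
        return quartiles, threshold

percent-clipped=0.0 throughout this span then simply says that no batch in the logging window exceeded its threshold.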
2023-10-14 10:34:51,389 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1681451.3333333333, ans=0.125 2023-10-14 10:35:09,879 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1681498.0, ans=0.2 2023-10-14 10:35:10,571 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.509e+02 1.866e+02 2.075e+02 2.287e+02 3.292e+02, threshold=4.150e+02, percent-clipped=0.0 2023-10-14 10:35:52,595 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 10:35:54,398 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1681638.0, ans=0.125 2023-10-14 10:36:10,581 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1681684.6666666667, ans=0.125 2023-10-14 10:36:34,070 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1681778.0, ans=0.125 2023-10-14 10:36:39,598 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.49 vs. limit=15.0 2023-10-14 10:36:41,416 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1681824.6666666667, ans=0.125 2023-10-14 10:36:41,611 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=5.36 vs. limit=15.0 2023-10-14 10:36:45,223 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1681824.6666666667, ans=0.2 2023-10-14 10:36:50,011 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.50 vs. limit=22.5 2023-10-14 10:37:04,753 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1681918.0, ans=0.09899494936611666 2023-10-14 10:37:26,252 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.516e+02 1.763e+02 1.890e+02 2.133e+02 2.998e+02, threshold=3.780e+02, percent-clipped=0.0 2023-10-14 10:37:31,569 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 10:38:06,301 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1682104.6666666667, ans=0.09899494936611666 2023-10-14 10:38:26,752 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.12 vs.
limit=22.5 2023-10-14 10:38:55,252 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1682244.6666666667, ans=0.0 2023-10-14 10:39:06,123 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1682244.6666666667, ans=15.0 2023-10-14 10:39:31,295 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1682384.6666666667, ans=0.125 2023-10-14 10:39:40,973 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.46 vs. limit=15.0 2023-10-14 10:39:51,817 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.485e+02 1.813e+02 1.997e+02 2.209e+02 2.946e+02, threshold=3.994e+02, percent-clipped=0.0 2023-10-14 10:39:56,430 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1682478.0, ans=0.0 2023-10-14 10:39:57,041 INFO [train.py:1031] (3/4) Epoch 27, batch 5500, loss[loss=0.1742, simple_loss=0.269, pruned_loss=0.03972, over 16928.00 frames. ], tot_loss[loss=0.1853, simple_loss=0.2774, pruned_loss=0.04664, over 30710414.14 frames. ], batch size: 130, lr: 1.27e-03, grad_scale: 16.0 2023-10-14 10:40:03,591 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1682478.0, ans=0.2 2023-10-14 10:40:05,610 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 10:40:16,756 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1682524.6666666667, ans=0.125 2023-10-14 10:40:33,606 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.96 vs. limit=15.0 2023-10-14 10:41:02,773 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.65 vs. limit=15.0 2023-10-14 10:41:06,575 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1682758.0, ans=0.125 2023-10-14 10:41:18,798 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1682804.6666666667, ans=0.0 2023-10-14 10:41:51,508 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.451e+02 1.740e+02 1.902e+02 2.192e+02 2.995e+02, threshold=3.805e+02, percent-clipped=0.0 2023-10-14 10:41:53,567 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1682898.0, ans=0.125 2023-10-14 10:42:05,528 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.16 vs. limit=12.0 2023-10-14 10:42:43,870 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1683038.0, ans=0.2 2023-10-14 10:42:45,792 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.34 vs. 
limit=15.0 2023-10-14 10:42:58,154 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=1683084.6666666667, ans=0.07 2023-10-14 10:42:59,103 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1683131.3333333333, ans=0.1 2023-10-14 10:43:19,479 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1683178.0, ans=0.125 2023-10-14 10:43:37,662 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.32 vs. limit=22.5 2023-10-14 10:43:39,029 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1683271.3333333333, ans=0.1 2023-10-14 10:43:39,401 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.88 vs. limit=10.0 2023-10-14 10:43:48,286 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1683271.3333333333, ans=0.125 2023-10-14 10:44:09,389 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.62 vs. limit=15.0 2023-10-14 10:44:11,014 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1683364.6666666667, ans=0.2 2023-10-14 10:44:14,386 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1683364.6666666667, ans=0.125 2023-10-14 10:44:14,975 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.508e+02 1.833e+02 2.004e+02 2.154e+02 3.075e+02, threshold=4.007e+02, percent-clipped=0.0 2023-10-14 10:44:22,308 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1683411.3333333333, ans=0.1 2023-10-14 10:45:08,696 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1683551.3333333333, ans=0.125 2023-10-14 10:45:09,845 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.03 vs. 
limit=15.0 2023-10-14 10:45:31,523 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1683644.6666666667, ans=0.0 2023-10-14 10:45:34,254 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1683644.6666666667, ans=0.1 2023-10-14 10:45:34,311 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1683644.6666666667, ans=0.0 2023-10-14 10:45:39,623 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1683691.3333333333, ans=0.125 2023-10-14 10:45:43,746 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1683691.3333333333, ans=0.1 2023-10-14 10:45:48,130 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1683691.3333333333, ans=0.125 2023-10-14 10:46:05,340 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.51 vs. limit=22.5 2023-10-14 10:46:10,781 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1683784.6666666667, ans=0.1 2023-10-14 10:46:22,377 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.604e+02 1.807e+02 1.967e+02 2.206e+02 2.804e+02, threshold=3.935e+02, percent-clipped=0.0 2023-10-14 10:46:25,360 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1683831.3333333333, ans=0.125 2023-10-14 10:46:34,912 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1683878.0, ans=0.125 2023-10-14 10:46:39,372 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.11 vs. limit=15.0 2023-10-14 10:46:53,094 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1683924.6666666667, ans=0.125 2023-10-14 10:48:07,687 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1684158.0, ans=0.0 2023-10-14 10:48:50,399 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.84 vs. 
limit=15.0 2023-10-14 10:48:55,800 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.427e+02 1.742e+02 1.925e+02 2.130e+02 2.850e+02, threshold=3.849e+02, percent-clipped=0.0 2023-10-14 10:49:07,891 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=1684344.6666666667, ans=0.05 2023-10-14 10:49:20,772 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1684391.3333333333, ans=0.1 2023-10-14 10:49:30,608 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1684438.0, ans=0.0 2023-10-14 10:49:35,696 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1684438.0, ans=0.0 2023-10-14 10:49:38,727 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1684438.0, ans=0.125 2023-10-14 10:49:51,858 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.88 vs. limit=12.0 2023-10-14 10:49:56,062 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1684531.3333333333, ans=0.0 2023-10-14 10:50:00,678 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1684531.3333333333, ans=0.125 2023-10-14 10:50:05,439 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1684531.3333333333, ans=0.2 2023-10-14 10:50:11,279 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1684578.0, ans=0.125 2023-10-14 10:50:35,981 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1684671.3333333333, ans=0.125 2023-10-14 10:50:42,941 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1684671.3333333333, ans=0.2 2023-10-14 10:50:46,766 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1684718.0, ans=0.5 2023-10-14 10:51:05,791 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.384e+02 1.837e+02 1.998e+02 2.184e+02 2.794e+02, threshold=3.997e+02, percent-clipped=0.0 2023-10-14 10:51:11,259 INFO [train.py:1031] (3/4) Epoch 27, batch 6000, loss[loss=0.2259, simple_loss=0.3043, pruned_loss=0.07375, over 16450.00 frames. ], tot_loss[loss=0.1858, simple_loss=0.2777, pruned_loss=0.04693, over 31175776.23 frames. 
], batch size: 266, lr: 1.27e-03, grad_scale: 32.0 2023-10-14 10:51:13,080 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1684811.3333333333, ans=0.2 2023-10-14 10:51:23,986 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=1684858.0, ans=15.0 2023-10-14 10:51:31,571 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1684858.0, ans=0.125 2023-10-14 10:52:01,597 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1684998.0, ans=0.0 2023-10-14 10:52:17,103 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1684998.0, ans=0.125 2023-10-14 10:52:26,025 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1685044.6666666667, ans=0.0 2023-10-14 10:53:12,696 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1685231.3333333333, ans=0.1 2023-10-14 10:53:17,423 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.559e+02 1.808e+02 1.970e+02 2.237e+02 2.992e+02, threshold=3.939e+02, percent-clipped=0.0 2023-10-14 10:54:02,864 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.53 vs. limit=15.0 2023-10-14 10:54:58,466 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1685558.0, ans=0.125 2023-10-14 10:54:59,881 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=9.58 vs. limit=22.5 2023-10-14 10:55:23,156 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.55 vs. limit=15.0 2023-10-14 10:55:40,039 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.446e+02 1.886e+02 2.058e+02 2.244e+02 3.385e+02, threshold=4.117e+02, percent-clipped=0.0 2023-10-14 10:56:27,357 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.27 vs. 
limit=15.0 2023-10-14 10:56:38,860 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1685931.3333333333, ans=0.2 2023-10-14 10:56:41,984 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1685931.3333333333, ans=0.0 2023-10-14 10:56:50,706 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_na.min_abs, batch_count=1685978.0, ans=0.02 2023-10-14 10:57:24,099 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1686071.3333333333, ans=0.1 2023-10-14 10:57:26,590 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1686071.3333333333, ans=0.125 2023-10-14 10:57:26,606 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1686071.3333333333, ans=0.125 2023-10-14 10:57:33,587 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1686118.0, ans=0.0 2023-10-14 10:57:52,149 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.617e+02 1.852e+02 1.962e+02 2.185e+02 3.142e+02, threshold=3.924e+02, percent-clipped=0.0 2023-10-14 10:57:58,078 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1686211.3333333333, ans=0.125 2023-10-14 10:58:02,839 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.20 vs. limit=15.0 2023-10-14 10:58:47,386 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1686351.3333333333, ans=0.125 2023-10-14 10:59:33,389 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=1686491.3333333333, ans=0.05 2023-10-14 11:00:25,829 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.416e+02 1.827e+02 1.947e+02 2.089e+02 2.974e+02, threshold=3.893e+02, percent-clipped=0.0 2023-10-14 11:00:36,413 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1686678.0, ans=0.125 2023-10-14 11:00:46,826 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.47 vs. limit=15.0 2023-10-14 11:01:22,668 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.60 vs. 
limit=15.0 2023-10-14 11:01:26,876 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1686864.6666666667, ans=0.0 2023-10-14 11:01:41,167 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1686911.3333333333, ans=0.125 2023-10-14 11:01:47,092 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1686911.3333333333, ans=0.0 2023-10-14 11:02:00,941 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1686958.0, ans=0.125 2023-10-14 11:02:34,857 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1687098.0, ans=0.125 2023-10-14 11:02:35,219 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=3.58 vs. limit=15.0 2023-10-14 11:02:38,014 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1687098.0, ans=0.125 2023-10-14 11:02:41,493 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.569e+02 1.773e+02 2.026e+02 2.262e+02 3.217e+02, threshold=4.051e+02, percent-clipped=0.0 2023-10-14 11:02:47,186 INFO [train.py:1031] (3/4) Epoch 27, batch 6500, loss[loss=0.1903, simple_loss=0.2817, pruned_loss=0.04952, over 16625.00 frames. ], tot_loss[loss=0.1862, simple_loss=0.2782, pruned_loss=0.04713, over 31513866.80 frames. ], batch size: 61, lr: 1.27e-03, grad_scale: 32.0 2023-10-14 11:03:01,920 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1687191.3333333333, ans=0.125 2023-10-14 11:03:25,212 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=1687238.0, ans=0.025 2023-10-14 11:03:30,895 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1687238.0, ans=0.1 2023-10-14 11:04:12,748 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1687378.0, ans=0.0 2023-10-14 11:04:25,057 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1687378.0, ans=0.1 2023-10-14 11:04:25,908 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.66 vs. limit=5.0 2023-10-14 11:04:29,223 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.10 vs. 
limit=15.0 2023-10-14 11:04:44,076 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1687424.6666666667, ans=0.0 2023-10-14 11:04:47,329 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1687471.3333333333, ans=0.0 2023-10-14 11:04:53,138 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1687471.3333333333, ans=0.125 2023-10-14 11:04:58,122 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1687471.3333333333, ans=0.0 2023-10-14 11:05:02,051 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1687471.3333333333, ans=0.125 2023-10-14 11:05:18,202 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1687518.0, ans=0.125 2023-10-14 11:05:21,051 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1687518.0, ans=0.0 2023-10-14 11:05:27,117 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1687564.6666666667, ans=0.125 2023-10-14 11:05:27,162 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1687564.6666666667, ans=0.07 2023-10-14 11:05:27,684 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=5.69 vs. limit=15.0 2023-10-14 11:05:37,062 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.613e+02 1.931e+02 2.079e+02 2.326e+02 2.972e+02, threshold=4.158e+02, percent-clipped=0.0 2023-10-14 11:05:53,918 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.whiten.whitening_limit, batch_count=1687611.3333333333, ans=12.0 2023-10-14 11:06:28,386 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1687704.6666666667, ans=0.0 2023-10-14 11:06:30,134 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1687751.3333333333, ans=0.1 2023-10-14 11:06:32,224 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1687751.3333333333, ans=0.125 2023-10-14 11:06:33,202 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1687751.3333333333, ans=0.125 2023-10-14 11:06:42,779 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1687798.0, ans=0.125 2023-10-14 11:06:53,697 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1687798.0, ans=0.2 2023-10-14 11:07:03,047 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1687844.6666666667, ans=0.1 2023-10-14 11:07:16,420 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.42 vs. 
limit=22.5 2023-10-14 11:07:19,203 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=1687891.3333333333, ans=0.025 2023-10-14 11:07:27,319 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1687938.0, ans=0.09899494936611666 2023-10-14 11:07:43,205 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.36 vs. limit=22.5 2023-10-14 11:07:55,732 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1688031.3333333333, ans=0.0 2023-10-14 11:07:58,053 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.486e+02 1.830e+02 1.997e+02 2.217e+02 3.260e+02, threshold=3.993e+02, percent-clipped=0.0 2023-10-14 11:08:18,677 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.33 vs. limit=10.0 2023-10-14 11:08:31,961 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.39 vs. limit=15.0 2023-10-14 11:08:34,636 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.02 vs. limit=22.5 2023-10-14 11:08:41,674 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1688218.0, ans=0.125 2023-10-14 11:08:42,960 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=10.99 vs. limit=22.5 2023-10-14 11:08:59,700 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1688264.6666666667, ans=0.125 2023-10-14 11:09:01,362 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.87 vs. limit=15.0 2023-10-14 11:09:02,420 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1688264.6666666667, ans=0.125 2023-10-14 11:09:15,552 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1688311.3333333333, ans=10.0 2023-10-14 11:09:45,909 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1688404.6666666667, ans=0.125 2023-10-14 11:09:46,778 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1688404.6666666667, ans=0.125 2023-10-14 11:09:57,046 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.54 vs. 
limit=15.0 2023-10-14 11:10:18,048 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.521e+02 1.742e+02 1.976e+02 2.183e+02 4.121e+02, threshold=3.952e+02, percent-clipped=1.0 2023-10-14 11:10:24,160 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1688544.6666666667, ans=0.0 2023-10-14 11:10:44,041 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1688591.3333333333, ans=0.2 2023-10-14 11:10:47,781 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1688591.3333333333, ans=0.125 2023-10-14 11:11:21,064 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1688684.6666666667, ans=0.125 2023-10-14 11:11:57,471 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1688778.0, ans=0.0 2023-10-14 11:12:18,931 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1688824.6666666667, ans=0.125 2023-10-14 11:12:26,426 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1688871.3333333333, ans=0.2 2023-10-14 11:12:47,541 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 11:12:59,362 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1688964.6666666667, ans=0.125 2023-10-14 11:13:02,028 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.485e+02 1.798e+02 1.919e+02 2.145e+02 3.215e+02, threshold=3.838e+02, percent-clipped=0.0 2023-10-14 11:13:04,758 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1689011.3333333333, ans=0.0 2023-10-14 11:13:04,808 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1689011.3333333333, ans=0.0 2023-10-14 11:13:09,661 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.33 vs. limit=22.5 2023-10-14 11:13:13,390 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.82 vs. 
limit=15.0 2023-10-14 11:13:15,649 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1689011.3333333333, ans=0.0 2023-10-14 11:13:16,638 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1689011.3333333333, ans=0.0 2023-10-14 11:13:22,778 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1689058.0, ans=0.0 2023-10-14 11:13:58,545 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1689198.0, ans=0.07 2023-10-14 11:14:26,770 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1689291.3333333333, ans=0.1 2023-10-14 11:14:28,493 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.61 vs. limit=15.0 2023-10-14 11:14:46,154 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1689338.0, ans=0.2 2023-10-14 11:15:09,267 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=9.64 vs. limit=22.5 2023-10-14 11:15:15,190 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.596e+02 1.885e+02 2.002e+02 2.279e+02 3.290e+02, threshold=4.004e+02, percent-clipped=0.0 2023-10-14 11:15:17,708 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1689478.0, ans=0.0 2023-10-14 11:15:18,605 INFO [train.py:1031] (3/4) Epoch 27, batch 7000, loss[loss=0.1854, simple_loss=0.2794, pruned_loss=0.04571, over 16522.00 frames. ], tot_loss[loss=0.1862, simple_loss=0.2786, pruned_loss=0.04692, over 31804740.88 frames. ], batch size: 56, lr: 1.27e-03, grad_scale: 32.0 2023-10-14 11:15:24,835 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1689478.0, ans=0.2 2023-10-14 11:15:26,789 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1689478.0, ans=0.95 2023-10-14 11:15:40,862 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1689524.6666666667, ans=0.2 2023-10-14 11:15:49,956 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1689571.3333333333, ans=10.0 2023-10-14 11:16:03,169 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1689571.3333333333, ans=0.125 2023-10-14 11:16:15,779 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1689618.0, ans=0.125 2023-10-14 11:16:27,568 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1689664.6666666667, ans=0.2 2023-10-14 11:16:34,194 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1689664.6666666667, ans=0.125 2023-10-14 11:17:14,774 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=3.97 vs. 
limit=12.0 2023-10-14 11:17:58,919 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.542e+02 1.895e+02 2.043e+02 2.334e+02 3.118e+02, threshold=4.085e+02, percent-clipped=0.0 2023-10-14 11:18:00,841 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1689898.0, ans=0.125 2023-10-14 11:18:08,394 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1689944.6666666667, ans=0.125 2023-10-14 11:18:13,839 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1689944.6666666667, ans=0.125 2023-10-14 11:18:19,040 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.29 vs. limit=12.0 2023-10-14 11:19:11,594 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1690131.3333333333, ans=0.0 2023-10-14 11:19:22,055 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.32 vs. limit=10.0 2023-10-14 11:19:24,785 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1690178.0, ans=0.0 2023-10-14 11:19:28,615 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1690178.0, ans=0.1 2023-10-14 11:19:38,436 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.97 vs. limit=22.5 2023-10-14 11:19:48,499 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1690271.3333333333, ans=0.05 2023-10-14 11:19:50,803 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1690271.3333333333, ans=0.2 2023-10-14 11:20:25,672 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1690364.6666666667, ans=0.125 2023-10-14 11:20:27,583 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.577e+02 1.810e+02 2.052e+02 2.313e+02 3.310e+02, threshold=4.104e+02, percent-clipped=0.0 2023-10-14 11:20:41,795 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1690411.3333333333, ans=0.1 2023-10-14 11:20:42,147 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=7.96 vs. 
limit=22.5 2023-10-14 11:21:12,433 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1690458.0, ans=0.125 2023-10-14 11:21:16,036 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1690504.6666666667, ans=0.0 2023-10-14 11:21:33,708 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1690551.3333333333, ans=0.125 2023-10-14 11:21:57,365 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1690598.0, ans=0.125 2023-10-14 11:23:23,043 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.84 vs. limit=15.0 2023-10-14 11:23:45,644 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1690831.3333333333, ans=0.1 2023-10-14 11:23:48,657 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.523e+02 1.845e+02 2.084e+02 2.364e+02 3.269e+02, threshold=4.168e+02, percent-clipped=0.0 2023-10-14 11:24:05,385 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1690878.0, ans=0.125 2023-10-14 11:24:21,006 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1690924.6666666667, ans=0.125 2023-10-14 11:24:30,789 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1690971.3333333333, ans=0.125 2023-10-14 11:25:10,413 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1691064.6666666667, ans=0.125 2023-10-14 11:25:15,144 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1691064.6666666667, ans=0.2 2023-10-14 11:25:30,968 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.15 vs. limit=12.0 2023-10-14 11:25:31,947 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1691111.3333333333, ans=0.1 2023-10-14 11:25:50,036 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=1691158.0, ans=10.0 2023-10-14 11:25:53,438 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=1691158.0, ans=10.0 2023-10-14 11:26:17,137 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1691204.6666666667, ans=0.0 2023-10-14 11:26:20,401 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=1691251.3333333333, ans=0.5 2023-10-14 11:27:00,766 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.418e+02 1.814e+02 2.058e+02 2.236e+02 3.070e+02, threshold=4.115e+02, percent-clipped=0.0 2023-10-14 11:27:12,098 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=3.64 vs. 
limit=15.0 2023-10-14 11:27:25,891 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1691391.3333333333, ans=0.1 2023-10-14 11:27:48,446 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1691438.0, ans=0.125 2023-10-14 11:28:58,438 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1691624.6666666667, ans=0.1 2023-10-14 11:29:52,245 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1691718.0, ans=0.1 2023-10-14 11:30:10,316 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1691764.6666666667, ans=10.0 2023-10-14 11:30:12,521 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1691764.6666666667, ans=0.125 2023-10-14 11:30:12,672 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1691764.6666666667, ans=0.125 2023-10-14 11:30:20,944 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.574e+02 1.863e+02 2.070e+02 2.325e+02 3.365e+02, threshold=4.140e+02, percent-clipped=0.0 2023-10-14 11:30:24,412 INFO [train.py:1031] (3/4) Epoch 27, batch 7500, loss[loss=0.2028, simple_loss=0.2938, pruned_loss=0.05593, over 16891.00 frames. ], tot_loss[loss=0.1861, simple_loss=0.2784, pruned_loss=0.04687, over 32020336.79 frames. ], batch size: 87, lr: 1.27e-03, grad_scale: 32.0 2023-10-14 11:31:45,855 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1691998.0, ans=0.125 2023-10-14 11:32:00,921 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1692044.6666666667, ans=0.1 2023-10-14 11:32:28,506 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=11.73 vs. limit=22.5 2023-10-14 11:32:56,149 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=1692184.6666666667, ans=10.0 2023-10-14 11:33:21,849 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.622e+02 1.902e+02 2.110e+02 2.353e+02 3.247e+02, threshold=4.221e+02, percent-clipped=0.0 2023-10-14 11:33:57,903 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.70 vs. 
limit=15.0 2023-10-14 11:34:03,167 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1692371.3333333333, ans=0.125 2023-10-14 11:34:35,814 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1692418.0, ans=0.125 2023-10-14 11:34:54,780 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1692464.6666666667, ans=0.125 2023-10-14 11:35:16,959 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=1692511.3333333333, ans=10.0 2023-10-14 11:35:30,966 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1692558.0, ans=0.125 2023-10-14 11:36:52,102 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1692698.0, ans=0.2 2023-10-14 11:36:53,910 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.449e+02 1.830e+02 1.946e+02 2.162e+02 2.979e+02, threshold=3.891e+02, percent-clipped=0.0 2023-10-14 11:38:40,264 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.04 vs. limit=12.0 2023-10-14 11:38:57,649 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1692978.0, ans=0.0 2023-10-14 11:39:14,204 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1693024.6666666667, ans=0.1 2023-10-14 11:39:35,775 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.20 vs. limit=22.5 2023-10-14 11:40:03,916 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1693164.6666666667, ans=0.0 2023-10-14 11:40:08,414 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=1693164.6666666667, ans=0.05 2023-10-14 11:40:17,282 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.536e+02 1.849e+02 1.982e+02 2.118e+02 2.687e+02, threshold=3.964e+02, percent-clipped=0.0 2023-10-14 11:40:28,689 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1693211.3333333333, ans=0.2 2023-10-14 11:40:59,154 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.74 vs. limit=10.0 2023-10-14 11:41:01,297 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.81 vs. 
limit=6.0 2023-10-14 11:41:33,949 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1693351.3333333333, ans=10.0 2023-10-14 11:41:34,022 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1693351.3333333333, ans=0.125 2023-10-14 11:41:42,151 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1693398.0, ans=0.1 2023-10-14 11:42:14,893 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1693444.6666666667, ans=0.2 2023-10-14 11:42:25,810 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1693491.3333333333, ans=0.0 2023-10-14 11:42:27,359 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1693491.3333333333, ans=0.125 2023-10-14 11:42:27,641 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1693491.3333333333, ans=0.125 2023-10-14 11:42:35,028 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.24 vs. limit=15.0 2023-10-14 11:43:40,041 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1693584.6666666667, ans=0.125 2023-10-14 11:43:53,657 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1693631.3333333333, ans=0.0 2023-10-14 11:44:15,525 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.563e+02 1.789e+02 1.962e+02 2.292e+02 3.407e+02, threshold=3.925e+02, percent-clipped=0.0 2023-10-14 11:44:16,161 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.42 vs. limit=15.0 2023-10-14 11:44:25,733 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1693678.0, ans=0.125 2023-10-14 11:44:52,478 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1693678.0, ans=0.0 2023-10-14 11:45:33,767 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.69 vs. limit=15.0 2023-10-14 11:47:20,120 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1693958.0, ans=0.125 2023-10-14 11:47:31,949 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1694004.6666666667, ans=0.125 2023-10-14 11:47:53,980 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.99 vs. 
limit=22.5 2023-10-14 11:47:57,347 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1694051.3333333333, ans=0.125 2023-10-14 11:48:26,359 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.481e+02 1.726e+02 1.863e+02 2.132e+02 2.948e+02, threshold=3.725e+02, percent-clipped=0.0 2023-10-14 11:48:30,902 INFO [train.py:1031] (3/4) Epoch 27, batch 8000, loss[loss=0.1612, simple_loss=0.2681, pruned_loss=0.02717, over 16924.00 frames. ], tot_loss[loss=0.1854, simple_loss=0.2778, pruned_loss=0.04647, over 32180124.23 frames. ], batch size: 93, lr: 1.27e-03, grad_scale: 32.0 2023-10-14 11:49:00,383 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1694238.0, ans=0.125 2023-10-14 11:49:30,069 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1694331.3333333333, ans=0.0 2023-10-14 11:49:36,669 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.29 vs. limit=15.0 2023-10-14 11:49:43,148 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1694378.0, ans=0.125 2023-10-14 11:49:48,141 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1694378.0, ans=0.125 2023-10-14 11:50:06,612 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.25 vs. limit=15.0 2023-10-14 11:50:15,452 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1694518.0, ans=0.0 2023-10-14 11:50:16,573 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1694518.0, ans=0.1 2023-10-14 11:50:35,734 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.92 vs. limit=10.0 2023-10-14 11:50:36,927 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.355e+02 1.732e+02 1.871e+02 2.069e+02 2.971e+02, threshold=3.742e+02, percent-clipped=0.0 2023-10-14 11:50:59,502 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1694658.0, ans=0.1 2023-10-14 11:51:09,209 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1694704.6666666667, ans=0.05 2023-10-14 11:51:21,798 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 11:51:41,843 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.23 vs. 
limit=15.0 2023-10-14 11:51:44,034 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1694844.6666666667, ans=0.125 2023-10-14 11:51:46,690 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1694844.6666666667, ans=0.0 2023-10-14 11:51:49,036 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1694844.6666666667, ans=0.1 2023-10-14 11:51:51,219 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.62 vs. limit=22.5 2023-10-14 11:51:55,332 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.39 vs. limit=15.0 2023-10-14 11:52:20,943 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1694891.3333333333, ans=0.125 2023-10-14 11:52:22,639 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1694938.0, ans=0.125 2023-10-14 11:52:43,030 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.26 vs. limit=15.0 2023-10-14 11:52:49,571 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1694984.6666666667, ans=0.0 2023-10-14 11:52:53,349 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1695031.3333333333, ans=0.2 2023-10-14 11:53:08,659 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.579e+02 1.798e+02 2.003e+02 2.195e+02 2.903e+02, threshold=4.007e+02, percent-clipped=0.0 2023-10-14 11:53:39,001 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1695171.3333333333, ans=0.125 2023-10-14 11:53:48,753 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1695218.0, ans=0.125 2023-10-14 11:53:50,756 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1695218.0, ans=0.07 2023-10-14 11:53:50,807 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1695218.0, ans=0.125 2023-10-14 11:54:05,753 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.51 vs. limit=15.0 2023-10-14 11:54:53,582 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1695451.3333333333, ans=0.125 2023-10-14 11:54:59,723 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1695451.3333333333, ans=0.1 2023-10-14 11:55:05,242 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1695498.0, ans=0.125 2023-10-14 11:55:05,771 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.59 vs. 
limit=22.5 2023-10-14 11:55:09,126 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1695498.0, ans=0.0 2023-10-14 11:55:16,322 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.451e+02 1.763e+02 1.934e+02 2.175e+02 3.068e+02, threshold=3.868e+02, percent-clipped=0.0 2023-10-14 11:55:16,945 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1695544.6666666667, ans=0.125 2023-10-14 11:55:42,265 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1695638.0, ans=0.125 2023-10-14 11:55:54,926 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1695684.6666666667, ans=0.1 2023-10-14 11:56:05,429 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1695731.3333333333, ans=0.125 2023-10-14 11:56:14,248 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1695731.3333333333, ans=0.1 2023-10-14 11:56:22,073 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1695778.0, ans=0.125 2023-10-14 11:56:27,598 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1695778.0, ans=0.125 2023-10-14 11:56:32,671 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1695824.6666666667, ans=0.035 2023-10-14 11:56:44,343 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1695871.3333333333, ans=0.2 2023-10-14 11:56:50,256 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.10 vs. limit=15.0 2023-10-14 11:56:56,764 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=6.54 vs. limit=15.0 2023-10-14 11:57:07,759 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1695918.0, ans=0.0 2023-10-14 11:57:19,013 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1695964.6666666667, ans=0.125 2023-10-14 11:57:21,802 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=1695964.6666666667, ans=0.05 2023-10-14 11:57:22,857 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1695964.6666666667, ans=0.025 2023-10-14 11:57:26,317 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.543e+02 1.845e+02 1.970e+02 2.193e+02 2.917e+02, threshold=3.940e+02, percent-clipped=0.0 2023-10-14 11:57:35,006 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1696011.3333333333, ans=0.125 2023-10-14 11:57:54,505 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.64 vs. 
limit=15.0 2023-10-14 11:58:06,850 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1696104.6666666667, ans=0.125 2023-10-14 11:58:12,208 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1696151.3333333333, ans=0.0 2023-10-14 11:58:54,956 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1696291.3333333333, ans=0.125 2023-10-14 11:59:05,358 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1696338.0, ans=0.125 2023-10-14 11:59:27,219 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.07 vs. limit=15.0 2023-10-14 11:59:42,713 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.41 vs. limit=22.5 2023-10-14 11:59:47,916 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.583e+02 1.842e+02 2.027e+02 2.240e+02 4.078e+02, threshold=4.054e+02, percent-clipped=1.0 2023-10-14 11:59:47,971 INFO [train.py:1031] (3/4) Epoch 27, batch 8500, loss[loss=0.1871, simple_loss=0.2851, pruned_loss=0.04456, over 16850.00 frames. ], tot_loss[loss=0.1856, simple_loss=0.2782, pruned_loss=0.04649, over 32320379.67 frames. ], batch size: 175, lr: 1.27e-03, grad_scale: 16.0 2023-10-14 12:00:00,796 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1696524.6666666667, ans=0.1 2023-10-14 12:00:44,218 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1696664.6666666667, ans=0.1 2023-10-14 12:01:01,634 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1696711.3333333333, ans=0.125 2023-10-14 12:01:18,466 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1696758.0, ans=0.0 2023-10-14 12:01:24,981 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1696804.6666666667, ans=0.125 2023-10-14 12:01:56,359 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.60 vs. 
limit=15.0 2023-10-14 12:02:03,989 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.543e+02 1.895e+02 2.088e+02 2.354e+02 3.023e+02, threshold=4.177e+02, percent-clipped=0.0 2023-10-14 12:03:47,482 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1697271.3333333333, ans=0.1 2023-10-14 12:04:17,996 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1697364.6666666667, ans=0.1 2023-10-14 12:04:35,367 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.455e+02 1.754e+02 1.904e+02 2.182e+02 3.245e+02, threshold=3.809e+02, percent-clipped=0.0 2023-10-14 12:04:52,251 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=1697458.0, ans=0.95 2023-10-14 12:04:52,313 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1697458.0, ans=0.125 2023-10-14 12:04:52,567 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.83 vs. limit=6.0 2023-10-14 12:05:45,510 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.87 vs. limit=15.0 2023-10-14 12:05:46,318 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1697598.0, ans=0.035 2023-10-14 12:06:05,875 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1697644.6666666667, ans=0.0 2023-10-14 12:06:33,976 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1697738.0, ans=0.0 2023-10-14 12:06:43,589 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1697738.0, ans=0.2 2023-10-14 12:07:06,766 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1697831.3333333333, ans=0.1 2023-10-14 12:07:10,203 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.93 vs. limit=15.0 2023-10-14 12:07:21,979 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.474e+02 1.726e+02 1.883e+02 2.126e+02 2.624e+02, threshold=3.766e+02, percent-clipped=0.0 2023-10-14 12:07:44,566 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1697924.6666666667, ans=0.1 2023-10-14 12:07:55,493 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.84 vs. limit=12.0 2023-10-14 12:08:02,297 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1697971.3333333333, ans=0.125 2023-10-14 12:08:02,498 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1697971.3333333333, ans=0.0 2023-10-14 12:08:16,461 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.68 vs. 
limit=15.0 2023-10-14 12:08:27,743 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1698064.6666666667, ans=0.125 2023-10-14 12:08:31,288 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1698064.6666666667, ans=0.125 2023-10-14 12:09:07,414 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1698158.0, ans=0.0 2023-10-14 12:09:10,806 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1698158.0, ans=0.125 2023-10-14 12:09:52,025 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.93 vs. limit=15.0 2023-10-14 12:10:06,773 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.526e+02 1.793e+02 1.960e+02 2.156e+02 3.979e+02, threshold=3.920e+02, percent-clipped=1.0 2023-10-14 12:10:27,261 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1698391.3333333333, ans=0.125 2023-10-14 12:10:32,777 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 12:11:16,753 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.33 vs. limit=12.0 2023-10-14 12:11:17,688 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1698531.3333333333, ans=0.125 2023-10-14 12:12:18,578 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1698718.0, ans=0.0 2023-10-14 12:12:35,523 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1698764.6666666667, ans=0.125 2023-10-14 12:12:38,126 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1698764.6666666667, ans=0.0 2023-10-14 12:12:42,646 INFO [train.py:1031] (3/4) Epoch 27, batch 9000, loss[loss=0.169, simple_loss=0.276, pruned_loss=0.03094, over 16826.00 frames. ], tot_loss[loss=0.1849, simple_loss=0.2776, pruned_loss=0.04615, over 32447420.39 frames. ], batch size: 98, lr: 1.27e-03, grad_scale: 16.0 2023-10-14 12:12:43,710 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.594e+02 1.788e+02 1.991e+02 2.198e+02 3.303e+02, threshold=3.982e+02, percent-clipped=0.0 2023-10-14 12:13:09,788 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1698858.0, ans=0.125 2023-10-14 12:13:17,705 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.86 vs. 
limit=15.0 2023-10-14 12:13:31,345 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1698951.3333333333, ans=0.125 2023-10-14 12:13:42,519 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1698998.0, ans=0.2 2023-10-14 12:13:42,641 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 12:13:56,987 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1698998.0, ans=0.0 2023-10-14 12:13:58,139 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1698998.0, ans=0.125 2023-10-14 12:14:13,405 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1699091.3333333333, ans=0.0 2023-10-14 12:14:23,355 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1699091.3333333333, ans=0.0 2023-10-14 12:14:28,912 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1699091.3333333333, ans=0.0 2023-10-14 12:14:58,017 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1699184.6666666667, ans=0.1 2023-10-14 12:15:00,005 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1699184.6666666667, ans=0.1 2023-10-14 12:15:05,153 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.38 vs. limit=12.0 2023-10-14 12:15:17,890 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.510e+02 1.788e+02 1.911e+02 2.190e+02 2.862e+02, threshold=3.823e+02, percent-clipped=0.0 2023-10-14 12:15:24,754 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1699278.0, ans=0.125 2023-10-14 12:15:26,378 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.28 vs. limit=10.0 2023-10-14 12:16:20,103 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1699418.0, ans=0.125 2023-10-14 12:16:42,971 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1699511.3333333333, ans=0.125 2023-10-14 12:16:44,165 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1699511.3333333333, ans=0.125 2023-10-14 12:17:14,317 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.36 vs. 
limit=10.0 2023-10-14 12:17:15,323 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1699558.0, ans=0.125 2023-10-14 12:17:45,792 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 12:17:57,238 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1699698.0, ans=0.125 2023-10-14 12:18:08,468 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1699698.0, ans=0.2 2023-10-14 12:18:08,563 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1699698.0, ans=0.0 2023-10-14 12:18:14,481 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.520e+02 1.855e+02 2.006e+02 2.321e+02 3.044e+02, threshold=4.012e+02, percent-clipped=0.0 2023-10-14 12:18:33,584 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1699791.3333333333, ans=10.0 2023-10-14 12:18:56,694 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.06 vs. limit=15.0 2023-10-14 12:19:07,764 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1699884.6666666667, ans=0.125 2023-10-14 12:19:17,789 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1699884.6666666667, ans=0.2 2023-10-14 12:19:18,877 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1699931.3333333333, ans=0.125 2023-10-14 12:20:16,086 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.84 vs. limit=15.0 2023-10-14 12:20:24,466 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1700118.0, ans=0.0 2023-10-14 12:20:26,041 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1700118.0, ans=0.0 2023-10-14 12:20:47,083 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1700164.6666666667, ans=0.2 2023-10-14 12:21:03,227 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.493e+02 1.867e+02 2.001e+02 2.359e+02 3.015e+02, threshold=4.002e+02, percent-clipped=0.0 2023-10-14 12:21:11,469 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.53 vs. limit=15.0 2023-10-14 12:21:13,416 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1700211.3333333333, ans=0.0 2023-10-14 12:22:07,371 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1700398.0, ans=0.125 2023-10-14 12:22:22,489 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.01 vs. 
limit=15.0 2023-10-14 12:22:43,560 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1700444.6666666667, ans=0.0 2023-10-14 12:22:54,070 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1700491.3333333333, ans=0.2 2023-10-14 12:23:08,270 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1700491.3333333333, ans=0.125 2023-10-14 12:23:41,895 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1700584.6666666667, ans=0.2 2023-10-14 12:23:44,780 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1700584.6666666667, ans=0.125 2023-10-14 12:24:16,007 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1700678.0, ans=0.2 2023-10-14 12:24:19,252 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.522e+02 1.862e+02 2.114e+02 2.421e+02 3.565e+02, threshold=4.228e+02, percent-clipped=0.0 2023-10-14 12:24:44,461 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 12:24:49,472 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1700724.6666666667, ans=0.125 2023-10-14 12:25:17,367 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1700818.0, ans=0.1 2023-10-14 12:25:35,870 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1700864.6666666667, ans=0.125 2023-10-14 12:25:41,916 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=14.35 vs. limit=22.5 2023-10-14 12:25:41,926 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.46 vs. limit=15.0 2023-10-14 12:26:00,022 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1700911.3333333333, ans=0.05 2023-10-14 12:26:52,755 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1701051.3333333333, ans=0.0 2023-10-14 12:27:20,204 INFO [train.py:1031] (3/4) Epoch 27, batch 9500, loss[loss=0.1925, simple_loss=0.2919, pruned_loss=0.04654, over 16889.00 frames. ], tot_loss[loss=0.1857, simple_loss=0.2784, pruned_loss=0.04647, over 32536911.00 frames. 
], batch size: 165, lr: 1.27e-03, grad_scale: 16.0 2023-10-14 12:27:25,265 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.553e+02 1.860e+02 2.035e+02 2.255e+02 3.674e+02, threshold=4.071e+02, percent-clipped=0.0 2023-10-14 12:28:36,532 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1701284.6666666667, ans=0.125 2023-10-14 12:29:03,485 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1701331.3333333333, ans=0.1 2023-10-14 12:29:12,315 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.99 vs. limit=15.0 2023-10-14 12:29:18,145 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1701378.0, ans=0.1 2023-10-14 12:29:43,576 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.01 vs. limit=15.0 2023-10-14 12:29:54,518 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.77 vs. limit=8.0 2023-10-14 12:29:58,412 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1701471.3333333333, ans=0.125 2023-10-14 12:30:07,552 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1701471.3333333333, ans=0.09899494936611666 2023-10-14 12:30:53,509 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1701564.6666666667, ans=0.025 2023-10-14 12:30:53,656 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1701564.6666666667, ans=0.0 2023-10-14 12:31:04,161 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.606e+02 1.878e+02 2.031e+02 2.227e+02 3.316e+02, threshold=4.061e+02, percent-clipped=0.0 2023-10-14 12:31:11,410 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1701611.3333333333, ans=0.0 2023-10-14 12:31:54,120 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1701704.6666666667, ans=0.125 2023-10-14 12:32:02,632 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1701704.6666666667, ans=0.2 2023-10-14 12:32:31,587 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1701751.3333333333, ans=0.2 2023-10-14 12:32:59,393 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1701798.0, ans=0.09899494936611666 2023-10-14 12:33:16,384 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1701844.6666666667, ans=0.125 2023-10-14 12:34:43,038 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.80 vs. limit=15.0 2023-10-14 12:35:11,232 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.26 vs. 
limit=12.0 2023-10-14 12:35:13,316 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.511e+02 1.817e+02 1.988e+02 2.380e+02 4.107e+02, threshold=3.977e+02, percent-clipped=1.0 2023-10-14 12:35:23,736 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1702124.6666666667, ans=0.125 2023-10-14 12:35:31,361 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1702124.6666666667, ans=0.1 2023-10-14 12:35:43,154 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1702171.3333333333, ans=0.0 2023-10-14 12:35:44,196 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1702171.3333333333, ans=0.0 2023-10-14 12:35:45,151 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1702171.3333333333, ans=0.125 2023-10-14 12:35:49,930 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.28 vs. limit=15.0 2023-10-14 12:36:00,081 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1702218.0, ans=0.0 2023-10-14 12:36:01,440 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.41 vs. limit=22.5 2023-10-14 12:36:24,011 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.06 vs. limit=15.0 2023-10-14 12:36:27,624 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1702311.3333333333, ans=0.07 2023-10-14 12:36:32,918 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=10.08 vs. 
limit=22.5 2023-10-14 12:36:41,742 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1702358.0, ans=0.0 2023-10-14 12:36:41,834 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1702358.0, ans=0.125 2023-10-14 12:36:44,255 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1702404.6666666667, ans=0.0 2023-10-14 12:36:56,866 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1702451.3333333333, ans=0.125 2023-10-14 12:37:03,018 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1702451.3333333333, ans=0.2 2023-10-14 12:37:15,513 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1702498.0, ans=0.125 2023-10-14 12:37:17,518 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1702498.0, ans=0.2 2023-10-14 12:37:26,361 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.593e+02 1.860e+02 2.027e+02 2.365e+02 3.468e+02, threshold=4.054e+02, percent-clipped=0.0 2023-10-14 12:37:38,143 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=7.42 vs. limit=15.0 2023-10-14 12:37:38,907 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1702591.3333333333, ans=0.0 2023-10-14 12:37:46,229 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 12:37:56,242 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1702638.0, ans=0.1 2023-10-14 12:37:58,856 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1702684.6666666667, ans=0.2 2023-10-14 12:38:07,359 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.66 vs. 
limit=15.0 2023-10-14 12:38:13,585 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1702731.3333333333, ans=0.125 2023-10-14 12:38:21,519 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1702731.3333333333, ans=0.125 2023-10-14 12:38:34,802 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1702824.6666666667, ans=0.1 2023-10-14 12:38:59,219 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1702918.0, ans=0.125 2023-10-14 12:39:02,113 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1702918.0, ans=0.025 2023-10-14 12:39:12,410 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1702964.6666666667, ans=0.2 2023-10-14 12:39:24,522 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.15 vs. limit=22.5 2023-10-14 12:39:26,784 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.500e+02 1.758e+02 1.934e+02 2.156e+02 2.765e+02, threshold=3.868e+02, percent-clipped=0.0 2023-10-14 12:39:34,013 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1703011.3333333333, ans=0.0 2023-10-14 12:39:42,110 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1703058.0, ans=0.125 2023-10-14 12:39:58,030 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1703104.6666666667, ans=0.07 2023-10-14 12:40:15,390 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.11 vs. limit=22.5 2023-10-14 12:40:21,785 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.70 vs. limit=15.0 2023-10-14 12:40:22,934 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.51 vs. limit=15.0 2023-10-14 12:40:27,639 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1703244.6666666667, ans=0.0 2023-10-14 12:40:37,684 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1703291.3333333333, ans=0.125 2023-10-14 12:40:39,386 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1703291.3333333333, ans=0.2 2023-10-14 12:41:04,186 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.99 vs. limit=12.0 2023-10-14 12:41:15,731 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1703431.3333333333, ans=0.0 2023-10-14 12:41:21,823 INFO [train.py:1031] (3/4) Epoch 27, batch 10000, loss[loss=0.2006, simple_loss=0.2877, pruned_loss=0.05673, over 16892.00 frames. ], tot_loss[loss=0.185, simple_loss=0.2776, pruned_loss=0.04616, over 32605932.56 frames. 
], batch size: 130, lr: 1.26e-03, grad_scale: 32.0 2023-10-14 12:41:23,692 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.464e+02 1.778e+02 1.981e+02 2.187e+02 3.000e+02, threshold=3.961e+02, percent-clipped=0.0 2023-10-14 12:41:33,069 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1703524.6666666667, ans=0.125 2023-10-14 12:41:39,508 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.27 vs. limit=10.0 2023-10-14 12:41:41,036 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1703524.6666666667, ans=0.2 2023-10-14 12:41:45,760 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.96 vs. limit=15.0 2023-10-14 12:41:54,013 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1703618.0, ans=0.0 2023-10-14 12:42:14,431 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.28 vs. limit=15.0 2023-10-14 12:42:24,518 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.16 vs. limit=15.0 2023-10-14 12:42:26,214 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1703711.3333333333, ans=0.0 2023-10-14 12:42:36,783 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.67 vs. limit=15.0 2023-10-14 12:42:37,542 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.74 vs. limit=15.0 2023-10-14 12:42:48,687 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.63 vs. 
limit=22.5 2023-10-14 12:42:59,676 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 12:42:59,690 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 12:43:11,943 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1703898.0, ans=0.125 2023-10-14 12:43:18,770 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1703898.0, ans=0.2 2023-10-14 12:43:24,549 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.534e+02 1.864e+02 2.090e+02 2.266e+02 2.870e+02, threshold=4.181e+02, percent-clipped=0.0 2023-10-14 12:43:29,671 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1703944.6666666667, ans=0.0 2023-10-14 12:43:47,532 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1704038.0, ans=0.125 2023-10-14 12:43:58,621 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1704084.6666666667, ans=0.125 2023-10-14 12:44:02,176 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1704084.6666666667, ans=0.125 2023-10-14 12:44:16,458 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1704131.3333333333, ans=0.1 2023-10-14 12:44:24,462 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1704178.0, ans=0.125 2023-10-14 12:44:41,669 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1704224.6666666667, ans=0.1 2023-10-14 12:44:42,679 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1704271.3333333333, ans=0.125 2023-10-14 12:44:44,179 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=11.86 vs. limit=22.5 2023-10-14 12:45:21,115 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.56 vs. limit=6.0 2023-10-14 12:45:25,145 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.558e+02 1.816e+02 2.032e+02 2.324e+02 3.198e+02, threshold=4.064e+02, percent-clipped=0.0 2023-10-14 12:45:38,171 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1704458.0, ans=0.125 2023-10-14 12:45:39,872 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1704458.0, ans=0.1 2023-10-14 12:45:56,557 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.00 vs. 
limit=15.0 2023-10-14 12:46:32,056 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1704644.6666666667, ans=0.0 2023-10-14 12:47:22,541 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1704831.3333333333, ans=0.1 2023-10-14 12:47:31,108 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.68 vs. limit=12.0 2023-10-14 12:47:33,531 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.567e+02 1.857e+02 2.010e+02 2.194e+02 3.364e+02, threshold=4.019e+02, percent-clipped=0.0 2023-10-14 12:48:15,952 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1705018.0, ans=0.04949747468305833 2023-10-14 12:48:21,474 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.98 vs. limit=22.5 2023-10-14 12:48:32,571 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1705111.3333333333, ans=0.1 2023-10-14 12:49:11,482 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=4.69 vs. limit=15.0 2023-10-14 12:49:12,222 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1705251.3333333333, ans=0.125 2023-10-14 12:49:14,357 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 12:49:16,241 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1705251.3333333333, ans=0.0 2023-10-14 12:49:29,952 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.16 vs. limit=15.0 2023-10-14 12:49:35,922 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1705298.0, ans=0.0 2023-10-14 12:49:37,458 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1705344.6666666667, ans=0.125 2023-10-14 12:49:37,581 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1705344.6666666667, ans=0.125 2023-10-14 12:49:38,624 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1705344.6666666667, ans=0.125 2023-10-14 12:49:42,253 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.507e+02 1.761e+02 1.907e+02 2.084e+02 2.890e+02, threshold=3.813e+02, percent-clipped=0.0 2023-10-14 12:49:44,038 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.75 vs. 
limit=12.0 2023-10-14 12:50:17,005 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1705484.6666666667, ans=0.2 2023-10-14 12:50:18,849 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1705484.6666666667, ans=0.125 2023-10-14 12:50:27,806 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1705484.6666666667, ans=0.0 2023-10-14 12:50:33,245 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1705531.3333333333, ans=0.1 2023-10-14 12:50:36,510 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1705531.3333333333, ans=0.0 2023-10-14 12:50:43,823 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1705578.0, ans=0.125 2023-10-14 12:51:07,412 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1705624.6666666667, ans=0.2 2023-10-14 12:51:12,164 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1705671.3333333333, ans=0.2 2023-10-14 12:51:37,170 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1705764.6666666667, ans=0.125 2023-10-14 12:51:39,958 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1705764.6666666667, ans=0.125 2023-10-14 12:51:41,598 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1705764.6666666667, ans=0.0 2023-10-14 12:51:41,636 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer_ff3.min_abs, batch_count=1705764.6666666667, ans=0.2 2023-10-14 12:51:43,843 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1705811.3333333333, ans=0.1 2023-10-14 12:51:44,413 INFO [train.py:1031] (3/4) Epoch 27, batch 10500, loss[loss=0.1986, simple_loss=0.289, pruned_loss=0.05407, over 16609.00 frames. ], tot_loss[loss=0.1852, simple_loss=0.278, pruned_loss=0.04624, over 32653969.74 frames. 
], batch size: 241, lr: 1.26e-03, grad_scale: 32.0 2023-10-14 12:51:45,720 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1705811.3333333333, ans=0.2 2023-10-14 12:51:47,168 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.559e+02 1.918e+02 2.127e+02 2.391e+02 3.501e+02, threshold=4.254e+02, percent-clipped=0.0 2023-10-14 12:52:11,679 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 12:52:12,673 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 12:52:21,411 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1705951.3333333333, ans=0.125 2023-10-14 12:52:46,287 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1706044.6666666667, ans=0.125 2023-10-14 12:52:47,399 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1706044.6666666667, ans=0.0 2023-10-14 12:52:48,299 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1706044.6666666667, ans=0.125 2023-10-14 12:52:54,128 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1706091.3333333333, ans=0.125 2023-10-14 12:53:16,000 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1706138.0, ans=0.0 2023-10-14 12:53:22,078 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1706138.0, ans=0.125 2023-10-14 12:53:29,809 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1706184.6666666667, ans=0.0 2023-10-14 12:53:33,324 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.69 vs. limit=10.0 2023-10-14 12:53:36,288 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1706231.3333333333, ans=0.1 2023-10-14 12:53:42,827 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1706231.3333333333, ans=0.125 2023-10-14 12:53:46,840 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1706231.3333333333, ans=0.125 2023-10-14 12:53:54,116 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.581e+02 1.841e+02 1.993e+02 2.231e+02 3.025e+02, threshold=3.987e+02, percent-clipped=0.0 2023-10-14 12:54:11,163 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1706324.6666666667, ans=0.1 2023-10-14 12:54:18,755 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1706371.3333333333, ans=0.125 2023-10-14 12:54:18,959 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.02 vs. 
limit=22.5 2023-10-14 12:54:21,552 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1706371.3333333333, ans=0.0 2023-10-14 12:54:22,675 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1706371.3333333333, ans=0.2 2023-10-14 12:54:23,704 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1706371.3333333333, ans=0.125 2023-10-14 12:54:45,403 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1706464.6666666667, ans=0.125 2023-10-14 12:54:54,219 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1706511.3333333333, ans=0.125 2023-10-14 12:54:54,483 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.28 vs. limit=15.0 2023-10-14 12:55:21,386 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1706604.6666666667, ans=0.125 2023-10-14 12:55:34,791 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1706651.3333333333, ans=0.125 2023-10-14 12:55:57,809 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1706744.6666666667, ans=0.05 2023-10-14 12:56:03,300 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.579e+02 1.840e+02 1.999e+02 2.244e+02 3.326e+02, threshold=3.999e+02, percent-clipped=0.0 2023-10-14 12:56:07,066 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.62 vs. limit=15.0 2023-10-14 12:56:34,321 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1706884.6666666667, ans=0.2 2023-10-14 12:57:09,213 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1707024.6666666667, ans=0.0 2023-10-14 12:57:31,971 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1707071.3333333333, ans=0.0 2023-10-14 12:58:09,118 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.565e+02 1.960e+02 2.216e+02 2.530e+02 3.662e+02, threshold=4.433e+02, percent-clipped=0.0 2023-10-14 12:58:16,384 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.81 vs. 
limit=22.5 2023-10-14 12:58:36,135 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1707304.6666666667, ans=0.125 2023-10-14 12:58:54,620 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1707398.0, ans=0.125 2023-10-14 12:58:54,778 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1707398.0, ans=0.0 2023-10-14 12:59:03,923 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1707444.6666666667, ans=0.0 2023-10-14 12:59:10,810 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1707444.6666666667, ans=0.0 2023-10-14 12:59:17,159 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1707491.3333333333, ans=0.1 2023-10-14 13:00:06,776 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1707678.0, ans=0.1 2023-10-14 13:00:07,971 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.36 vs. limit=10.0 2023-10-14 13:00:09,349 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.486e+02 1.765e+02 1.913e+02 2.138e+02 2.974e+02, threshold=3.827e+02, percent-clipped=0.0 2023-10-14 13:00:19,255 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1707724.6666666667, ans=0.0 2023-10-14 13:00:19,320 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1707724.6666666667, ans=0.1 2023-10-14 13:00:21,292 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1707724.6666666667, ans=0.125 2023-10-14 13:00:31,631 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1707771.3333333333, ans=0.125 2023-10-14 13:00:49,494 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1707864.6666666667, ans=0.2 2023-10-14 13:01:09,592 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.98 vs. limit=15.0 2023-10-14 13:01:32,387 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1708004.6666666667, ans=0.0 2023-10-14 13:01:40,159 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn2.whiten.whitening_limit, batch_count=1708051.3333333333, ans=22.5 2023-10-14 13:02:04,497 INFO [train.py:1031] (3/4) Epoch 27, batch 11000, loss[loss=0.1941, simple_loss=0.287, pruned_loss=0.05065, over 16914.00 frames. ], tot_loss[loss=0.1852, simple_loss=0.278, pruned_loss=0.04624, over 32707352.70 frames. 
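Each train.py progress entry above pairs the current batch's loss over its own frames with a tot_loss over roughly 32.7 million frames (the "over 32707352.70 frames" figure just above): a frame-weighted running aggregate, so long utterances influence the average in proportion to their duration. The sketch below illustrates one plausible way to keep such a statistic; the class name and the decay constant are assumptions for illustration, not values taken from train.py.

class RunningLoss:
    # Frame-weighted, exponentially decayed loss aggregate (a sketch).
    def __init__(self, decay: float = 0.999):
        self.decay = decay
        self.loss_sum = 0.0   # decayed sum of loss * frames
        self.frames = 0.0     # decayed sum of frames

    def update(self, loss: float, num_frames: float) -> None:
        # Decay the old statistics, then fold in the new batch,
        # weighting its loss by the number of frames it covers.
        self.loss_sum = self.decay * self.loss_sum + loss * num_frames
        self.frames = self.decay * self.frames + num_frames

    @property
    def value(self) -> float:
        return self.loss_sum / max(self.frames, 1.0)

With batches of roughly 16k frames, a decay of this order keeps the effective window in the tens of millions of frames, the same order as the frame counts logged above.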
], batch size: 130, lr: 1.26e-03, grad_scale: 16.0 2023-10-14 13:02:05,775 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1708144.6666666667, ans=0.2 2023-10-14 13:02:10,310 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.607e+02 1.880e+02 1.989e+02 2.187e+02 3.195e+02, threshold=3.977e+02, percent-clipped=0.0 2023-10-14 13:02:11,023 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.75 vs. limit=15.0 2023-10-14 13:02:16,299 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1708191.3333333333, ans=0.0 2023-10-14 13:02:25,841 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.73 vs. limit=10.0 2023-10-14 13:02:28,362 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1708238.0, ans=0.125 2023-10-14 13:02:43,468 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1708284.6666666667, ans=0.125 2023-10-14 13:02:48,678 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1708284.6666666667, ans=0.125 2023-10-14 13:02:56,472 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1708331.3333333333, ans=0.0 2023-10-14 13:03:29,720 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.17 vs. limit=15.0 2023-10-14 13:03:53,506 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1708518.0, ans=0.125 2023-10-14 13:03:55,244 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1708564.6666666667, ans=0.0 2023-10-14 13:04:18,420 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.573e+02 1.899e+02 2.032e+02 2.234e+02 3.272e+02, threshold=4.064e+02, percent-clipped=0.0 2023-10-14 13:04:29,526 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1708658.0, ans=0.125 2023-10-14 13:04:38,876 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.79 vs. limit=15.0 2023-10-14 13:05:08,164 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1708798.0, ans=0.0 2023-10-14 13:05:09,677 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.35 vs. limit=15.0 2023-10-14 13:05:11,485 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.97 vs. 
limit=15.0 2023-10-14 13:05:12,845 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1708798.0, ans=0.2 2023-10-14 13:05:15,773 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1708798.0, ans=0.125 2023-10-14 13:05:15,786 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 13:05:17,887 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1708798.0, ans=0.1 2023-10-14 13:05:18,347 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=10.35 vs. limit=15.0 2023-10-14 13:05:25,970 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1708844.6666666667, ans=0.0 2023-10-14 13:05:32,762 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1708844.6666666667, ans=0.125 2023-10-14 13:05:37,939 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1708891.3333333333, ans=0.125 2023-10-14 13:05:39,005 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1708891.3333333333, ans=0.0 2023-10-14 13:05:49,426 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1708938.0, ans=0.1 2023-10-14 13:05:56,526 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1708938.0, ans=0.125 2023-10-14 13:05:59,648 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=9.66 vs. limit=22.5 2023-10-14 13:06:05,887 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.98 vs. limit=15.0 2023-10-14 13:06:17,391 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.60 vs. limit=15.0 2023-10-14 13:06:29,173 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1709078.0, ans=0.125 2023-10-14 13:06:32,389 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1709078.0, ans=0.125 2023-10-14 13:06:35,948 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.431e+02 1.805e+02 1.949e+02 2.191e+02 3.007e+02, threshold=3.897e+02, percent-clipped=0.0 2023-10-14 13:06:42,409 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.85 vs. limit=15.0 2023-10-14 13:06:51,447 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=15.73 vs. limit=22.5 2023-10-14 13:06:56,169 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.17 vs. 
limit=15.0 2023-10-14 13:07:11,621 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1709218.0, ans=0.125 2023-10-14 13:07:37,311 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1709311.3333333333, ans=0.125 2023-10-14 13:07:37,395 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_na.min_abs, batch_count=1709311.3333333333, ans=0.02 2023-10-14 13:07:41,810 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1709311.3333333333, ans=0.0 2023-10-14 13:07:58,444 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1709404.6666666667, ans=0.125 2023-10-14 13:08:03,092 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1709404.6666666667, ans=0.125 2023-10-14 13:08:27,218 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1709498.0, ans=0.125 2023-10-14 13:08:36,030 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.84 vs. limit=15.0 2023-10-14 13:08:45,587 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.464e+02 1.784e+02 1.933e+02 2.117e+02 2.895e+02, threshold=3.865e+02, percent-clipped=0.0 2023-10-14 13:08:59,435 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1709591.3333333333, ans=0.07 2023-10-14 13:09:08,355 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1709638.0, ans=0.0 2023-10-14 13:09:10,116 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1709638.0, ans=0.0 2023-10-14 13:09:10,130 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1709638.0, ans=0.0 2023-10-14 13:09:14,954 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1709638.0, ans=0.0 2023-10-14 13:09:35,473 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1709731.3333333333, ans=0.2 2023-10-14 13:09:45,314 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.61 vs. limit=15.0 2023-10-14 13:09:47,628 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.50 vs. limit=15.0 2023-10-14 13:10:11,184 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1709824.6666666667, ans=0.5 2023-10-14 13:10:28,658 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1709871.3333333333, ans=0.2 2023-10-14 13:10:39,968 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.50 vs. 
limit=15.0 2023-10-14 13:10:47,335 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1709964.6666666667, ans=0.125 2023-10-14 13:11:07,233 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.586e+02 1.811e+02 1.931e+02 2.184e+02 3.157e+02, threshold=3.863e+02, percent-clipped=0.0 2023-10-14 13:11:29,921 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer_ff2.min_abs, batch_count=1710104.6666666667, ans=0.1 2023-10-14 13:12:24,352 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1710291.3333333333, ans=0.0 2023-10-14 13:12:36,411 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1710338.0, ans=0.0 2023-10-14 13:12:46,706 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1710384.6666666667, ans=0.2 2023-10-14 13:13:01,992 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1710431.3333333333, ans=0.0 2023-10-14 13:13:11,164 INFO [train.py:1031] (3/4) Epoch 27, batch 11500, loss[loss=0.2005, simple_loss=0.2971, pruned_loss=0.05195, over 16548.00 frames. ], tot_loss[loss=0.1852, simple_loss=0.2777, pruned_loss=0.04631, over 32690528.32 frames. ], batch size: 266, lr: 1.26e-03, grad_scale: 32.0 2023-10-14 13:13:17,225 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.540e+02 1.912e+02 2.093e+02 2.248e+02 3.057e+02, threshold=4.185e+02, percent-clipped=0.0 2023-10-14 13:13:45,235 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1710571.3333333333, ans=0.09899494936611666 2023-10-14 13:13:46,147 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1710618.0, ans=0.035 2023-10-14 13:14:02,466 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.65 vs. limit=6.0 2023-10-14 13:14:28,056 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.64 vs. 
limit=15.0 2023-10-14 13:14:38,370 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1710758.0, ans=0.125 2023-10-14 13:14:41,166 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1710804.6666666667, ans=0.2 2023-10-14 13:14:51,387 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1710804.6666666667, ans=0.1 2023-10-14 13:15:09,470 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1710851.3333333333, ans=0.0 2023-10-14 13:15:34,272 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1710944.6666666667, ans=0.0 2023-10-14 13:15:38,048 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.569e+02 1.871e+02 2.003e+02 2.202e+02 2.942e+02, threshold=4.006e+02, percent-clipped=0.0 2023-10-14 13:15:50,192 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1710991.3333333333, ans=0.0 2023-10-14 13:16:08,250 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1711038.0, ans=0.125 2023-10-14 13:16:20,347 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1711084.6666666667, ans=0.125 2023-10-14 13:16:20,516 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.10 vs. limit=15.0 2023-10-14 13:16:33,722 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1711084.6666666667, ans=0.1 2023-10-14 13:16:43,858 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1711131.3333333333, ans=0.1 2023-10-14 13:16:53,051 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.94 vs. 
limit=12.0 2023-10-14 13:17:04,430 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1711178.0, ans=0.125 2023-10-14 13:17:16,897 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1711224.6666666667, ans=0.1 2023-10-14 13:17:23,754 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1711271.3333333333, ans=0.125 2023-10-14 13:17:33,304 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1711318.0, ans=0.0 2023-10-14 13:17:34,510 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1711318.0, ans=0.125 2023-10-14 13:17:47,841 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1711364.6666666667, ans=0.0 2023-10-14 13:18:07,616 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.470e+02 1.769e+02 1.935e+02 2.107e+02 2.518e+02, threshold=3.869e+02, percent-clipped=0.0 2023-10-14 13:18:16,547 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1711458.0, ans=0.1 2023-10-14 13:18:16,965 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.35 vs. limit=15.0 2023-10-14 13:18:39,565 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=1711551.3333333333, ans=0.5 2023-10-14 13:18:46,444 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1711551.3333333333, ans=0.125 2023-10-14 13:18:55,582 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1711598.0, ans=0.0 2023-10-14 13:19:07,146 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1711598.0, ans=0.0 2023-10-14 13:19:56,913 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1711738.0, ans=0.125 2023-10-14 13:19:58,734 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1711784.6666666667, ans=0.125 2023-10-14 13:20:28,024 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1711831.3333333333, ans=0.0 2023-10-14 13:20:41,674 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.529e+02 1.835e+02 2.034e+02 2.337e+02 3.433e+02, threshold=4.069e+02, percent-clipped=0.0 2023-10-14 13:20:47,692 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1711878.0, ans=0.0 2023-10-14 13:21:04,482 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1711924.6666666667, ans=0.125 2023-10-14 13:21:11,104 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1711971.3333333333, ans=0.5 2023-10-14 13:21:34,185 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1712018.0, ans=0.125 2023-10-14 13:21:38,052 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1712018.0, ans=0.0 2023-10-14 13:21:53,410 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1712111.3333333333, ans=0.125 2023-10-14 13:22:09,245 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.79 vs. limit=10.0 2023-10-14 13:22:42,725 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1712251.3333333333, ans=0.125 2023-10-14 13:23:03,728 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1712298.0, ans=0.025 2023-10-14 13:23:20,332 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.33 vs. limit=22.5 2023-10-14 13:23:21,229 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.508e+02 1.898e+02 2.062e+02 2.343e+02 3.588e+02, threshold=4.124e+02, percent-clipped=0.0 2023-10-14 13:23:30,203 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.91 vs. limit=10.0 2023-10-14 13:23:31,847 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1712391.3333333333, ans=0.0 2023-10-14 13:23:45,207 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1712438.0, ans=0.0 2023-10-14 13:23:54,848 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.79 vs. limit=10.0 2023-10-14 13:24:23,326 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1712578.0, ans=0.125 2023-10-14 13:24:26,113 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1712578.0, ans=0.0 2023-10-14 13:24:35,756 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.04 vs. limit=22.5 2023-10-14 13:25:16,066 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1712764.6666666667, ans=0.125 2023-10-14 13:25:24,932 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1712764.6666666667, ans=0.04949747468305833 2023-10-14 13:25:26,188 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.54 vs. limit=10.0 2023-10-14 13:25:28,071 INFO [train.py:1031] (3/4) Epoch 27, batch 12000, loss[loss=0.1832, simple_loss=0.2801, pruned_loss=0.04314, over 16922.00 frames. ], tot_loss[loss=0.185, simple_loss=0.2778, pruned_loss=0.04607, over 32719523.96 frames. 
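The optim.py entries ("Clipping_scale=2.0, grad-norm quartiles ... threshold=... percent-clipped=...") describe adaptive gradient clipping: the optimizer tracks recent gradient norms, logs their min/25%/median/75%/max, and derives the clipping threshold from the clipping scale times a running median rather than from a fixed constant. A minimal sketch of that scheme follows; it is not the actual optim.py code, and the class name and window size are invented for illustration.

import torch
from collections import deque

class QuartileGradClipper:
    # Sketch of median-based adaptive clipping (assumed mechanism).
    def __init__(self, clipping_scale: float = 2.0, window: int = 128):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=window)  # recent total gradient norms

    def __call__(self, parameters) -> float:
        params = [p for p in parameters if p.grad is not None]
        norm = torch.norm(
            torch.stack([p.grad.detach().norm() for p in params])
        ).item()
        self.norms.append(norm)
        # min, 25%, median, 75%, max -- the five "grad-norm quartiles"
        q = torch.quantile(
            torch.tensor(list(self.norms)),
            torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]),
        )
        threshold = self.clipping_scale * q[2].item()
        if norm > threshold:
            for p in params:  # rescale gradients in place
                p.grad.mul_(threshold / norm)
        return threshold

The logged thresholds in this stretch are consistently 2.0 times the logged median (e.g. threshold=4.124e+02 against a median of 2.062e+02), and percent-clipped=0.0 means no batch here actually exceeded the threshold.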
], batch size: 110, lr: 1.26e-03, grad_scale: 32.0 2023-10-14 13:25:36,351 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.439e+02 1.793e+02 1.927e+02 2.044e+02 3.032e+02, threshold=3.855e+02, percent-clipped=0.0 2023-10-14 13:25:42,446 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1712858.0, ans=0.0 2023-10-14 13:25:42,463 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1712858.0, ans=0.125 2023-10-14 13:25:46,416 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1712858.0, ans=0.1 2023-10-14 13:25:47,401 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1712858.0, ans=0.05 2023-10-14 13:25:49,384 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1712858.0, ans=0.125 2023-10-14 13:26:02,164 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.86 vs. limit=15.0 2023-10-14 13:26:08,873 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.25 vs. limit=15.0 2023-10-14 13:26:14,836 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1712951.3333333333, ans=0.1 2023-10-14 13:26:56,102 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.11 vs. limit=15.0 2023-10-14 13:27:00,499 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.35 vs. limit=15.0 2023-10-14 13:27:08,808 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.49 vs. limit=10.0 2023-10-14 13:27:26,569 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1713184.6666666667, ans=0.125 2023-10-14 13:27:31,962 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1713231.3333333333, ans=0.04949747468305833 2023-10-14 13:27:54,215 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.437e+02 1.698e+02 1.900e+02 2.173e+02 2.726e+02, threshold=3.801e+02, percent-clipped=0.0 2023-10-14 13:27:57,240 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1713324.6666666667, ans=0.1 2023-10-14 13:28:20,178 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1713371.3333333333, ans=0.125 2023-10-14 13:28:23,329 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.29 vs. 
limit=6.0 2023-10-14 13:28:26,376 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1713418.0, ans=0.0 2023-10-14 13:28:38,820 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.83 vs. limit=22.5 2023-10-14 13:28:48,993 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=1713464.6666666667, ans=10.0 2023-10-14 13:29:13,528 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1713558.0, ans=0.1 2023-10-14 13:29:15,439 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=3.44 vs. limit=15.0 2023-10-14 13:29:28,490 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1713604.6666666667, ans=0.0 2023-10-14 13:29:43,789 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1713651.3333333333, ans=0.125 2023-10-14 13:29:46,617 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1713698.0, ans=0.1 2023-10-14 13:29:54,186 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1713698.0, ans=0.015 2023-10-14 13:29:58,779 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1713744.6666666667, ans=0.125 2023-10-14 13:29:59,860 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1713744.6666666667, ans=0.125 2023-10-14 13:29:59,900 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1713744.6666666667, ans=0.0 2023-10-14 13:30:00,908 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1713744.6666666667, ans=0.0 2023-10-14 13:30:05,813 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.507e+02 1.852e+02 1.978e+02 2.177e+02 3.141e+02, threshold=3.955e+02, percent-clipped=0.0 2023-10-14 13:30:15,700 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1713791.3333333333, ans=0.125 2023-10-14 13:30:35,399 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1713884.6666666667, ans=0.125 2023-10-14 13:30:51,815 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1713931.3333333333, ans=0.125 2023-10-14 13:30:51,858 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1713931.3333333333, ans=0.125 2023-10-14 13:30:51,885 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1713931.3333333333, ans=0.0 2023-10-14 13:30:54,965 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1713931.3333333333, ans=0.1 2023-10-14 
13:30:57,714 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1713931.3333333333, ans=0.2 2023-10-14 13:31:17,879 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1714024.6666666667, ans=0.2 2023-10-14 13:31:38,678 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 13:31:58,725 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1714164.6666666667, ans=0.0 2023-10-14 13:32:12,999 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.526e+02 1.924e+02 2.192e+02 2.512e+02 3.029e+02, threshold=4.384e+02, percent-clipped=0.0 2023-10-14 13:32:17,217 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.54 vs. limit=22.5 2023-10-14 13:32:17,933 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1714258.0, ans=0.125 2023-10-14 13:32:26,050 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1714258.0, ans=0.0 2023-10-14 13:32:40,311 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1714351.3333333333, ans=0.125 2023-10-14 13:32:52,163 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1714398.0, ans=0.1 2023-10-14 13:33:41,619 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1714538.0, ans=0.125 2023-10-14 13:33:42,018 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.90 vs. limit=12.0 2023-10-14 13:34:32,097 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.498e+02 1.806e+02 1.954e+02 2.145e+02 2.735e+02, threshold=3.908e+02, percent-clipped=0.0 2023-10-14 13:34:34,807 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.65 vs. 
limit=15.0 2023-10-14 13:34:42,771 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1714724.6666666667, ans=0.125 2023-10-14 13:34:52,360 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1714771.3333333333, ans=0.125 2023-10-14 13:35:04,529 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1714818.0, ans=0.125 2023-10-14 13:35:14,992 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1714864.6666666667, ans=0.125 2023-10-14 13:35:18,732 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1714864.6666666667, ans=0.125 2023-10-14 13:35:37,391 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1714911.3333333333, ans=0.125 2023-10-14 13:35:42,283 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1714911.3333333333, ans=0.0 2023-10-14 13:35:44,355 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1714911.3333333333, ans=0.0 2023-10-14 13:35:44,502 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1714911.3333333333, ans=0.1 2023-10-14 13:35:53,488 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1714958.0, ans=0.125 2023-10-14 13:36:12,967 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1715004.6666666667, ans=0.1 2023-10-14 13:36:14,226 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1715051.3333333333, ans=0.125 2023-10-14 13:36:18,590 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=4.73 vs. limit=15.0 2023-10-14 13:36:36,795 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1715098.0, ans=0.0 2023-10-14 13:36:45,443 INFO [train.py:1031] (3/4) Epoch 27, batch 12500, loss[loss=0.1751, simple_loss=0.28, pruned_loss=0.03509, over 16814.00 frames. ], tot_loss[loss=0.1849, simple_loss=0.2776, pruned_loss=0.04608, over 32740829.32 frames. 
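The bulk of the scaling.py entries are ScheduledFloat samples: each regularization knob (dropout_p, the *_skip_rate values, balancer probs, bypass scale_min, whitening limits) is a float whose value depends on batch_count, and the log records the current value as "ans". A minimal sketch of a piecewise-linear schedule over batch count is below; the breakpoints in the example are invented, chosen only so the settled value matches the ans=0.1 logged for the ...feed_forward1.out_proj.dropout_p entries above.

class ScheduledFloatSketch:
    # A float whose value is a piecewise-linear function of batch count;
    # a sketch of the idea, not scaling.py's actual ScheduledFloat class.
    def __init__(self, *points):
        self.points = sorted(points)  # (batch_count, value) breakpoints

    def value(self, batch_count: float) -> float:
        pts = self.points
        if batch_count <= pts[0][0]:
            return pts[0][1]
        if batch_count >= pts[-1][0]:
            return pts[-1][1]
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if x0 <= batch_count <= x1:
                # linear interpolation between neighbouring breakpoints
                t = (batch_count - x0) / (x1 - x0)
                return y0 + t * (y1 - y0)

# Invented breakpoints: a dropout decaying from 0.3 to 0.1 over the first
# 20k batches has long since settled at 0.1 by the batch counts logged here.
dropout_p = ScheduledFloatSketch((0.0, 0.3), (20000.0, 0.1))
assert dropout_p.value(1706324.6666666667) == 0.1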
], batch size: 98, lr: 1.26e-03, grad_scale: 32.0 2023-10-14 13:36:56,162 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.485e+02 1.832e+02 2.046e+02 2.257e+02 2.755e+02, threshold=4.091e+02, percent-clipped=0.0 2023-10-14 13:37:23,968 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1715238.0, ans=0.2 2023-10-14 13:37:25,687 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1715284.6666666667, ans=0.125 2023-10-14 13:37:33,534 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1715284.6666666667, ans=0.2 2023-10-14 13:37:45,751 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.16 vs. limit=15.0 2023-10-14 13:37:56,667 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1715378.0, ans=0.125 2023-10-14 13:38:28,436 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1715471.3333333333, ans=0.0 2023-10-14 13:38:28,478 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1715471.3333333333, ans=0.1 2023-10-14 13:38:50,453 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1715518.0, ans=0.0 2023-10-14 13:39:07,014 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.54 vs. limit=15.0 2023-10-14 13:39:08,882 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.47 vs. limit=15.0 2023-10-14 13:39:12,495 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.49 vs. 
limit=22.5 2023-10-14 13:39:16,483 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.610e+02 1.834e+02 2.092e+02 2.359e+02 3.771e+02, threshold=4.183e+02, percent-clipped=0.0 2023-10-14 13:39:19,214 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1715658.0, ans=0.015 2023-10-14 13:39:21,603 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1715658.0, ans=0.125 2023-10-14 13:39:46,162 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1715704.6666666667, ans=0.125 2023-10-14 13:40:16,467 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=1715844.6666666667, ans=22.5 2023-10-14 13:40:20,628 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1715844.6666666667, ans=0.07 2023-10-14 13:40:25,592 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1715844.6666666667, ans=0.1 2023-10-14 13:40:40,841 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1715891.3333333333, ans=0.125 2023-10-14 13:41:04,462 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1715984.6666666667, ans=0.0 2023-10-14 13:41:13,047 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.83 vs. limit=10.0 2023-10-14 13:41:18,349 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1716078.0, ans=0.125 2023-10-14 13:41:21,935 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.37 vs. limit=22.5 2023-10-14 13:41:28,259 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.522e+02 1.817e+02 1.972e+02 2.139e+02 2.933e+02, threshold=3.944e+02, percent-clipped=0.0 2023-10-14 13:41:57,440 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1716171.3333333333, ans=0.0 2023-10-14 13:42:15,895 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1716264.6666666667, ans=0.125 2023-10-14 13:42:17,628 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1716264.6666666667, ans=0.035 2023-10-14 13:42:31,724 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1716311.3333333333, ans=0.125 2023-10-14 13:42:40,908 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.69 vs. 
limit=10.0 2023-10-14 13:42:44,600 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1716358.0, ans=0.125 2023-10-14 13:42:50,760 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1716404.6666666667, ans=0.125 2023-10-14 13:42:57,780 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1716404.6666666667, ans=0.125 2023-10-14 13:43:33,819 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.586e+02 1.932e+02 2.068e+02 2.298e+02 2.933e+02, threshold=4.136e+02, percent-clipped=0.0 2023-10-14 13:43:56,781 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1716638.0, ans=0.2 2023-10-14 13:44:21,912 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1716778.0, ans=0.125 2023-10-14 13:44:23,184 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1716778.0, ans=0.125 2023-10-14 13:44:36,662 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.67 vs. limit=15.0 2023-10-14 13:44:41,018 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1716824.6666666667, ans=0.125 2023-10-14 13:44:53,831 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1716871.3333333333, ans=0.2 2023-10-14 13:44:56,376 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1716871.3333333333, ans=0.125 2023-10-14 13:45:03,768 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1716918.0, ans=0.0 2023-10-14 13:45:04,653 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1716918.0, ans=0.125 2023-10-14 13:45:25,530 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1717011.3333333333, ans=0.125 2023-10-14 13:45:27,146 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1717011.3333333333, ans=0.125 2023-10-14 13:45:29,889 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.450e+02 1.805e+02 1.907e+02 2.151e+02 2.759e+02, threshold=3.814e+02, percent-clipped=0.0 2023-10-14 13:45:54,925 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1717151.3333333333, ans=0.125 2023-10-14 13:45:55,064 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1717151.3333333333, ans=0.1 2023-10-14 13:46:01,299 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1717151.3333333333, ans=0.125 2023-10-14 13:46:04,325 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1717151.3333333333, ans=0.2 2023-10-14 13:46:08,161 
INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1717198.0, ans=0.0 2023-10-14 13:46:10,739 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.37 vs. limit=15.0 2023-10-14 13:46:15,722 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.03 vs. limit=22.5 2023-10-14 13:46:21,674 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1717244.6666666667, ans=0.125 2023-10-14 13:46:21,765 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1717244.6666666667, ans=0.04949747468305833 2023-10-14 13:46:27,116 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1717291.3333333333, ans=0.125 2023-10-14 13:46:31,132 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=14.56 vs. limit=22.5 2023-10-14 13:46:33,414 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 13:46:35,992 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1717291.3333333333, ans=0.125 2023-10-14 13:46:54,076 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1717384.6666666667, ans=0.125 2023-10-14 13:46:55,859 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1717384.6666666667, ans=0.125 2023-10-14 13:47:06,886 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1717431.3333333333, ans=0.2 2023-10-14 13:47:07,194 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=11.98 vs. limit=22.5 2023-10-14 13:47:15,303 INFO [train.py:1031] (3/4) Epoch 27, batch 13000, loss[loss=0.1833, simple_loss=0.2829, pruned_loss=0.04189, over 16613.00 frames. ], tot_loss[loss=0.1852, simple_loss=0.2781, pruned_loss=0.04612, over 32764882.14 frames. ], batch size: 219, lr: 1.26e-03, grad_scale: 16.0 2023-10-14 13:47:25,246 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.521e+02 1.888e+02 2.006e+02 2.250e+02 2.756e+02, threshold=4.011e+02, percent-clipped=0.0 2023-10-14 13:47:59,663 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1717618.0, ans=0.0 2023-10-14 13:48:12,307 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.61 vs. 
limit=15.0 2023-10-14 13:48:15,739 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1717664.6666666667, ans=0.1 2023-10-14 13:48:27,679 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1717711.3333333333, ans=0.1 2023-10-14 13:48:37,188 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.70 vs. limit=15.0 2023-10-14 13:48:46,773 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.77 vs. limit=10.0 2023-10-14 13:48:52,784 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1717804.6666666667, ans=0.1 2023-10-14 13:48:56,555 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1717804.6666666667, ans=0.0 2023-10-14 13:49:30,449 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_ff3.min_abs, batch_count=1717944.6666666667, ans=0.2 2023-10-14 13:49:31,429 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1717944.6666666667, ans=0.125 2023-10-14 13:49:35,411 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.43 vs. limit=22.5 2023-10-14 13:49:35,890 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.534e+02 1.817e+02 2.001e+02 2.337e+02 3.316e+02, threshold=4.002e+02, percent-clipped=0.0 2023-10-14 13:49:41,254 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1717991.3333333333, ans=0.125 2023-10-14 13:49:43,060 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1717991.3333333333, ans=0.0 2023-10-14 13:50:19,610 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1718131.3333333333, ans=0.1 2023-10-14 13:50:23,240 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.93 vs. limit=15.0 2023-10-14 13:50:25,159 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=10.14 vs. 
limit=22.5 2023-10-14 13:50:57,860 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1718271.3333333333, ans=0.1 2023-10-14 13:51:13,049 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1718318.0, ans=0.1 2023-10-14 13:51:16,918 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1718318.0, ans=0.1 2023-10-14 13:51:22,853 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1718364.6666666667, ans=0.0 2023-10-14 13:51:29,219 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1718364.6666666667, ans=0.125 2023-10-14 13:51:43,334 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.520e+02 1.793e+02 1.940e+02 2.098e+02 2.749e+02, threshold=3.880e+02, percent-clipped=0.0 2023-10-14 13:52:08,902 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1718551.3333333333, ans=0.125 2023-10-14 13:52:14,942 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.72 vs. limit=15.0 2023-10-14 13:52:21,180 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1718598.0, ans=0.0 2023-10-14 13:52:42,568 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 13:52:58,644 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1718738.0, ans=0.0 2023-10-14 13:53:01,013 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.20 vs. limit=15.0 2023-10-14 13:53:21,611 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1718831.3333333333, ans=0.125 2023-10-14 13:53:27,363 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1718831.3333333333, ans=0.125 2023-10-14 13:53:29,436 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1718831.3333333333, ans=0.0 2023-10-14 13:53:41,121 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.513e+02 1.834e+02 1.964e+02 2.143e+02 6.427e+02, threshold=3.927e+02, percent-clipped=1.0 2023-10-14 13:54:18,432 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1719064.6666666667, ans=0.125 2023-10-14 13:54:23,178 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1719064.6666666667, ans=0.2 2023-10-14 13:54:31,355 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1719111.3333333333, ans=0.0 2023-10-14 13:54:49,201 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.26 vs. 
limit=15.0 2023-10-14 13:54:57,200 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1719204.6666666667, ans=0.125 2023-10-14 13:54:57,286 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1719204.6666666667, ans=0.125 2023-10-14 13:55:05,281 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1719251.3333333333, ans=0.2 2023-10-14 13:55:17,641 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1719251.3333333333, ans=0.125 2023-10-14 13:55:39,862 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1719344.6666666667, ans=0.125 2023-10-14 13:55:42,359 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.582e+02 1.868e+02 1.981e+02 2.153e+02 2.879e+02, threshold=3.961e+02, percent-clipped=0.0 2023-10-14 13:55:56,945 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1719438.0, ans=0.1 2023-10-14 13:55:57,905 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1719438.0, ans=0.2 2023-10-14 13:56:03,888 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1719438.0, ans=0.125 2023-10-14 13:56:05,503 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1719438.0, ans=0.125 2023-10-14 13:56:07,189 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1719484.6666666667, ans=0.0 2023-10-14 13:56:17,781 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1719531.3333333333, ans=0.0 2023-10-14 13:56:22,582 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn1.whiten.whitening_limit, batch_count=1719531.3333333333, ans=22.5 2023-10-14 13:56:35,899 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.78 vs. limit=15.0 2023-10-14 13:56:56,405 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=7.91 vs. limit=22.5 2023-10-14 13:56:57,791 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1719671.3333333333, ans=0.025 2023-10-14 13:57:11,577 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.99 vs. limit=15.0 2023-10-14 13:57:13,489 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.42 vs. limit=22.5 2023-10-14 13:57:19,736 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1719764.6666666667, ans=0.125 2023-10-14 13:57:26,547 INFO [train.py:1031] (3/4) Epoch 27, batch 13500, loss[loss=0.1915, simple_loss=0.2796, pruned_loss=0.05175, over 15839.00 frames. 
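The "Whitening: ... metric=X vs. limit=Y" entries compare a per-module whiteness statistic against its (scheduled) limit; a penalty engages only when the metric exceeds the limit. One statistic with the right behaviour, plausibly what "metric" denotes here, is the ratio of the mean squared eigenvalue of each group's feature covariance to its squared mean eigenvalue: it equals 1.0 for perfectly white features and grows as the covariance drifts from a multiple of the identity. The sketch below assumes that definition and is not the exact scaling.py code.

import torch

def whitening_metric(x: torch.Tensor, num_groups: int) -> torch.Tensor:
    # x: (num_frames, num_channels) activations from one module.
    # Returns a scalar >= 1.0: 1.0 when each group's covariance is a
    # multiple of the identity, larger the less "white" the features are.
    num_frames, num_channels = x.shape
    assert num_channels % num_groups == 0
    cpg = num_channels // num_groups  # channels per group
    x = x.reshape(num_frames, num_groups, cpg).transpose(0, 1)
    x = x - x.mean(dim=1, keepdim=True)            # center each group
    covar = x.transpose(1, 2) @ x / num_frames     # (groups, cpg, cpg)
    trace = covar.diagonal(dim1=1, dim2=2).sum(dim=1)
    trace_sq = (covar ** 2).sum(dim=(1, 2))        # trace(C @ C), C symmetric
    # mean squared eigenvalue / squared mean eigenvalue, averaged over groups
    return (cpg * trace_sq / (trace ** 2 + 1e-20)).mean()

On this reading, an entry such as "metric=4.26 vs. limit=15.0" just above is comfortably under its limit, so the whitening penalty contributes nothing there.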
], tot_loss[loss=0.185, simple_loss=0.2778, pruned_loss=0.04616, over 32789478.52 frames. ], batch size: 43, lr: 1.26e-03, grad_scale: 16.0 2023-10-14 13:57:32,580 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1719811.3333333333, ans=0.1 2023-10-14 13:57:36,991 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.399e+02 1.776e+02 1.926e+02 2.122e+02 2.807e+02, threshold=3.852e+02, percent-clipped=0.0 2023-10-14 13:57:47,328 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1719858.0, ans=0.125 2023-10-14 13:57:51,421 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.00 vs. limit=6.0 2023-10-14 13:57:55,528 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.86 vs. limit=12.0 2023-10-14 13:58:34,963 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1720044.6666666667, ans=0.125 2023-10-14 13:58:56,169 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.43 vs. limit=15.0 2023-10-14 13:59:02,682 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=1720184.6666666667, ans=0.07 2023-10-14 13:59:07,643 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1720184.6666666667, ans=0.125 2023-10-14 13:59:30,776 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1720278.0, ans=0.125 2023-10-14 13:59:34,797 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.610e+02 1.883e+02 2.023e+02 2.213e+02 2.839e+02, threshold=4.045e+02, percent-clipped=0.0 2023-10-14 13:59:35,196 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1720324.6666666667, ans=0.0 2023-10-14 13:59:45,621 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1720371.3333333333, ans=0.2 2023-10-14 13:59:47,551 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1720371.3333333333, ans=0.0 2023-10-14 13:59:51,907 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1720371.3333333333, ans=0.1 2023-10-14 13:59:53,712 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1720371.3333333333, ans=0.0 2023-10-14 14:00:00,037 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.03 vs. limit=22.5 2023-10-14 14:00:15,756 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1720464.6666666667, ans=0.125 2023-10-14 14:00:58,549 INFO [train.py:1031] (3/4) Epoch 28, batch 0, loss[loss=0.1644, simple_loss=0.2608, pruned_loss=0.03405, over 16821.00 frames. 
], tot_loss[loss=0.1644, simple_loss=0.2608, pruned_loss=0.03405, over 16821.00 frames. ], batch size: 175, lr: 1.23e-03, grad_scale: 32.0 2023-10-14 14:00:58,551 INFO [train.py:1054] (3/4) Computing validation loss 2023-10-14 14:01:08,798 INFO [train.py:1063] (3/4) Epoch 28, validation: loss=0.2128, simple_loss=0.2998, pruned_loss=0.06294, over 1020973.00 frames. 2023-10-14 14:01:08,799 INFO [train.py:1064] (3/4) Maximum memory allocated so far is 16953MB 2023-10-14 14:01:36,620 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1720628.0, ans=0.1 2023-10-14 14:01:44,898 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=1720674.6666666667, ans=0.07 2023-10-14 14:02:14,745 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.17 vs. limit=15.0 2023-10-14 14:02:17,931 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.495e+02 1.828e+02 2.010e+02 2.241e+02 3.487e+02, threshold=4.019e+02, percent-clipped=0.0 2023-10-14 14:02:49,150 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.76 vs. limit=10.0 2023-10-14 14:03:08,948 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1720954.6666666667, ans=0.125 2023-10-14 14:03:09,862 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1721001.3333333333, ans=0.125 2023-10-14 14:03:09,981 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1721001.3333333333, ans=0.0 2023-10-14 14:03:19,658 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 14:03:51,005 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1721141.3333333333, ans=0.125 2023-10-14 14:03:58,583 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.37 vs. limit=15.0 2023-10-14 14:03:59,928 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1721141.3333333333, ans=0.125 2023-10-14 14:04:04,236 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.69 vs. limit=15.0 2023-10-14 14:04:13,031 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1721234.6666666667, ans=0.0 2023-10-14 14:04:19,838 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.511e+02 1.794e+02 1.918e+02 2.187e+02 3.134e+02, threshold=3.836e+02, percent-clipped=0.0 2023-10-14 14:04:26,654 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 14:05:06,007 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.23 vs. 
limit=15.0 2023-10-14 14:05:20,241 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.81 vs. limit=15.0 2023-10-14 14:05:28,996 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1721514.6666666667, ans=0.1 2023-10-14 14:05:44,295 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.26 vs. limit=15.0 2023-10-14 14:05:48,660 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1721608.0, ans=0.0 2023-10-14 14:06:18,511 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.476e+02 1.826e+02 1.960e+02 2.196e+02 4.557e+02, threshold=3.920e+02, percent-clipped=1.0 2023-10-14 14:06:19,983 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1721701.3333333333, ans=0.125 2023-10-14 14:06:25,205 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.15 vs. limit=15.0 2023-10-14 14:06:25,250 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.96 vs. limit=12.0 2023-10-14 14:06:36,277 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1721794.6666666667, ans=0.1 2023-10-14 14:06:43,087 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1721794.6666666667, ans=0.2 2023-10-14 14:06:57,922 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1721888.0, ans=0.125 2023-10-14 14:06:57,978 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1721888.0, ans=0.125 2023-10-14 14:07:07,258 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.62 vs. limit=12.0 2023-10-14 14:07:08,983 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1721934.6666666667, ans=0.125 2023-10-14 14:07:09,966 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1721934.6666666667, ans=0.125 2023-10-14 14:07:28,034 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1721981.3333333333, ans=0.0 2023-10-14 14:07:57,157 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1722121.3333333333, ans=0.1 2023-10-14 14:08:09,373 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.43 vs. 
limit=15.0 2023-10-14 14:08:16,062 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.483e+02 1.796e+02 2.003e+02 2.190e+02 2.865e+02, threshold=4.006e+02, percent-clipped=0.0 2023-10-14 14:08:17,501 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1722168.0, ans=0.1 2023-10-14 14:08:18,842 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.85 vs. limit=15.0 2023-10-14 14:08:23,016 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1722214.6666666667, ans=0.1 2023-10-14 14:08:40,600 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1722261.3333333333, ans=0.125 2023-10-14 14:08:48,970 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.45 vs. limit=15.0 2023-10-14 14:09:01,660 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1722354.6666666667, ans=0.125 2023-10-14 14:09:09,261 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1722401.3333333333, ans=0.0 2023-10-14 14:09:19,106 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1722401.3333333333, ans=0.125 2023-10-14 14:09:26,844 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1722448.0, ans=0.1 2023-10-14 14:09:38,145 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 14:09:39,113 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1722494.6666666667, ans=0.2 2023-10-14 14:09:50,470 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.16 vs. limit=22.5 2023-10-14 14:09:51,074 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1722541.3333333333, ans=0.0 2023-10-14 14:09:52,029 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1722541.3333333333, ans=0.2 2023-10-14 14:09:52,063 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1722541.3333333333, ans=0.1 2023-10-14 14:09:57,674 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1722588.0, ans=0.125 2023-10-14 14:10:19,751 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.72 vs. 
limit=10.0 2023-10-14 14:10:19,888 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.546e+02 1.830e+02 2.016e+02 2.217e+02 2.906e+02, threshold=4.033e+02, percent-clipped=0.0 2023-10-14 14:10:36,618 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1722728.0, ans=0.0 2023-10-14 14:10:36,639 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1722728.0, ans=0.0 2023-10-14 14:10:53,316 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1722774.6666666667, ans=0.0 2023-10-14 14:10:58,212 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1722774.6666666667, ans=0.0 2023-10-14 14:11:04,014 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1722821.3333333333, ans=0.125 2023-10-14 14:11:08,563 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.51 vs. limit=15.0 2023-10-14 14:11:13,094 INFO [train.py:1031] (3/4) Epoch 28, batch 500, loss[loss=0.1669, simple_loss=0.2635, pruned_loss=0.03514, over 16811.00 frames. ], tot_loss[loss=0.1859, simple_loss=0.2785, pruned_loss=0.04668, over 7281638.23 frames. ], batch size: 98, lr: 1.23e-03, grad_scale: 32.0 2023-10-14 14:11:15,007 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_ff2.min_abs, batch_count=1722868.0, ans=0.1 2023-10-14 14:11:16,980 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1722868.0, ans=0.2 2023-10-14 14:11:38,312 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.79 vs. limit=15.0 2023-10-14 14:11:40,285 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.80 vs. limit=15.0 2023-10-14 14:11:41,127 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1722961.3333333333, ans=0.125 2023-10-14 14:12:17,857 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.526e+02 1.930e+02 2.120e+02 2.333e+02 3.235e+02, threshold=4.239e+02, percent-clipped=0.0 2023-10-14 14:12:20,657 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.27 vs. limit=15.0 2023-10-14 14:12:47,117 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1723241.3333333333, ans=0.125 2023-10-14 14:12:49,154 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.46 vs. limit=15.0 2023-10-14 14:13:01,972 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=1723288.0, ans=10.0 2023-10-14 14:13:09,387 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.79 vs. 
limit=15.0 2023-10-14 14:13:11,289 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1723334.6666666667, ans=0.125 2023-10-14 14:13:30,088 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1723381.3333333333, ans=0.125 2023-10-14 14:13:50,609 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.07 vs. limit=10.0 2023-10-14 14:13:53,847 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1723474.6666666667, ans=0.2 2023-10-14 14:14:03,728 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1723521.3333333333, ans=0.0 2023-10-14 14:14:10,302 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1723521.3333333333, ans=0.0 2023-10-14 14:14:18,751 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1723568.0, ans=0.125 2023-10-14 14:14:21,203 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.637e+02 1.926e+02 2.134e+02 2.289e+02 3.205e+02, threshold=4.268e+02, percent-clipped=0.0 2023-10-14 14:14:34,432 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1723614.6666666667, ans=0.125 2023-10-14 14:14:49,393 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.04 vs. 
limit=15.0 2023-10-14 14:14:52,724 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1723708.0, ans=0.125 2023-10-14 14:15:04,509 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1723754.6666666667, ans=0.125 2023-10-14 14:15:06,281 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=1723754.6666666667, ans=10.0 2023-10-14 14:15:23,538 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1723801.3333333333, ans=0.125 2023-10-14 14:15:33,263 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1723848.0, ans=0.0 2023-10-14 14:15:56,434 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1723941.3333333333, ans=0.125 2023-10-14 14:16:04,469 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1723988.0, ans=0.125 2023-10-14 14:16:14,854 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1723988.0, ans=0.07 2023-10-14 14:16:20,719 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1724034.6666666667, ans=0.125 2023-10-14 14:16:20,854 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1724034.6666666667, ans=0.0 2023-10-14 14:16:22,747 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1724034.6666666667, ans=0.2 2023-10-14 14:16:25,977 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.594e+02 1.883e+02 2.074e+02 2.300e+02 3.072e+02, threshold=4.149e+02, percent-clipped=0.0 2023-10-14 14:16:52,854 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1724128.0, ans=0.1 2023-10-14 14:16:54,870 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1724128.0, ans=0.125 2023-10-14 14:17:04,004 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1724174.6666666667, ans=0.0 2023-10-14 14:17:07,222 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1724221.3333333333, ans=0.05 2023-10-14 14:17:58,638 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1724361.3333333333, ans=0.125 2023-10-14 14:18:00,324 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1724408.0, ans=0.5 2023-10-14 14:18:17,216 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=1724454.6666666667, ans=6.0 2023-10-14 14:18:24,876 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.72 vs. 
limit=12.0 2023-10-14 14:18:36,038 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.569e+02 1.865e+02 1.999e+02 2.217e+02 3.500e+02, threshold=3.999e+02, percent-clipped=0.0 2023-10-14 14:18:56,675 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.76 vs. limit=15.0 2023-10-14 14:19:07,470 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1724641.3333333333, ans=0.0 2023-10-14 14:19:21,519 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1724688.0, ans=0.125 2023-10-14 14:19:33,188 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1724734.6666666667, ans=0.0 2023-10-14 14:19:38,346 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.37 vs. limit=15.0 2023-10-14 14:19:42,702 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1724781.3333333333, ans=0.125 2023-10-14 14:20:00,705 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.35 vs. limit=15.0 2023-10-14 14:20:09,337 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 14:20:10,886 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1724874.6666666667, ans=0.0 2023-10-14 14:20:24,254 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1724921.3333333333, ans=0.0 2023-10-14 14:20:34,285 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=10.67 vs. limit=22.5 2023-10-14 14:20:38,706 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.40 vs. 
limit=15.0 2023-10-14 14:20:42,144 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.464e+02 1.761e+02 1.908e+02 2.203e+02 3.021e+02, threshold=3.816e+02, percent-clipped=0.0 2023-10-14 14:20:49,738 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1725014.6666666667, ans=0.125 2023-10-14 14:21:00,417 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1725061.3333333333, ans=0.1 2023-10-14 14:21:05,324 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1725061.3333333333, ans=0.0 2023-10-14 14:21:05,421 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1725061.3333333333, ans=0.125 2023-10-14 14:21:14,033 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1725108.0, ans=0.125 2023-10-14 14:21:23,110 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1725108.0, ans=0.125 2023-10-14 14:21:36,563 INFO [train.py:1031] (3/4) Epoch 28, batch 1000, loss[loss=0.1831, simple_loss=0.2723, pruned_loss=0.0469, over 16939.00 frames. ], tot_loss[loss=0.1868, simple_loss=0.279, pruned_loss=0.04731, over 12904805.58 frames. ], batch size: 72, lr: 1.23e-03, grad_scale: 32.0 2023-10-14 14:21:40,583 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1725201.3333333333, ans=10.0 2023-10-14 14:21:42,051 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.27 vs. limit=22.5 2023-10-14 14:21:42,915 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1725201.3333333333, ans=0.0 2023-10-14 14:21:43,029 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.52 vs. 
limit=22.5 2023-10-14 14:22:15,004 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1725341.3333333333, ans=0.125 2023-10-14 14:22:23,059 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1725388.0, ans=0.0 2023-10-14 14:22:41,118 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.460e+02 1.826e+02 2.000e+02 2.197e+02 3.254e+02, threshold=4.000e+02, percent-clipped=0.0 2023-10-14 14:22:55,501 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1725528.0, ans=0.0 2023-10-14 14:23:06,035 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1725528.0, ans=0.125 2023-10-14 14:23:20,410 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1725621.3333333333, ans=0.2 2023-10-14 14:24:39,840 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1725901.3333333333, ans=0.0 2023-10-14 14:24:45,601 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.553e+02 1.824e+02 2.004e+02 2.248e+02 2.999e+02, threshold=4.008e+02, percent-clipped=0.0 2023-10-14 14:24:48,479 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1725901.3333333333, ans=0.125 2023-10-14 14:24:54,254 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1725948.0, ans=10.0 2023-10-14 14:25:34,312 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 14:25:38,820 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1726088.0, ans=0.125 2023-10-14 14:25:41,823 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1726088.0, ans=0.0 2023-10-14 14:26:42,845 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1726368.0, ans=0.0 2023-10-14 14:26:48,070 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.498e+02 1.728e+02 1.919e+02 2.127e+02 3.307e+02, threshold=3.838e+02, percent-clipped=0.0 2023-10-14 14:26:49,890 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.47 vs. limit=22.5 2023-10-14 14:27:05,473 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1726461.3333333333, ans=0.125 2023-10-14 14:27:16,367 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.13 vs. 
limit=22.5 2023-10-14 14:27:29,212 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1726554.6666666667, ans=0.1 2023-10-14 14:27:35,271 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1726554.6666666667, ans=0.125 2023-10-14 14:27:44,361 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.59 vs. limit=15.0 2023-10-14 14:28:08,648 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1726694.6666666667, ans=0.125 2023-10-14 14:28:36,600 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.56 vs. limit=10.0 2023-10-14 14:28:39,284 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1726788.0, ans=0.125 2023-10-14 14:28:44,169 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1726834.6666666667, ans=0.125 2023-10-14 14:28:52,007 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.548e+02 1.770e+02 1.913e+02 2.146e+02 3.065e+02, threshold=3.826e+02, percent-clipped=0.0 2023-10-14 14:28:55,414 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1726881.3333333333, ans=0.125 2023-10-14 14:28:57,079 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-14 14:29:00,697 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1726881.3333333333, ans=0.1 2023-10-14 14:29:35,215 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1727021.3333333333, ans=0.125 2023-10-14 14:29:41,725 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.30 vs. limit=22.5 2023-10-14 14:29:44,417 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1727068.0, ans=0.0 2023-10-14 14:29:46,552 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1727068.0, ans=0.125 2023-10-14 14:30:36,830 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.83 vs. limit=12.0 2023-10-14 14:30:43,041 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1727254.6666666667, ans=0.0 2023-10-14 14:30:50,415 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1727301.3333333333, ans=0.125 2023-10-14 14:30:58,787 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.537e+02 1.821e+02 1.963e+02 2.160e+02 2.997e+02, threshold=3.925e+02, percent-clipped=0.0 2023-10-14 14:31:18,183 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=9.44 vs. 
limit=22.5 2023-10-14 14:31:21,776 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1727394.6666666667, ans=0.125 2023-10-14 14:31:23,150 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1727394.6666666667, ans=0.1 2023-10-14 14:31:35,104 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.60 vs. limit=15.0 2023-10-14 14:31:35,599 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1727488.0, ans=0.09899494936611666 2023-10-14 14:31:48,447 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1727534.6666666667, ans=0.0 2023-10-14 14:31:49,290 INFO [train.py:1031] (3/4) Epoch 28, batch 1500, loss[loss=0.1881, simple_loss=0.2829, pruned_loss=0.04669, over 16797.00 frames. ], tot_loss[loss=0.185, simple_loss=0.2773, pruned_loss=0.04634, over 17308784.17 frames. ], batch size: 98, lr: 1.23e-03, grad_scale: 16.0 2023-10-14 14:31:54,069 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1727534.6666666667, ans=0.2 2023-10-14 14:33:00,606 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.533e+02 1.791e+02 1.976e+02 2.218e+02 3.078e+02, threshold=3.953e+02, percent-clipped=0.0 2023-10-14 14:33:07,442 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1727814.6666666667, ans=0.125 2023-10-14 14:33:08,589 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 14:33:08,818 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.49 vs. limit=15.0 2023-10-14 14:33:09,682 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1727814.6666666667, ans=0.125 2023-10-14 14:33:13,944 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.04 vs. limit=6.0 2023-10-14 14:33:26,724 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=3.64 vs. limit=15.0 2023-10-14 14:33:39,013 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1727908.0, ans=0.125 2023-10-14 14:33:43,550 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=3.74 vs. limit=15.0 2023-10-14 14:34:21,951 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.66 vs. 
limit=15.0 2023-10-14 14:34:31,741 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1728141.3333333333, ans=0.2 2023-10-14 14:34:47,937 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1728188.0, ans=0.125 2023-10-14 14:34:50,523 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.21 vs. limit=15.0 2023-10-14 14:34:51,588 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1728188.0, ans=0.125 2023-10-14 14:35:09,904 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.479e+02 1.790e+02 1.899e+02 2.071e+02 3.153e+02, threshold=3.798e+02, percent-clipped=0.0 2023-10-14 14:35:21,370 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1728281.3333333333, ans=10.0 2023-10-14 14:35:21,630 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.83 vs. limit=15.0 2023-10-14 14:35:42,533 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1728374.6666666667, ans=0.0 2023-10-14 14:35:51,043 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1728374.6666666667, ans=0.125 2023-10-14 14:35:52,262 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.85 vs. limit=15.0 2023-10-14 14:36:07,232 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.25 vs. limit=15.0 2023-10-14 14:36:07,898 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1728468.0, ans=0.0 2023-10-14 14:36:20,634 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1728514.6666666667, ans=0.125 2023-10-14 14:36:27,708 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.18 vs. limit=6.0 2023-10-14 14:36:30,597 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1728561.3333333333, ans=0.125 2023-10-14 14:36:49,320 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.33 vs. 
limit=15.0 2023-10-14 14:37:01,080 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1728654.6666666667, ans=0.125 2023-10-14 14:37:03,935 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1728701.3333333333, ans=0.1 2023-10-14 14:37:10,236 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.564e+02 1.890e+02 2.067e+02 2.292e+02 3.487e+02, threshold=4.133e+02, percent-clipped=0.0 2023-10-14 14:37:16,949 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1728748.0, ans=0.2 2023-10-14 14:37:16,985 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1728748.0, ans=0.0 2023-10-14 14:37:23,234 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1728748.0, ans=0.0 2023-10-14 14:37:29,191 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1728794.6666666667, ans=0.125 2023-10-14 14:37:36,394 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.84 vs. limit=15.0 2023-10-14 14:37:44,301 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1728841.3333333333, ans=0.125 2023-10-14 14:37:58,946 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1728888.0, ans=0.125 2023-10-14 14:38:00,654 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.36 vs. limit=6.0 2023-10-14 14:38:04,994 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1728888.0, ans=0.1 2023-10-14 14:38:10,780 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1728934.6666666667, ans=0.0 2023-10-14 14:38:58,017 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1729074.6666666667, ans=0.2 2023-10-14 14:39:14,973 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 14:39:28,863 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.587e+02 1.806e+02 2.004e+02 2.184e+02 2.863e+02, threshold=4.009e+02, percent-clipped=0.0 2023-10-14 14:39:31,386 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=10.38 vs. 
limit=12.0 2023-10-14 14:39:31,962 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1729214.6666666667, ans=0.125 2023-10-14 14:40:14,124 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1729354.6666666667, ans=0.0 2023-10-14 14:40:31,707 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1729401.3333333333, ans=0.125 2023-10-14 14:40:33,741 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1729401.3333333333, ans=0.2 2023-10-14 14:40:54,523 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1729448.0, ans=0.09899494936611666 2023-10-14 14:41:22,235 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1729588.0, ans=0.2 2023-10-14 14:41:26,065 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.99 vs. limit=15.0 2023-10-14 14:41:40,415 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1729634.6666666667, ans=0.1 2023-10-14 14:41:46,542 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1729634.6666666667, ans=0.0 2023-10-14 14:41:50,887 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.488e+02 1.804e+02 1.932e+02 2.173e+02 3.347e+02, threshold=3.863e+02, percent-clipped=0.0 2023-10-14 14:41:53,047 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.30 vs. limit=15.0 2023-10-14 14:42:10,397 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1729728.0, ans=0.0 2023-10-14 14:42:24,939 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.41 vs. limit=22.5 2023-10-14 14:42:33,893 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1729774.6666666667, ans=0.1 2023-10-14 14:42:38,311 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1729821.3333333333, ans=0.125 2023-10-14 14:42:51,526 INFO [train.py:1031] (3/4) Epoch 28, batch 2000, loss[loss=0.1921, simple_loss=0.2943, pruned_loss=0.04496, over 16773.00 frames. ], tot_loss[loss=0.1851, simple_loss=0.2778, pruned_loss=0.04625, over 20753691.38 frames. ], batch size: 175, lr: 1.23e-03, grad_scale: 32.0 2023-10-14 14:43:11,212 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1729914.6666666667, ans=0.0 2023-10-14 14:43:27,841 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.85 vs. limit=15.0 2023-10-14 14:43:40,701 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.64 vs. 
limit=15.0 2023-10-14 14:43:50,012 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1730008.0, ans=0.0 2023-10-14 14:44:07,879 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1730054.6666666667, ans=0.2 2023-10-14 14:44:13,320 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1730054.6666666667, ans=0.125 2023-10-14 14:44:30,632 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.442e+02 1.871e+02 2.008e+02 2.185e+02 3.185e+02, threshold=4.015e+02, percent-clipped=0.0 2023-10-14 14:44:33,888 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1730148.0, ans=0.125 2023-10-14 14:44:48,113 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.00 vs. limit=15.0 2023-10-14 14:44:48,873 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1730148.0, ans=0.1 2023-10-14 14:44:52,146 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1730194.6666666667, ans=0.125 2023-10-14 14:44:58,521 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1730194.6666666667, ans=0.0 2023-10-14 14:45:04,505 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1730241.3333333333, ans=0.125 2023-10-14 14:45:20,532 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1730288.0, ans=0.1 2023-10-14 14:45:24,451 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1730288.0, ans=0.0 2023-10-14 14:45:25,521 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1730288.0, ans=0.0 2023-10-14 14:45:30,494 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1730288.0, ans=0.2 2023-10-14 14:45:59,615 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1730334.6666666667, ans=0.1 2023-10-14 14:45:59,994 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.77 vs. 
limit=15.0 2023-10-14 14:47:32,815 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.472e+02 1.818e+02 2.055e+02 2.294e+02 2.929e+02, threshold=4.110e+02, percent-clipped=0.0 2023-10-14 14:48:05,052 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1730661.3333333333, ans=0.2 2023-10-14 14:48:07,782 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1730661.3333333333, ans=0.0 2023-10-14 14:48:08,807 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1730708.0, ans=0.1 2023-10-14 14:48:10,774 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1730708.0, ans=0.0 2023-10-14 14:48:20,627 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1730708.0, ans=0.0 2023-10-14 14:48:40,523 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1730801.3333333333, ans=0.2 2023-10-14 14:48:59,632 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 14:49:17,240 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.41 vs. limit=15.0 2023-10-14 14:49:26,772 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1730941.3333333333, ans=0.125 2023-10-14 14:49:35,166 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1730941.3333333333, ans=0.125 2023-10-14 14:49:40,889 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.57 vs. limit=12.0 2023-10-14 14:49:47,730 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1730988.0, ans=0.0 2023-10-14 14:50:07,013 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.609e+02 1.954e+02 2.147e+02 2.375e+02 3.281e+02, threshold=4.294e+02, percent-clipped=0.0 2023-10-14 14:50:28,113 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1731128.0, ans=0.0 2023-10-14 14:50:43,226 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1731174.6666666667, ans=0.1 2023-10-14 14:50:48,455 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1731221.3333333333, ans=0.125 2023-10-14 14:50:55,534 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.whiten.whitening_limit, batch_count=1731221.3333333333, ans=12.0 2023-10-14 14:51:13,806 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=1731314.6666666667, ans=15.0 2023-10-14 14:51:39,705 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1731408.0, ans=0.1 2023-10-14 14:51:53,520 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=5.33 vs. 
limit=15.0 2023-10-14 14:52:18,910 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.519e+02 1.956e+02 2.152e+02 2.408e+02 3.451e+02, threshold=4.305e+02, percent-clipped=0.0 2023-10-14 14:52:22,319 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.76 vs. limit=15.0 2023-10-14 14:52:51,184 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1731641.3333333333, ans=0.1 2023-10-14 14:53:04,658 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.12 vs. limit=22.5 2023-10-14 14:53:52,127 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1731781.3333333333, ans=0.125 2023-10-14 14:53:56,449 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.15 vs. limit=22.5 2023-10-14 14:53:59,728 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1731828.0, ans=0.125 2023-10-14 14:54:10,279 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.68 vs. limit=22.5 2023-10-14 14:54:21,882 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1731921.3333333333, ans=0.125 2023-10-14 14:54:37,088 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=3.89 vs. limit=10.0 2023-10-14 14:54:41,492 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1731968.0, ans=0.0 2023-10-14 14:54:47,211 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.622e+02 1.829e+02 1.961e+02 2.138e+02 2.613e+02, threshold=3.922e+02, percent-clipped=0.0 2023-10-14 14:55:43,844 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.65 vs. limit=15.0 2023-10-14 14:55:44,129 INFO [train.py:1031] (3/4) Epoch 28, batch 2500, loss[loss=0.1902, simple_loss=0.2823, pruned_loss=0.04906, over 16921.00 frames. ], tot_loss[loss=0.1855, simple_loss=0.2781, pruned_loss=0.04651, over 23415856.16 frames. 
], batch size: 110, lr: 1.23e-03, grad_scale: 32.0 2023-10-14 14:55:50,533 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1732201.3333333333, ans=0.2 2023-10-14 14:55:51,464 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1732201.3333333333, ans=0.125 2023-10-14 14:55:52,622 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1732201.3333333333, ans=0.0 2023-10-14 14:56:24,365 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1732341.3333333333, ans=0.125 2023-10-14 14:56:38,133 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1732341.3333333333, ans=0.125 2023-10-14 14:56:53,511 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1732388.0, ans=0.1 2023-10-14 14:57:11,356 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.538e+02 1.899e+02 2.042e+02 2.204e+02 2.818e+02, threshold=4.084e+02, percent-clipped=0.0 2023-10-14 14:57:29,322 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.78 vs. limit=6.0 2023-10-14 14:57:30,103 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1732481.3333333333, ans=0.125 2023-10-14 14:57:39,040 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1732528.0, ans=0.125 2023-10-14 14:57:42,939 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1732528.0, ans=0.2 2023-10-14 14:58:04,418 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1732621.3333333333, ans=0.0 2023-10-14 14:58:34,612 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1732714.6666666667, ans=0.0 2023-10-14 14:58:39,684 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1732714.6666666667, ans=0.1 2023-10-14 14:58:52,220 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1732761.3333333333, ans=0.2 2023-10-14 14:58:54,121 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1732761.3333333333, ans=0.0 2023-10-14 14:58:57,011 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1732761.3333333333, ans=0.1 2023-10-14 14:59:05,669 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1732808.0, ans=0.125 2023-10-14 14:59:19,333 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1732854.6666666667, ans=0.125 2023-10-14 14:59:20,557 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, 
batch_count=1732854.6666666667, ans=0.0 2023-10-14 14:59:39,402 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.33 vs. limit=12.0 2023-10-14 14:59:43,689 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.540e+02 1.849e+02 1.974e+02 2.089e+02 2.843e+02, threshold=3.949e+02, percent-clipped=0.0 2023-10-14 14:59:59,997 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1732994.6666666667, ans=0.0 2023-10-14 15:00:21,614 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1733041.3333333333, ans=0.125 2023-10-14 15:00:30,126 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.23 vs. limit=15.0 2023-10-14 15:00:42,608 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1733088.0, ans=0.2 2023-10-14 15:01:03,864 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1733134.6666666667, ans=0.0 2023-10-14 15:01:40,521 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.80 vs. limit=6.0 2023-10-14 15:01:56,281 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1733321.3333333333, ans=0.07 2023-10-14 15:02:25,238 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.596e+02 1.833e+02 2.051e+02 2.238e+02 2.778e+02, threshold=4.101e+02, percent-clipped=0.0 2023-10-14 15:02:28,402 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1733414.6666666667, ans=0.125 2023-10-14 15:02:35,438 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.75 vs. limit=15.0 2023-10-14 15:02:38,956 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1733414.6666666667, ans=0.1 2023-10-14 15:03:26,011 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1733554.6666666667, ans=0.1 2023-10-14 15:03:54,141 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=13.10 vs. limit=15.0 2023-10-14 15:03:59,287 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=10.09 vs. 
limit=22.5 2023-10-14 15:04:10,020 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1733694.6666666667, ans=0.95 2023-10-14 15:04:30,950 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1733741.3333333333, ans=0.0 2023-10-14 15:04:38,298 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=1733788.0, ans=0.05 2023-10-14 15:04:38,717 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.38 vs. limit=15.0 2023-10-14 15:04:42,369 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1733788.0, ans=0.2 2023-10-14 15:04:50,646 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1733788.0, ans=0.125 2023-10-14 15:05:00,311 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1733834.6666666667, ans=0.125 2023-10-14 15:05:07,689 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.542e+02 1.829e+02 2.030e+02 2.198e+02 3.122e+02, threshold=4.059e+02, percent-clipped=0.0 2023-10-14 15:05:08,107 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1733834.6666666667, ans=0.125 2023-10-14 15:05:26,210 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1733881.3333333333, ans=0.1 2023-10-14 15:05:46,892 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.43 vs. limit=15.0 2023-10-14 15:05:47,649 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1733974.6666666667, ans=0.0 2023-10-14 15:05:53,223 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.24 vs. limit=15.0 2023-10-14 15:06:14,610 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1734021.3333333333, ans=0.1 2023-10-14 15:07:00,108 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1734161.3333333333, ans=0.125 2023-10-14 15:07:13,006 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1734208.0, ans=0.0 2023-10-14 15:07:33,865 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1734301.3333333333, ans=0.0 2023-10-14 15:07:43,782 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.457e+02 1.806e+02 2.035e+02 2.237e+02 2.651e+02, threshold=4.070e+02, percent-clipped=0.0 2023-10-14 15:07:58,658 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.72 vs. 
limit=15.0 2023-10-14 15:08:01,528 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1734394.6666666667, ans=0.0 2023-10-14 15:08:05,780 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1734394.6666666667, ans=0.125 2023-10-14 15:08:29,257 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.82 vs. limit=15.0 2023-10-14 15:08:32,572 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=16.23 vs. limit=22.5 2023-10-14 15:08:44,847 INFO [train.py:1031] (3/4) Epoch 28, batch 3000, loss[loss=0.18, simple_loss=0.2685, pruned_loss=0.04575, over 16940.00 frames. ], tot_loss[loss=0.1849, simple_loss=0.2772, pruned_loss=0.04633, over 25503146.03 frames. ], batch size: 123, lr: 1.23e-03, grad_scale: 16.0 2023-10-14 15:08:47,444 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1734534.6666666667, ans=0.0 2023-10-14 15:08:53,593 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.46 vs. limit=12.0 2023-10-14 15:08:59,620 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1734581.3333333333, ans=0.1 2023-10-14 15:09:00,476 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1734581.3333333333, ans=0.125 2023-10-14 15:09:19,125 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1734628.0, ans=0.125 2023-10-14 15:10:07,042 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.532e+02 1.771e+02 1.901e+02 2.082e+02 2.891e+02, threshold=3.802e+02, percent-clipped=0.0 2023-10-14 15:10:29,760 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1734861.3333333333, ans=0.07 2023-10-14 15:10:37,272 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1734908.0, ans=0.2 2023-10-14 15:11:02,491 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1734954.6666666667, ans=0.0 2023-10-14 15:11:36,552 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1735048.0, ans=0.2 2023-10-14 15:11:54,421 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.34 vs. limit=15.0 2023-10-14 15:12:41,085 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1735234.6666666667, ans=0.2 2023-10-14 15:12:41,483 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.35 vs. 
limit=6.0 2023-10-14 15:12:45,740 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.477e+02 1.860e+02 1.988e+02 2.175e+02 3.016e+02, threshold=3.977e+02, percent-clipped=0.0 2023-10-14 15:13:16,780 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1735374.6666666667, ans=0.1 2023-10-14 15:13:17,019 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.75 vs. limit=15.0 2023-10-14 15:13:39,223 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1735421.3333333333, ans=0.125 2023-10-14 15:14:02,275 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1735468.0, ans=0.125 2023-10-14 15:15:23,426 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1735654.6666666667, ans=0.1 2023-10-14 15:15:44,267 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1735701.3333333333, ans=0.2 2023-10-14 15:15:46,890 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1735701.3333333333, ans=0.0 2023-10-14 15:15:47,503 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.514e+02 1.882e+02 2.003e+02 2.133e+02 3.144e+02, threshold=4.005e+02, percent-clipped=0.0 2023-10-14 15:16:00,270 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 15:17:08,825 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1735934.6666666667, ans=0.0 2023-10-14 15:17:31,721 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1735981.3333333333, ans=0.125 2023-10-14 15:17:39,809 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1735981.3333333333, ans=0.95 2023-10-14 15:17:59,850 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1736028.0, ans=0.2 2023-10-14 15:18:42,188 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1736121.3333333333, ans=0.5 2023-10-14 15:18:59,697 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1736168.0, ans=0.125 2023-10-14 15:19:11,802 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1736168.0, ans=0.1 2023-10-14 15:19:20,473 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.606e+02 1.896e+02 2.034e+02 2.281e+02 3.076e+02, threshold=4.069e+02, percent-clipped=0.0 2023-10-14 15:19:32,634 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 15:19:49,170 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1736261.3333333333, ans=0.125 2023-10-14 15:20:14,540 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1736308.0, ans=0.125 2023-10-14 15:20:14,671 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1736308.0, ans=0.0 2023-10-14 15:20:52,687 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=1736354.6666666667, ans=15.0 2023-10-14 15:21:07,817 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=6.57 vs. limit=15.0 2023-10-14 15:21:26,935 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1736401.3333333333, ans=0.125 2023-10-14 15:21:31,158 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.84 vs. limit=6.0 2023-10-14 15:21:37,470 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.07 vs. limit=6.0 2023-10-14 15:22:08,762 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.12 vs. limit=12.0 2023-10-14 15:22:11,147 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1736494.6666666667, ans=0.0 2023-10-14 15:22:26,319 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1736541.3333333333, ans=0.125 2023-10-14 15:22:27,705 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1736541.3333333333, ans=0.2 2023-10-14 15:22:37,756 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.15 vs. limit=22.5 2023-10-14 15:22:54,162 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.35 vs. limit=6.0 2023-10-14 15:23:01,782 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1736588.0, ans=0.125 2023-10-14 15:23:34,377 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 15:23:39,154 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.651e+02 1.914e+02 2.061e+02 2.289e+02 2.978e+02, threshold=4.121e+02, percent-clipped=0.0 2023-10-14 15:25:18,350 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1736868.0, ans=0.125 2023-10-14 15:25:21,398 INFO [train.py:1031] (3/4) Epoch 28, batch 3500, loss[loss=0.2011, simple_loss=0.283, pruned_loss=0.05962, over 16709.00 frames. ], tot_loss[loss=0.1849, simple_loss=0.2772, pruned_loss=0.04635, over 27151619.93 frames. 
], batch size: 56, lr: 1.23e-03, grad_scale: 16.0 2023-10-14 15:25:28,885 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1736868.0, ans=0.125 2023-10-14 15:26:47,483 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1737008.0, ans=0.2 2023-10-14 15:27:01,580 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1737054.6666666667, ans=0.0 2023-10-14 15:27:57,914 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1737148.0, ans=0.125 2023-10-14 15:28:00,929 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.532e+02 1.860e+02 2.008e+02 2.208e+02 4.351e+02, threshold=4.017e+02, percent-clipped=1.0 2023-10-14 15:28:02,366 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1737148.0, ans=0.125 2023-10-14 15:28:37,921 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.95 vs. limit=15.0 2023-10-14 15:28:51,603 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.92 vs. limit=10.0 2023-10-14 15:29:08,290 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1737241.3333333333, ans=0.1 2023-10-14 15:29:56,517 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.77 vs. limit=15.0 2023-10-14 15:31:19,859 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.27 vs. 
limit=15.0 2023-10-14 15:31:30,713 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1737474.6666666667, ans=0.0 2023-10-14 15:31:40,410 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1737474.6666666667, ans=0.125 2023-10-14 15:32:02,281 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1737521.3333333333, ans=0.2 2023-10-14 15:33:00,914 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1737614.6666666667, ans=0.1 2023-10-14 15:33:01,792 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.535e+02 1.806e+02 1.972e+02 2.156e+02 3.674e+02, threshold=3.943e+02, percent-clipped=0.0 2023-10-14 15:33:16,647 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1737614.6666666667, ans=0.125 2023-10-14 15:33:48,085 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1737661.3333333333, ans=0.0 2023-10-14 15:34:01,272 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1737708.0, ans=0.125 2023-10-14 15:34:54,314 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1737801.3333333333, ans=0.125 2023-10-14 15:35:11,422 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.24 vs. 
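
The scaling.py:199 ScheduledFloat entries record regularization constants (dropout_p, conv/attention skip rates, balancer probs, bypass scale_min) whose current value, logged as "ans", is a function of batch_count. A sketch of a piecewise-linear schedule in that spirit; the breakpoints below are illustrative assumptions, and at the batch counts seen here (around 1.737e+06) any warmup-style schedule would long since sit at its final value:

class PiecewiseLinear:
    """Value as a piecewise-linear function of batch_count,
    clamped to the first/last breakpoint outside their range."""

    def __init__(self, *points):
        # points: (batch_count, value) pairs.
        self.points = sorted(points)

    def __call__(self, batch_count: float) -> float:
        pts = self.points
        if batch_count <= pts[0][0]:
            return pts[0][1]
        if batch_count >= pts[-1][0]:
            return pts[-1][1]
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if x0 <= batch_count <= x1:
                t = (batch_count - x0) / (x1 - x0)
                return y0 + t * (y1 - y0)

# Hypothetical schedule: a skip rate that anneals from 0.2 to 0.0
# over the first 4000 batches, then stays at 0.0.
conv_skip_rate = PiecewiseLinear((0.0, 0.2), (4000.0, 0.0))
assert conv_skip_rate(1737894.67) == 0.0   # consistent with the ans=0.0 entries here
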
limit=15.0 2023-10-14 15:35:31,061 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1737894.6666666667, ans=0.07 2023-10-14 15:35:37,530 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1737894.6666666667, ans=0.125 2023-10-14 15:35:45,846 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1737941.3333333333, ans=0.125 2023-10-14 15:35:52,001 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1737941.3333333333, ans=0.0 2023-10-14 15:36:04,690 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1737988.0, ans=0.125 2023-10-14 15:36:10,272 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1737988.0, ans=0.125 2023-10-14 15:36:22,780 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1738034.6666666667, ans=0.0 2023-10-14 15:36:26,955 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.513e+02 1.786e+02 2.013e+02 2.207e+02 2.682e+02, threshold=4.027e+02, percent-clipped=0.0 2023-10-14 15:36:41,491 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1738128.0, ans=0.125 2023-10-14 15:36:47,670 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1738128.0, ans=0.0 2023-10-14 15:37:05,825 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1738221.3333333333, ans=0.125 2023-10-14 15:37:11,577 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1738221.3333333333, ans=0.5 2023-10-14 15:37:12,632 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.28 vs. limit=15.0 2023-10-14 15:37:17,848 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.33 vs. limit=15.0 2023-10-14 15:37:21,502 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1738268.0, ans=0.05 2023-10-14 15:37:24,848 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_ff2.min_abs, batch_count=1738268.0, ans=0.1 2023-10-14 15:37:36,126 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1738314.6666666667, ans=0.09899494936611666 2023-10-14 15:37:50,369 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.96 vs. 
limit=22.5 2023-10-14 15:37:58,938 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1738408.0, ans=0.125 2023-10-14 15:38:22,459 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1738501.3333333333, ans=0.0 2023-10-14 15:38:29,544 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.513e+02 1.788e+02 1.940e+02 2.196e+02 3.463e+02, threshold=3.880e+02, percent-clipped=0.0 2023-10-14 15:39:07,259 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1738688.0, ans=0.125 2023-10-14 15:39:08,233 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1738688.0, ans=0.04949747468305833 2023-10-14 15:39:10,121 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1738688.0, ans=0.0 2023-10-14 15:39:10,509 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.01 vs. limit=6.0 2023-10-14 15:39:37,386 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1738828.0, ans=0.1 2023-10-14 15:39:37,681 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=6.19 vs. limit=15.0 2023-10-14 15:39:50,073 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.59 vs. limit=15.0 2023-10-14 15:39:58,616 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1738921.3333333333, ans=0.0 2023-10-14 15:40:02,189 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1738921.3333333333, ans=0.0 2023-10-14 15:40:06,654 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1738921.3333333333, ans=0.125 2023-10-14 15:40:11,280 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1738968.0, ans=0.2 2023-10-14 15:40:20,163 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1739014.6666666667, ans=0.5 2023-10-14 15:40:21,987 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.477e+02 1.788e+02 1.984e+02 2.218e+02 3.006e+02, threshold=3.969e+02, percent-clipped=0.0 2023-10-14 15:40:25,323 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=1739014.6666666667, ans=0.5 2023-10-14 15:40:30,613 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.84 vs. 
limit=15.0 2023-10-14 15:40:38,133 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1739061.3333333333, ans=0.125 2023-10-14 15:40:47,605 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1739108.0, ans=0.0 2023-10-14 15:41:08,975 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1739201.3333333333, ans=0.1 2023-10-14 15:41:09,695 INFO [train.py:1031] (3/4) Epoch 28, batch 4000, loss[loss=0.1908, simple_loss=0.2891, pruned_loss=0.04623, over 16768.00 frames. ], tot_loss[loss=0.185, simple_loss=0.277, pruned_loss=0.04648, over 28397523.40 frames. ], batch size: 98, lr: 1.23e-03, grad_scale: 32.0 2023-10-14 15:41:20,919 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1739201.3333333333, ans=0.0 2023-10-14 15:41:22,933 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.87 vs. limit=6.0 2023-10-14 15:41:25,723 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1739248.0, ans=0.2 2023-10-14 15:41:25,888 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.69 vs. limit=12.0 2023-10-14 15:41:28,653 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1739248.0, ans=0.2 2023-10-14 15:41:47,754 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1739341.3333333333, ans=0.0 2023-10-14 15:41:49,775 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1739341.3333333333, ans=0.125 2023-10-14 15:41:50,866 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.45 vs. limit=22.5 2023-10-14 15:42:00,223 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1739388.0, ans=0.04949747468305833 2023-10-14 15:42:04,654 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1739388.0, ans=0.1 2023-10-14 15:42:08,203 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1739388.0, ans=0.0 2023-10-14 15:42:18,098 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=6.84 vs. 
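
Each scaling.py:979 Whitening entry compares a per-module metric against a limit (6.0 for the attention whiten_keys, 10.0 to 22.5 elsewhere); in the entries above the metric stays below its limit. One standard whitening metric with the right properties is the ratio mean(lambda^2) / mean(lambda)^2 over the eigenvalues lambda of the feature covariance: it equals 1.0 when activations are perfectly "white" (covariance proportional to the identity) and grows as the covariance departs from that. A hedged sketch under that assumed definition, not necessarily icefall's exact formula:

import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> torch.Tensor:
    # x: (num_frames, num_channels) activations for one module.
    n, c = x.shape
    cpg = c // num_groups                                # channels per group
    xg = x.reshape(n, num_groups, cpg).transpose(0, 1)   # (groups, n, cpg)
    cov = torch.matmul(xg.transpose(1, 2), xg) / n       # per-group covariance
    mean_eig = cov.diagonal(dim1=1, dim2=2).mean(dim=1)  # trace/cpg per group
    mean_sq_eig = (cov ** 2).sum(dim=(1, 2)) / cpg       # sum(eig^2)/cpg per group
    # >= 1.0 by Jensen's inequality; == 1.0 iff cov is a multiple of the identity.
    return (mean_sq_eig / (mean_eig ** 2 + 1e-20)).mean()

Under this reading, the limit is the ceiling above which a corrective gradient would be applied; an entry like metric=10.96 vs. limit=22.5 is still comfortably inside it.
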
limit=15.0 2023-10-14 15:42:22,275 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.609e+02 1.840e+02 2.041e+02 2.289e+02 3.355e+02, threshold=4.081e+02, percent-clipped=0.0 2023-10-14 15:42:29,150 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1739481.3333333333, ans=0.125 2023-10-14 15:42:57,073 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1739621.3333333333, ans=0.125 2023-10-14 15:43:08,359 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1739668.0, ans=0.2 2023-10-14 15:43:15,681 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1739668.0, ans=10.0 2023-10-14 15:43:24,920 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1739714.6666666667, ans=0.0 2023-10-14 15:43:29,301 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1739761.3333333333, ans=0.95 2023-10-14 15:43:31,600 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1739761.3333333333, ans=0.0 2023-10-14 15:43:38,267 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1739761.3333333333, ans=0.0 2023-10-14 15:43:54,477 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1739854.6666666667, ans=0.125 2023-10-14 15:44:17,801 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.561e+02 1.841e+02 1.969e+02 2.167e+02 3.462e+02, threshold=3.939e+02, percent-clipped=0.0 2023-10-14 15:44:58,751 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.38 vs. 
limit=22.5 2023-10-14 15:45:10,389 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=1740088.0, ans=0.07 2023-10-14 15:45:15,250 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1740134.6666666667, ans=0.1 2023-10-14 15:45:34,854 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1740181.3333333333, ans=0.125 2023-10-14 15:45:49,780 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1740228.0, ans=0.09899494936611666 2023-10-14 15:45:56,475 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1740274.6666666667, ans=0.125 2023-10-14 15:45:57,412 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1740274.6666666667, ans=0.125 2023-10-14 15:46:00,465 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=1740274.6666666667, ans=15.0 2023-10-14 15:46:02,662 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.49 vs. limit=15.0 2023-10-14 15:46:02,750 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=1740274.6666666667, ans=15.0 2023-10-14 15:46:08,206 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1740321.3333333333, ans=0.125 2023-10-14 15:46:14,121 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1740321.3333333333, ans=0.125 2023-10-14 15:46:30,422 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1740414.6666666667, ans=0.125 2023-10-14 15:46:31,018 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.559e+02 1.884e+02 2.041e+02 2.273e+02 3.474e+02, threshold=4.082e+02, percent-clipped=0.0 2023-10-14 15:46:38,619 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1740461.3333333333, ans=0.0 2023-10-14 15:46:50,133 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1740508.0, ans=0.2 2023-10-14 15:46:51,572 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1740508.0, ans=0.1 2023-10-14 15:46:53,646 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.69 vs. 
limit=15.0 2023-10-14 15:47:00,269 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1740554.6666666667, ans=0.125 2023-10-14 15:47:14,019 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1740601.3333333333, ans=0.125 2023-10-14 15:48:01,640 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1740788.0, ans=0.2 2023-10-14 15:48:07,631 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1740834.6666666667, ans=0.1 2023-10-14 15:48:11,056 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1740834.6666666667, ans=0.0 2023-10-14 15:48:14,809 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1740834.6666666667, ans=0.1 2023-10-14 15:48:19,227 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1740881.3333333333, ans=0.0 2023-10-14 15:48:21,923 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.557e+02 1.896e+02 2.054e+02 2.245e+02 3.668e+02, threshold=4.108e+02, percent-clipped=0.0 2023-10-14 15:48:39,209 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1740928.0, ans=0.0 2023-10-14 15:48:46,613 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1740974.6666666667, ans=0.1 2023-10-14 15:49:04,132 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1741021.3333333333, ans=0.125 2023-10-14 15:49:13,869 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.08 vs. limit=15.0 2023-10-14 15:49:24,418 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1741114.6666666667, ans=0.0 2023-10-14 15:50:29,329 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.590e+02 1.917e+02 2.089e+02 2.285e+02 3.043e+02, threshold=4.177e+02, percent-clipped=0.0 2023-10-14 15:50:29,629 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1741348.0, ans=0.125 2023-10-14 15:50:29,792 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1741348.0, ans=0.05 2023-10-14 15:50:46,035 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1741394.6666666667, ans=0.0 2023-10-14 15:50:55,208 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1741441.3333333333, ans=0.07 2023-10-14 15:51:11,897 INFO [train.py:1031] (3/4) Epoch 28, batch 4500, loss[loss=0.1989, simple_loss=0.2767, pruned_loss=0.0606, over 15778.00 frames. ], tot_loss[loss=0.1849, simple_loss=0.2772, pruned_loss=0.04632, over 29363969.80 frames. 
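
The train.py:1031 summaries report a per-batch loss[... over N frames] beside tot_loss[... over M frames], and across batches 2500 through 4500 the denominator M climbs from about 23.4M to 29.4M frames with shrinking increments. That pattern is consistent with a frame-weighted running sum in which older batches are geometrically down-weighted; a minimal sketch under that assumption (the decay constant is invented for illustration):

class RunningLoss:
    """Frame-weighted running loss with exponential forgetting."""

    def __init__(self, decay: float = 0.999):
        self.decay = decay
        self.loss_sum = 0.0    # decayed sum of loss * frames
        self.frames = 0.0      # decayed sum of frames

    def update(self, batch_loss: float, batch_frames: float) -> None:
        # batch_loss is the mean loss over batch_frames frames.
        self.loss_sum = self.decay * self.loss_sum + batch_loss * batch_frames
        self.frames = self.decay * self.frames + batch_frames

    @property
    def tot_loss(self) -> float:
        return self.loss_sum / max(self.frames, 1.0)

tracker = RunningLoss()
tracker.update(0.1908, 16768.0)   # numbers copied from the batch 4000 entry above
print(f"tot_loss[loss={tracker.tot_loss:.4f}, over {tracker.frames:.2f} frames]")
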
], batch size: 350, lr: 1.23e-03, grad_scale: 16.0 2023-10-14 15:51:32,932 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1741628.0, ans=0.0 2023-10-14 15:51:58,131 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1741721.3333333333, ans=0.0 2023-10-14 15:52:15,970 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1741768.0, ans=0.1 2023-10-14 15:52:23,214 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.501e+02 1.849e+02 2.062e+02 2.330e+02 3.275e+02, threshold=4.124e+02, percent-clipped=0.0 2023-10-14 15:52:29,783 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1741861.3333333333, ans=0.0 2023-10-14 15:52:36,598 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1741861.3333333333, ans=0.125 2023-10-14 15:52:52,113 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1741954.6666666667, ans=0.1 2023-10-14 15:52:53,196 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1741954.6666666667, ans=0.1 2023-10-14 15:52:58,700 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.30 vs. limit=15.0 2023-10-14 15:53:03,806 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.00 vs. limit=6.0 2023-10-14 15:53:06,027 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1742001.3333333333, ans=0.2 2023-10-14 15:53:09,132 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1742001.3333333333, ans=0.0 2023-10-14 15:53:12,799 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1742001.3333333333, ans=0.0 2023-10-14 15:53:16,546 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1742048.0, ans=0.125 2023-10-14 15:54:08,564 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1742281.3333333333, ans=0.125 2023-10-14 15:54:12,633 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.544e+02 1.804e+02 2.036e+02 2.187e+02 2.860e+02, threshold=4.073e+02, percent-clipped=0.0 2023-10-14 15:54:40,400 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1742374.6666666667, ans=0.125 2023-10-14 15:54:40,871 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.14 vs. limit=22.5 2023-10-14 15:54:49,644 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.77 vs. 
limit=22.5 2023-10-14 15:54:54,277 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1742468.0, ans=0.125 2023-10-14 15:55:01,831 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1742468.0, ans=0.1 2023-10-14 15:55:10,297 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1742514.6666666667, ans=0.0 2023-10-14 15:55:22,161 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.13 vs. limit=10.0 2023-10-14 15:55:23,132 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.85 vs. limit=15.0 2023-10-14 15:55:24,378 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1742561.3333333333, ans=15.0 2023-10-14 15:55:25,991 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1742561.3333333333, ans=0.04949747468305833 2023-10-14 15:55:43,653 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=1742654.6666666667, ans=0.07 2023-10-14 15:55:49,898 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1742701.3333333333, ans=0.125 2023-10-14 15:55:51,797 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.66 vs. limit=6.0 2023-10-14 15:55:56,683 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.57 vs. limit=15.0 2023-10-14 15:56:04,221 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.590e+02 1.855e+02 2.027e+02 2.262e+02 3.196e+02, threshold=4.053e+02, percent-clipped=0.0 2023-10-14 15:56:43,388 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1742934.6666666667, ans=0.125 2023-10-14 15:56:52,483 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.98 vs. 
limit=15.0 2023-10-14 15:57:18,007 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.min_positive, batch_count=1743028.0, ans=0.025 2023-10-14 15:57:31,217 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1743074.6666666667, ans=0.125 2023-10-14 15:57:36,991 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1743121.3333333333, ans=0.0 2023-10-14 15:57:51,710 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1743168.0, ans=0.07 2023-10-14 15:57:54,553 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1743168.0, ans=0.0 2023-10-14 15:57:59,468 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1743214.6666666667, ans=0.125 2023-10-14 15:58:00,151 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.507e+02 1.830e+02 1.950e+02 2.107e+02 3.143e+02, threshold=3.900e+02, percent-clipped=0.0 2023-10-14 15:58:32,709 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1743354.6666666667, ans=0.2 2023-10-14 15:58:39,223 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1743354.6666666667, ans=0.1 2023-10-14 15:59:14,396 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1743494.6666666667, ans=0.0 2023-10-14 15:59:15,951 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.46 vs. limit=15.0 2023-10-14 15:59:17,944 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1743541.3333333333, ans=0.035 2023-10-14 15:59:20,414 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1743541.3333333333, ans=0.125 2023-10-14 15:59:25,167 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1743541.3333333333, ans=0.07 2023-10-14 15:59:34,178 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1743588.0, ans=0.0 2023-10-14 15:59:36,704 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 15:59:38,511 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1743588.0, ans=0.0 2023-10-14 15:59:58,524 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.534e+02 1.794e+02 1.979e+02 2.137e+02 2.991e+02, threshold=3.959e+02, percent-clipped=0.0 2023-10-14 16:00:05,885 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1743728.0, ans=0.1 2023-10-14 16:00:37,898 INFO [train.py:1031] (3/4) Epoch 28, batch 5000, loss[loss=0.1962, simple_loss=0.2892, pruned_loss=0.05157, over 16889.00 frames. ], tot_loss[loss=0.1852, simple_loss=0.2772, pruned_loss=0.04655, over 30148788.19 frames. 
], batch size: 72, lr: 1.23e-03, grad_scale: 32.0 2023-10-14 16:00:40,995 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.13 vs. limit=22.5 2023-10-14 16:01:07,821 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1743961.3333333333, ans=0.04949747468305833 2023-10-14 16:01:18,682 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1744008.0, ans=0.2 2023-10-14 16:01:21,876 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.18 vs. limit=6.0 2023-10-14 16:01:22,486 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=1744008.0, ans=0.05 2023-10-14 16:01:50,882 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.611e+02 1.864e+02 2.085e+02 2.384e+02 3.644e+02, threshold=4.169e+02, percent-clipped=0.0 2023-10-14 16:02:47,318 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1744334.6666666667, ans=0.0 2023-10-14 16:02:57,514 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1744381.3333333333, ans=0.125 2023-10-14 16:03:03,124 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1744428.0, ans=0.125 2023-10-14 16:03:14,267 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1744474.6666666667, ans=0.125 2023-10-14 16:03:24,843 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1744521.3333333333, ans=0.125 2023-10-14 16:03:36,563 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1744568.0, ans=0.125 2023-10-14 16:03:47,629 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1744614.6666666667, ans=0.125 2023-10-14 16:03:47,734 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1744614.6666666667, ans=0.125 2023-10-14 16:03:52,471 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.518e+02 1.902e+02 2.070e+02 2.236e+02 3.472e+02, threshold=4.140e+02, percent-clipped=0.0 2023-10-14 16:03:53,172 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.02 vs. limit=15.0 2023-10-14 16:03:59,074 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=11.88 vs. limit=15.0 2023-10-14 16:04:07,973 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 16:04:13,305 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.36 vs. 
limit=15.0 2023-10-14 16:04:22,791 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1744754.6666666667, ans=0.0 2023-10-14 16:04:27,493 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1744754.6666666667, ans=0.0 2023-10-14 16:04:33,512 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.67 vs. limit=15.0 2023-10-14 16:04:36,587 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1744801.3333333333, ans=0.125 2023-10-14 16:04:41,448 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=4.46 vs. limit=15.0 2023-10-14 16:04:48,674 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1744848.0, ans=0.2 2023-10-14 16:04:51,132 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.69 vs. limit=15.0 2023-10-14 16:04:56,203 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 16:05:38,502 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1745034.6666666667, ans=0.0 2023-10-14 16:05:42,811 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1745081.3333333333, ans=0.0 2023-10-14 16:05:46,759 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.464e+02 1.816e+02 1.943e+02 2.169e+02 3.281e+02, threshold=3.886e+02, percent-clipped=0.0 2023-10-14 16:05:48,659 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.13 vs. limit=15.0 2023-10-14 16:05:55,464 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1745128.0, ans=0.125 2023-10-14 16:06:08,406 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.65 vs. limit=12.0 2023-10-14 16:06:16,229 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1745221.3333333333, ans=0.1 2023-10-14 16:06:32,534 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1745221.3333333333, ans=0.125 2023-10-14 16:07:14,229 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1745408.0, ans=0.125 2023-10-14 16:07:24,343 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.16 vs. 
limit=15.0 2023-10-14 16:07:32,539 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1745454.6666666667, ans=0.0 2023-10-14 16:07:36,436 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1745501.3333333333, ans=0.125 2023-10-14 16:07:45,514 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1745548.0, ans=0.125 2023-10-14 16:07:52,160 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.489e+02 1.708e+02 1.854e+02 2.089e+02 2.784e+02, threshold=3.707e+02, percent-clipped=0.0 2023-10-14 16:08:00,370 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1745594.6666666667, ans=0.0 2023-10-14 16:08:15,577 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1745641.3333333333, ans=0.125 2023-10-14 16:08:21,336 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.42 vs. limit=22.5 2023-10-14 16:08:22,122 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1745688.0, ans=0.0 2023-10-14 16:08:25,103 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.17 vs. limit=15.0 2023-10-14 16:08:27,988 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.61 vs. limit=15.0 2023-10-14 16:08:57,128 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1745828.0, ans=0.125 2023-10-14 16:09:02,643 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1745874.6666666667, ans=0.125 2023-10-14 16:09:08,607 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1745874.6666666667, ans=0.05 2023-10-14 16:09:20,952 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1745921.3333333333, ans=0.125 2023-10-14 16:09:25,330 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.49 vs. 
limit=15.0 2023-10-14 16:09:33,106 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1745968.0, ans=0.125 2023-10-14 16:09:34,842 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1745968.0, ans=0.125 2023-10-14 16:09:42,771 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.510e+02 1.876e+02 2.053e+02 2.249e+02 3.068e+02, threshold=4.106e+02, percent-clipped=0.0 2023-10-14 16:09:56,712 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=1746061.3333333333, ans=0.5 2023-10-14 16:10:06,854 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1746108.0, ans=0.2 2023-10-14 16:10:10,186 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 16:10:20,579 INFO [train.py:1031] (3/4) Epoch 28, batch 5500, loss[loss=0.1642, simple_loss=0.2655, pruned_loss=0.03141, over 16872.00 frames. ], tot_loss[loss=0.185, simple_loss=0.2772, pruned_loss=0.04642, over 30745833.69 frames. ], batch size: 104, lr: 1.23e-03, grad_scale: 16.0 2023-10-14 16:10:50,741 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1746294.6666666667, ans=0.125 2023-10-14 16:10:57,717 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.70 vs. limit=6.0 2023-10-14 16:11:05,340 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1746388.0, ans=0.125 2023-10-14 16:11:14,390 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1746434.6666666667, ans=0.1 2023-10-14 16:11:27,844 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1746481.3333333333, ans=0.0 2023-10-14 16:11:32,633 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.459e+02 1.777e+02 1.973e+02 2.149e+02 2.980e+02, threshold=3.945e+02, percent-clipped=0.0 2023-10-14 16:11:44,748 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1746528.0, ans=0.2 2023-10-14 16:11:47,453 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1746528.0, ans=0.125 2023-10-14 16:12:12,167 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1746668.0, ans=0.0 2023-10-14 16:12:23,599 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1746714.6666666667, ans=0.1 2023-10-14 16:12:34,766 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1746761.3333333333, ans=0.1 2023-10-14 16:12:43,358 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.57 vs. 
limit=15.0 2023-10-14 16:12:46,589 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1746808.0, ans=0.125 2023-10-14 16:12:58,218 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1746854.6666666667, ans=0.125 2023-10-14 16:13:03,979 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.51 vs. limit=12.0 2023-10-14 16:13:14,360 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1746901.3333333333, ans=0.0 2023-10-14 16:13:14,363 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1746901.3333333333, ans=0.125 2023-10-14 16:13:23,196 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.503e+02 1.843e+02 1.960e+02 2.255e+02 3.918e+02, threshold=3.920e+02, percent-clipped=0.0 2023-10-14 16:13:33,311 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1746994.6666666667, ans=0.125 2023-10-14 16:13:33,327 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1746994.6666666667, ans=0.125 2023-10-14 16:13:38,598 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1746994.6666666667, ans=0.0 2023-10-14 16:13:51,170 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1747041.3333333333, ans=0.0 2023-10-14 16:13:53,934 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.54 vs. 
limit=15.0 2023-10-14 16:13:55,628 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1747088.0, ans=0.125 2023-10-14 16:14:02,048 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1747088.0, ans=0.0 2023-10-14 16:14:09,601 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=1747134.6666666667, ans=0.5 2023-10-14 16:14:17,460 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1747134.6666666667, ans=0.125 2023-10-14 16:14:33,091 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1747228.0, ans=0.125 2023-10-14 16:14:35,554 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1747228.0, ans=0.1 2023-10-14 16:14:51,693 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1747274.6666666667, ans=0.0 2023-10-14 16:14:54,992 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1747321.3333333333, ans=0.1 2023-10-14 16:14:59,503 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1747321.3333333333, ans=0.0 2023-10-14 16:15:16,951 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1747414.6666666667, ans=0.125 2023-10-14 16:15:17,288 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.97 vs. limit=12.0 2023-10-14 16:15:23,550 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.449e+02 1.852e+02 2.049e+02 2.367e+02 3.817e+02, threshold=4.098e+02, percent-clipped=0.0 2023-10-14 16:15:24,155 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.34 vs. limit=15.0 2023-10-14 16:15:32,300 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.14 vs. limit=15.0 2023-10-14 16:15:34,345 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1747461.3333333333, ans=0.125 2023-10-14 16:15:44,093 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.53 vs. 
limit=12.0 2023-10-14 16:15:48,440 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1747508.0, ans=0.125 2023-10-14 16:15:50,457 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1747508.0, ans=0.1 2023-10-14 16:16:21,555 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1747648.0, ans=0.0 2023-10-14 16:16:23,006 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1747648.0, ans=0.125 2023-10-14 16:16:41,366 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1747741.3333333333, ans=0.0 2023-10-14 16:16:41,409 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1747741.3333333333, ans=0.0 2023-10-14 16:16:58,635 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1747788.0, ans=0.0 2023-10-14 16:17:18,798 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.566e+02 1.829e+02 1.973e+02 2.183e+02 3.097e+02, threshold=3.945e+02, percent-clipped=0.0 2023-10-14 16:17:23,631 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1747928.0, ans=0.0 2023-10-14 16:17:28,015 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1747928.0, ans=0.0 2023-10-14 16:17:41,453 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.04 vs. limit=15.0 2023-10-14 16:17:46,918 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1748021.3333333333, ans=0.125 2023-10-14 16:18:08,919 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1748114.6666666667, ans=0.125 2023-10-14 16:18:08,979 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1748114.6666666667, ans=0.0 2023-10-14 16:18:11,099 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 16:18:36,631 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1748208.0, ans=0.0 2023-10-14 16:19:16,305 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.466e+02 1.836e+02 1.937e+02 2.254e+02 2.887e+02, threshold=3.875e+02, percent-clipped=0.0 2023-10-14 16:19:18,958 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.35 vs. 
limit=10.0 2023-10-14 16:19:36,338 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1748441.3333333333, ans=0.125 2023-10-14 16:19:55,089 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1748488.0, ans=0.0 2023-10-14 16:19:57,123 INFO [train.py:1031] (3/4) Epoch 28, batch 6000, loss[loss=0.2048, simple_loss=0.2942, pruned_loss=0.05776, over 16693.00 frames. ], tot_loss[loss=0.1854, simple_loss=0.2776, pruned_loss=0.04662, over 31234028.78 frames. ], batch size: 202, lr: 1.22e-03, grad_scale: 32.0 2023-10-14 16:20:01,250 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.35 vs. limit=10.0 2023-10-14 16:20:48,773 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1748721.3333333333, ans=0.1 2023-10-14 16:20:48,916 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 16:21:06,535 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1748768.0, ans=0.0 2023-10-14 16:21:06,556 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1748768.0, ans=0.2 2023-10-14 16:21:11,376 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1748814.6666666667, ans=0.0 2023-10-14 16:21:12,227 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1748814.6666666667, ans=0.2 2023-10-14 16:21:16,312 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.500e+02 1.904e+02 2.046e+02 2.284e+02 3.220e+02, threshold=4.093e+02, percent-clipped=0.0 2023-10-14 16:21:31,266 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1748861.3333333333, ans=0.125 2023-10-14 16:21:53,130 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1748954.6666666667, ans=0.125 2023-10-14 16:21:59,235 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1749001.3333333333, ans=0.125 2023-10-14 16:22:34,316 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1749141.3333333333, ans=0.125 2023-10-14 16:23:08,293 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.71 vs. 
limit=15.0 2023-10-14 16:23:10,838 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.546e+02 1.888e+02 2.018e+02 2.306e+02 5.116e+02, threshold=4.036e+02, percent-clipped=1.0 2023-10-14 16:24:00,085 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1749468.0, ans=0.2 2023-10-14 16:24:01,278 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1749468.0, ans=0.125 2023-10-14 16:24:37,632 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1749608.0, ans=0.1 2023-10-14 16:24:38,719 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.29 vs. limit=15.0 2023-10-14 16:24:40,225 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1749654.6666666667, ans=0.125 2023-10-14 16:24:47,592 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.68 vs. limit=22.5 2023-10-14 16:24:48,201 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1749654.6666666667, ans=0.125 2023-10-14 16:25:14,166 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1749748.0, ans=0.125 2023-10-14 16:25:15,781 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.613e+02 1.890e+02 2.049e+02 2.209e+02 3.016e+02, threshold=4.097e+02, percent-clipped=0.0 2023-10-14 16:25:22,705 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1749794.6666666667, ans=0.2 2023-10-14 16:25:37,754 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1749841.3333333333, ans=0.0 2023-10-14 16:25:40,086 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.74 vs. limit=22.5 2023-10-14 16:25:41,169 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.87 vs. 
limit=22.5 2023-10-14 16:25:44,825 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1749888.0, ans=0.125 2023-10-14 16:26:48,954 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1750121.3333333333, ans=0.1 2023-10-14 16:26:56,389 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1750168.0, ans=0.125 2023-10-14 16:27:09,882 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1750214.6666666667, ans=0.0 2023-10-14 16:27:17,072 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1750214.6666666667, ans=0.125 2023-10-14 16:27:23,113 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.656e+02 1.862e+02 2.008e+02 2.228e+02 3.432e+02, threshold=4.017e+02, percent-clipped=0.0 2023-10-14 16:27:37,657 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 16:27:43,220 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1750308.0, ans=0.125 2023-10-14 16:28:03,942 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1750354.6666666667, ans=0.125 2023-10-14 16:28:04,939 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1750401.3333333333, ans=0.2 2023-10-14 16:28:10,287 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1750401.3333333333, ans=0.125 2023-10-14 16:28:11,525 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1750401.3333333333, ans=0.2 2023-10-14 16:28:23,039 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1750448.0, ans=0.0 2023-10-14 16:28:31,135 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1750494.6666666667, ans=0.1 2023-10-14 16:28:48,013 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1750541.3333333333, ans=0.125 2023-10-14 16:29:26,354 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.504e+02 1.809e+02 2.005e+02 2.346e+02 2.936e+02, threshold=4.010e+02, percent-clipped=0.0 2023-10-14 16:29:26,808 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1750681.3333333333, ans=0.125 2023-10-14 16:29:44,038 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1750774.6666666667, ans=0.125 2023-10-14 16:29:47,518 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1750774.6666666667, ans=0.1 2023-10-14 16:29:52,419 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1750774.6666666667, ans=0.125 2023-10-14 16:29:52,557 INFO [scaling.py:979] (3/4) Whitening: 
name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.22 vs. limit=15.0 2023-10-14 16:30:05,693 INFO [train.py:1031] (3/4) Epoch 28, batch 6500, loss[loss=0.1793, simple_loss=0.2762, pruned_loss=0.04124, over 16883.00 frames. ], tot_loss[loss=0.1858, simple_loss=0.278, pruned_loss=0.04678, over 31567944.10 frames. ], batch size: 72, lr: 1.22e-03, grad_scale: 32.0 2023-10-14 16:30:16,982 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1750868.0, ans=0.025 2023-10-14 16:30:37,082 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1750961.3333333333, ans=0.125 2023-10-14 16:30:39,032 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.54 vs. limit=15.0 2023-10-14 16:30:54,494 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1751008.0, ans=0.0 2023-10-14 16:30:55,931 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.84 vs. limit=10.0 2023-10-14 16:31:34,386 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.545e+02 1.894e+02 2.087e+02 2.340e+02 3.129e+02, threshold=4.173e+02, percent-clipped=0.0 2023-10-14 16:31:35,800 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1751148.0, ans=0.2 2023-10-14 16:32:16,286 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1751334.6666666667, ans=0.07 2023-10-14 16:32:20,801 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 16:32:38,225 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.58 vs. limit=10.0 2023-10-14 16:32:46,135 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1751474.6666666667, ans=0.125 2023-10-14 16:32:49,925 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1751474.6666666667, ans=0.0 2023-10-14 16:32:52,613 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1751474.6666666667, ans=0.0 2023-10-14 16:32:56,479 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1751521.3333333333, ans=0.1 2023-10-14 16:33:27,802 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.542e+02 1.840e+02 1.993e+02 2.191e+02 3.127e+02, threshold=3.986e+02, percent-clipped=0.0 2023-10-14 16:33:30,347 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.44 vs. limit=15.0 2023-10-14 16:33:38,521 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.69 vs. 
limit=15.0 2023-10-14 16:34:02,295 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1751754.6666666667, ans=0.125 2023-10-14 16:34:08,131 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1751801.3333333333, ans=0.125 2023-10-14 16:34:10,054 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1751801.3333333333, ans=0.1 2023-10-14 16:34:13,585 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1751801.3333333333, ans=0.125 2023-10-14 16:34:24,552 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1751848.0, ans=0.2 2023-10-14 16:34:40,910 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1751941.3333333333, ans=0.125 2023-10-14 16:35:14,730 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1752034.6666666667, ans=0.125 2023-10-14 16:35:17,850 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 16:35:26,721 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.442e+02 1.854e+02 2.053e+02 2.343e+02 3.091e+02, threshold=4.107e+02, percent-clipped=0.0 2023-10-14 16:35:29,990 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1752128.0, ans=0.125 2023-10-14 16:35:37,711 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=1752128.0, ans=22.5 2023-10-14 16:35:38,359 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1752128.0, ans=0.0 2023-10-14 16:35:52,988 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.51 vs. limit=15.0 2023-10-14 16:36:03,791 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1752221.3333333333, ans=0.125 2023-10-14 16:36:14,419 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1752268.0, ans=0.0 2023-10-14 16:36:52,983 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 16:36:59,556 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.71 vs. limit=15.0 2023-10-14 16:37:41,007 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.335e+02 1.844e+02 1.978e+02 2.217e+02 2.975e+02, threshold=3.956e+02, percent-clipped=0.0 2023-10-14 16:37:46,665 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.65 vs. 
limit=5.0 2023-10-14 16:37:56,274 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1752641.3333333333, ans=0.125 2023-10-14 16:38:10,785 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1752688.0, ans=0.125 2023-10-14 16:38:11,905 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.97 vs. limit=15.0 2023-10-14 16:38:30,001 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.83 vs. limit=15.0 2023-10-14 16:38:30,081 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=1752781.3333333333, ans=15.0 2023-10-14 16:38:41,895 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1752828.0, ans=0.125 2023-10-14 16:38:43,045 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1752828.0, ans=0.0 2023-10-14 16:38:43,265 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.65 vs. limit=15.0 2023-10-14 16:38:53,978 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1752874.6666666667, ans=0.125 2023-10-14 16:38:55,123 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1752874.6666666667, ans=0.125 2023-10-14 16:39:26,836 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.58 vs. limit=15.0 2023-10-14 16:39:34,118 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.600e+02 1.863e+02 2.034e+02 2.261e+02 2.701e+02, threshold=4.068e+02, percent-clipped=0.0 2023-10-14 16:40:12,151 INFO [train.py:1031] (3/4) Epoch 28, batch 7000, loss[loss=0.1909, simple_loss=0.2881, pruned_loss=0.0469, over 16923.00 frames. ], tot_loss[loss=0.1859, simple_loss=0.2784, pruned_loss=0.04665, over 31876798.73 frames. 
], batch size: 123, lr: 1.22e-03, grad_scale: 32.0 2023-10-14 16:40:25,909 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1753248.0, ans=0.125 2023-10-14 16:40:28,809 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=1753248.0, ans=10.0 2023-10-14 16:40:58,129 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1753341.3333333333, ans=0.2 2023-10-14 16:41:14,993 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1753434.6666666667, ans=0.1 2023-10-14 16:41:17,065 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1753434.6666666667, ans=0.125 2023-10-14 16:41:37,117 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.547e+02 1.900e+02 2.057e+02 2.255e+02 3.096e+02, threshold=4.115e+02, percent-clipped=0.0 2023-10-14 16:41:37,430 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1753481.3333333333, ans=0.0 2023-10-14 16:41:42,834 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1753528.0, ans=0.0 2023-10-14 16:41:51,436 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.01 vs. limit=15.0 2023-10-14 16:41:58,697 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1753574.6666666667, ans=0.125 2023-10-14 16:42:32,299 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.26 vs. limit=15.0 2023-10-14 16:43:05,410 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1753854.6666666667, ans=0.2 2023-10-14 16:43:16,651 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=11.94 vs. limit=22.5 2023-10-14 16:43:29,799 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1753948.0, ans=0.07 2023-10-14 16:43:31,389 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.529e+02 1.884e+02 2.087e+02 2.579e+02 3.779e+02, threshold=4.173e+02, percent-clipped=0.0 2023-10-14 16:43:49,958 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1754041.3333333333, ans=0.05 2023-10-14 16:44:08,390 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1754134.6666666667, ans=0.125 2023-10-14 16:44:10,680 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.74 vs. 
limit=10.0 2023-10-14 16:44:14,531 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1754134.6666666667, ans=0.125 2023-10-14 16:44:29,689 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff3.min_abs, batch_count=1754181.3333333333, ans=0.2 2023-10-14 16:44:46,051 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1754228.0, ans=0.1 2023-10-14 16:44:59,371 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1754274.6666666667, ans=0.125 2023-10-14 16:45:08,735 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.05 vs. limit=15.0 2023-10-14 16:45:15,784 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1754321.3333333333, ans=0.0 2023-10-14 16:45:22,152 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1754368.0, ans=0.125 2023-10-14 16:45:25,820 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1754368.0, ans=0.0 2023-10-14 16:45:42,114 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.494e+02 1.831e+02 2.094e+02 2.306e+02 3.767e+02, threshold=4.188e+02, percent-clipped=0.0 2023-10-14 16:45:43,483 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1754414.6666666667, ans=0.125 2023-10-14 16:46:00,281 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1754508.0, ans=0.0 2023-10-14 16:47:03,843 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1754741.3333333333, ans=0.125 2023-10-14 16:47:13,272 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1754741.3333333333, ans=0.2 2023-10-14 16:47:26,388 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1754788.0, ans=0.0 2023-10-14 16:47:32,797 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1754834.6666666667, ans=0.1 2023-10-14 16:47:43,072 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1754881.3333333333, ans=0.125 2023-10-14 16:47:49,990 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.483e+02 1.820e+02 2.004e+02 2.132e+02 3.572e+02, threshold=4.008e+02, percent-clipped=0.0 2023-10-14 16:48:15,521 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 16:48:40,551 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1755114.6666666667, ans=0.2 2023-10-14 16:48:41,547 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1755114.6666666667, ans=0.07 2023-10-14 16:48:45,218 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1755114.6666666667, ans=0.125 2023-10-14 16:48:51,131 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1755161.3333333333, ans=0.125 2023-10-14 16:48:57,465 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1755161.3333333333, ans=0.0 2023-10-14 16:49:05,520 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.40 vs. limit=15.0 2023-10-14 16:49:31,209 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.29 vs. limit=22.5 2023-10-14 16:49:34,742 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1755348.0, ans=0.125 2023-10-14 16:49:35,833 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1755348.0, ans=0.125 2023-10-14 16:49:37,766 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1755348.0, ans=0.2 2023-10-14 16:49:38,724 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1755348.0, ans=0.125 2023-10-14 16:49:42,424 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1755348.0, ans=0.05 2023-10-14 16:49:42,980 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.534e+02 1.874e+02 2.061e+02 2.316e+02 3.225e+02, threshold=4.121e+02, percent-clipped=0.0 2023-10-14 16:50:22,004 INFO [train.py:1031] (3/4) Epoch 28, batch 7500, loss[loss=0.1797, simple_loss=0.2805, pruned_loss=0.03938, over 16840.00 frames. ], tot_loss[loss=0.1859, simple_loss=0.2782, pruned_loss=0.04675, over 32046463.64 frames. ], batch size: 146, lr: 1.22e-03, grad_scale: 32.0 2023-10-14 16:50:22,756 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.94 vs. 
limit=15.0 2023-10-14 16:50:48,472 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1755628.0, ans=0.125 2023-10-14 16:50:48,693 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1755628.0, ans=0.125 2023-10-14 16:50:51,974 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1755628.0, ans=0.04949747468305833 2023-10-14 16:51:10,622 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1755721.3333333333, ans=0.1 2023-10-14 16:51:40,073 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1755814.6666666667, ans=0.125 2023-10-14 16:51:40,964 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.665e+02 1.885e+02 2.047e+02 2.279e+02 3.374e+02, threshold=4.093e+02, percent-clipped=0.0 2023-10-14 16:52:19,722 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1756001.3333333333, ans=15.0 2023-10-14 16:52:45,891 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1756094.6666666667, ans=0.125 2023-10-14 16:52:53,912 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.72 vs. limit=10.0 2023-10-14 16:53:04,754 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1756141.3333333333, ans=0.125 2023-10-14 16:53:29,323 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1756234.6666666667, ans=0.0 2023-10-14 16:53:37,834 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1756281.3333333333, ans=0.2 2023-10-14 16:53:46,469 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.548e+02 1.792e+02 1.957e+02 2.147e+02 3.009e+02, threshold=3.913e+02, percent-clipped=0.0 2023-10-14 16:54:48,511 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1756514.6666666667, ans=0.125 2023-10-14 16:54:52,640 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.48 vs. 
limit=15.0 2023-10-14 16:54:58,070 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1756561.3333333333, ans=0.125 2023-10-14 16:55:03,543 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1756608.0, ans=0.125 2023-10-14 16:55:09,320 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1756608.0, ans=0.0 2023-10-14 16:55:09,361 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1756608.0, ans=0.1 2023-10-14 16:55:10,406 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1756608.0, ans=0.0 2023-10-14 16:55:16,137 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1756654.6666666667, ans=10.0 2023-10-14 16:55:43,388 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.45 vs. limit=15.0 2023-10-14 16:55:43,723 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.576e+02 1.793e+02 2.025e+02 2.489e+02 3.348e+02, threshold=4.050e+02, percent-clipped=0.0 2023-10-14 16:56:13,383 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.93 vs. limit=22.5 2023-10-14 16:56:17,940 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1756888.0, ans=0.0 2023-10-14 16:56:21,324 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1756934.6666666667, ans=0.125 2023-10-14 16:56:26,957 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1756934.6666666667, ans=0.09899494936611666 2023-10-14 16:56:42,470 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1756981.3333333333, ans=0.2 2023-10-14 16:56:52,361 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.09 vs. limit=15.0 2023-10-14 16:57:04,210 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1757074.6666666667, ans=0.125 2023-10-14 16:57:04,249 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1757074.6666666667, ans=0.125 2023-10-14 16:57:29,938 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1757168.0, ans=0.125 2023-10-14 16:57:30,352 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.32 vs. 
limit=5.0 2023-10-14 16:57:37,411 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1757168.0, ans=0.0 2023-10-14 16:57:47,271 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.635e+02 1.908e+02 2.113e+02 2.337e+02 3.604e+02, threshold=4.226e+02, percent-clipped=0.0 2023-10-14 16:57:49,632 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1757261.3333333333, ans=0.2 2023-10-14 16:57:52,904 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.55 vs. limit=8.0 2023-10-14 16:58:18,656 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.63 vs. limit=15.0 2023-10-14 16:58:41,336 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1757448.0, ans=0.125 2023-10-14 16:58:43,357 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1757448.0, ans=0.0 2023-10-14 16:59:15,523 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.70 vs. limit=15.0 2023-10-14 16:59:47,155 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1757681.3333333333, ans=0.125 2023-10-14 16:59:54,652 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.541e+02 1.725e+02 1.890e+02 2.105e+02 2.843e+02, threshold=3.781e+02, percent-clipped=0.0 2023-10-14 17:00:22,731 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1757821.3333333333, ans=0.125 2023-10-14 17:00:23,731 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1757821.3333333333, ans=0.125 2023-10-14 17:00:28,847 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.41 vs. limit=15.0 2023-10-14 17:00:30,493 INFO [train.py:1031] (3/4) Epoch 28, batch 8000, loss[loss=0.1816, simple_loss=0.2806, pruned_loss=0.04128, over 16876.00 frames. ], tot_loss[loss=0.1848, simple_loss=0.2774, pruned_loss=0.04607, over 32225377.44 frames. 
], batch size: 130, lr: 1.22e-03, grad_scale: 32.0 2023-10-14 17:00:40,450 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1757868.0, ans=0.125 2023-10-14 17:00:40,557 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1757868.0, ans=0.125 2023-10-14 17:00:41,561 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1757914.6666666667, ans=0.0 2023-10-14 17:00:46,268 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1757914.6666666667, ans=0.0 2023-10-14 17:00:47,327 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1757914.6666666667, ans=0.125 2023-10-14 17:01:05,783 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1758008.0, ans=0.125 2023-10-14 17:01:36,590 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.40 vs. limit=15.0 2023-10-14 17:01:44,620 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.471e+02 1.834e+02 2.033e+02 2.233e+02 2.884e+02, threshold=4.067e+02, percent-clipped=0.0 2023-10-14 17:01:49,269 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1758194.6666666667, ans=0.1 2023-10-14 17:02:04,486 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1758241.3333333333, ans=0.0 2023-10-14 17:02:04,512 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1758241.3333333333, ans=0.125 2023-10-14 17:02:05,527 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1758241.3333333333, ans=0.125 2023-10-14 17:02:46,426 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1758428.0, ans=0.0 2023-10-14 17:02:58,668 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1758474.6666666667, ans=0.05 2023-10-14 17:03:05,746 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1758521.3333333333, ans=0.125 2023-10-14 17:03:16,407 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1758521.3333333333, ans=0.1 2023-10-14 17:03:38,517 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1758568.0, ans=0.125 2023-10-14 17:03:54,756 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.584e+02 1.837e+02 1.973e+02 2.212e+02 2.905e+02, threshold=3.945e+02, percent-clipped=0.0 2023-10-14 17:04:02,203 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1758661.3333333333, ans=0.0 2023-10-14 17:04:59,261 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.08 vs. 
limit=12.0 2023-10-14 17:05:18,751 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1758941.3333333333, ans=0.0 2023-10-14 17:05:39,571 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 17:05:42,104 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1759034.6666666667, ans=0.0 2023-10-14 17:05:50,931 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1759081.3333333333, ans=0.0 2023-10-14 17:05:55,186 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.533e+02 1.800e+02 1.977e+02 2.156e+02 2.711e+02, threshold=3.954e+02, percent-clipped=0.0 2023-10-14 17:06:03,194 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1759128.0, ans=0.0 2023-10-14 17:06:28,711 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.66 vs. limit=15.0 2023-10-14 17:06:47,174 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=9.97 vs. limit=22.5 2023-10-14 17:06:55,766 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1759314.6666666667, ans=0.0 2023-10-14 17:07:00,971 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1759361.3333333333, ans=0.1 2023-10-14 17:07:23,459 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1759454.6666666667, ans=0.125 2023-10-14 17:07:49,965 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.442e+02 1.786e+02 1.948e+02 2.121e+02 3.618e+02, threshold=3.896e+02, percent-clipped=0.0 2023-10-14 17:07:53,585 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1759594.6666666667, ans=0.1 2023-10-14 17:08:11,288 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1759641.3333333333, ans=0.2 2023-10-14 17:08:25,255 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1759734.6666666667, ans=0.125 2023-10-14 17:08:49,524 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1759781.3333333333, ans=0.1 2023-10-14 17:09:10,119 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1759874.6666666667, ans=0.125 2023-10-14 17:09:12,162 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1759874.6666666667, ans=0.125 2023-10-14 17:09:22,065 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1759921.3333333333, ans=0.0 2023-10-14 17:09:34,611 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1759968.0, ans=0.0 2023-10-14 17:09:38,295 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1759968.0, ans=0.125 2023-10-14 17:09:41,540 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1760014.6666666667, ans=0.125 2023-10-14 17:09:46,416 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1760014.6666666667, ans=0.125 2023-10-14 17:09:52,023 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.867e+02 2.014e+02 2.166e+02 3.659e+02, threshold=4.029e+02, percent-clipped=0.0 2023-10-14 17:10:08,707 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1760108.0, ans=10.0 2023-10-14 17:10:16,401 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1760108.0, ans=0.0 2023-10-14 17:10:19,427 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1760154.6666666667, ans=0.1 2023-10-14 17:10:24,978 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1760154.6666666667, ans=0.125 2023-10-14 17:10:33,684 INFO [train.py:1031] (3/4) Epoch 28, batch 8500, loss[loss=0.1778, simple_loss=0.2658, pruned_loss=0.04495, over 16615.00 frames. ], tot_loss[loss=0.1849, simple_loss=0.2777, pruned_loss=0.04603, over 32351622.12 frames. ], batch size: 61, lr: 1.22e-03, grad_scale: 32.0 2023-10-14 17:10:34,149 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1760201.3333333333, ans=0.1 2023-10-14 17:10:36,019 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.15 vs. limit=12.0 2023-10-14 17:10:42,514 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1760201.3333333333, ans=0.0 2023-10-14 17:10:59,248 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=1760294.6666666667, ans=10.0 2023-10-14 17:11:01,965 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1760294.6666666667, ans=0.0 2023-10-14 17:11:02,039 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1760294.6666666667, ans=0.125 2023-10-14 17:11:22,994 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1760388.0, ans=0.0 2023-10-14 17:11:50,261 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.26 vs. limit=12.0 2023-10-14 17:11:53,230 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.592e+02 1.920e+02 2.089e+02 2.261e+02 3.381e+02, threshold=4.178e+02, percent-clipped=0.0 2023-10-14 17:12:09,235 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.23 vs. 
limit=12.0 2023-10-14 17:12:28,455 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1760621.3333333333, ans=0.125 2023-10-14 17:13:03,785 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1760761.3333333333, ans=0.125 2023-10-14 17:13:47,252 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1760901.3333333333, ans=0.0 2023-10-14 17:13:59,409 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.440e+02 1.785e+02 1.907e+02 2.084e+02 3.010e+02, threshold=3.813e+02, percent-clipped=0.0 2023-10-14 17:14:11,343 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1760994.6666666667, ans=0.0 2023-10-14 17:14:31,919 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1761088.0, ans=0.125 2023-10-14 17:14:40,533 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1761134.6666666667, ans=0.07 2023-10-14 17:14:59,773 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1761181.3333333333, ans=0.0 2023-10-14 17:15:17,086 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1761274.6666666667, ans=0.2 2023-10-14 17:15:20,867 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=1761274.6666666667, ans=0.95 2023-10-14 17:15:34,343 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.12 vs. limit=15.0 2023-10-14 17:15:41,262 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1761321.3333333333, ans=0.125 2023-10-14 17:16:09,693 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1761414.6666666667, ans=0.04949747468305833 2023-10-14 17:16:13,664 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1761414.6666666667, ans=0.0 2023-10-14 17:16:14,127 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.531e+02 1.775e+02 1.987e+02 2.278e+02 3.138e+02, threshold=3.975e+02, percent-clipped=0.0 2023-10-14 17:16:27,816 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1761508.0, ans=0.0 2023-10-14 17:16:41,607 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.82 vs. 
limit=22.5 2023-10-14 17:16:55,547 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1761601.3333333333, ans=0.07 2023-10-14 17:17:12,196 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1761648.0, ans=0.1 2023-10-14 17:17:14,260 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1761694.6666666667, ans=0.125 2023-10-14 17:17:17,858 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1761694.6666666667, ans=0.1 2023-10-14 17:17:25,369 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1761741.3333333333, ans=0.125 2023-10-14 17:17:33,537 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1761741.3333333333, ans=0.2 2023-10-14 17:17:36,469 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1761788.0, ans=0.1 2023-10-14 17:17:44,291 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1761788.0, ans=0.0 2023-10-14 17:17:51,311 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1761834.6666666667, ans=0.125 2023-10-14 17:17:59,970 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.46 vs. limit=15.0 2023-10-14 17:18:06,826 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.31 vs. limit=10.0 2023-10-14 17:18:08,953 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.545e+02 1.787e+02 1.927e+02 2.145e+02 3.324e+02, threshold=3.854e+02, percent-clipped=0.0 2023-10-14 17:18:09,203 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1761928.0, ans=0.0 2023-10-14 17:18:12,578 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1761928.0, ans=0.0 2023-10-14 17:18:16,421 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.37 vs. limit=12.0 2023-10-14 17:18:31,985 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1762021.3333333333, ans=0.2 2023-10-14 17:18:44,614 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 17:19:11,717 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1762161.3333333333, ans=0.1 2023-10-14 17:19:23,102 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1762208.0, ans=0.125 2023-10-14 17:19:34,845 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.70 vs. 
limit=22.5 2023-10-14 17:19:39,077 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1762254.6666666667, ans=0.125 2023-10-14 17:19:42,960 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1762301.3333333333, ans=0.125 2023-10-14 17:19:44,035 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1762301.3333333333, ans=0.1 2023-10-14 17:19:45,971 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1762301.3333333333, ans=0.125 2023-10-14 17:19:52,266 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1762348.0, ans=0.0 2023-10-14 17:20:02,663 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.619e+02 1.823e+02 2.044e+02 2.248e+02 3.008e+02, threshold=4.088e+02, percent-clipped=0.0 2023-10-14 17:20:39,452 INFO [train.py:1031] (3/4) Epoch 28, batch 9000, loss[loss=0.1788, simple_loss=0.2779, pruned_loss=0.03983, over 16901.00 frames. ], tot_loss[loss=0.1844, simple_loss=0.2772, pruned_loss=0.04578, over 32464777.09 frames. ], batch size: 93, lr: 1.22e-03, grad_scale: 32.0 2023-10-14 17:20:52,066 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1762581.3333333333, ans=0.125 2023-10-14 17:20:53,204 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1762581.3333333333, ans=0.125 2023-10-14 17:20:59,729 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1762581.3333333333, ans=0.1 2023-10-14 17:21:22,250 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1762674.6666666667, ans=0.0 2023-10-14 17:21:37,639 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1762768.0, ans=0.125 2023-10-14 17:21:48,322 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1762814.6666666667, ans=0.125 2023-10-14 17:21:49,293 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1762814.6666666667, ans=0.0 2023-10-14 17:21:56,951 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.436e+02 1.774e+02 1.875e+02 2.108e+02 2.976e+02, threshold=3.749e+02, percent-clipped=0.0 2023-10-14 17:22:20,650 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1762954.6666666667, ans=0.2 2023-10-14 17:22:29,300 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1762954.6666666667, ans=0.05 2023-10-14 17:22:50,042 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1763048.0, ans=0.125 2023-10-14 17:23:16,676 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1763188.0, ans=0.125 2023-10-14 17:23:38,048 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1763281.3333333333, ans=0.125 2023-10-14 17:23:48,674 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.553e+02 1.844e+02 1.987e+02 2.202e+02 2.868e+02, threshold=3.975e+02, percent-clipped=0.0 2023-10-14 17:24:02,597 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1763374.6666666667, ans=0.0 2023-10-14 17:24:07,668 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.31 vs. limit=22.5 2023-10-14 17:24:22,626 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1763468.0, ans=0.1 2023-10-14 17:24:50,439 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1763561.3333333333, ans=0.0 2023-10-14 17:24:53,207 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1763561.3333333333, ans=0.0 2023-10-14 17:24:55,132 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1763608.0, ans=0.125 2023-10-14 17:25:13,313 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1763654.6666666667, ans=0.125 2023-10-14 17:25:18,780 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.09 vs. limit=22.5 2023-10-14 17:25:35,614 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1763748.0, ans=0.125 2023-10-14 17:25:38,919 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.546e+02 1.901e+02 2.045e+02 2.291e+02 2.824e+02, threshold=4.090e+02, percent-clipped=0.0 2023-10-14 17:25:43,397 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.52 vs. limit=10.0 2023-10-14 17:25:48,858 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1763794.6666666667, ans=0.125 2023-10-14 17:25:52,632 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.90 vs. 
limit=12.0 2023-10-14 17:26:00,514 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1763888.0, ans=0.2 2023-10-14 17:26:02,494 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1763888.0, ans=0.125 2023-10-14 17:26:08,556 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1763888.0, ans=0.0 2023-10-14 17:26:12,571 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1763934.6666666667, ans=0.0 2023-10-14 17:26:14,157 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1763934.6666666667, ans=0.0 2023-10-14 17:26:16,336 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1763934.6666666667, ans=0.125 2023-10-14 17:26:32,377 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1763981.3333333333, ans=0.125 2023-10-14 17:27:05,751 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.36 vs. limit=12.0 2023-10-14 17:27:33,803 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1764214.6666666667, ans=0.1 2023-10-14 17:27:39,271 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.654e+02 1.893e+02 2.064e+02 2.271e+02 2.785e+02, threshold=4.128e+02, percent-clipped=0.0 2023-10-14 17:27:46,630 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1764261.3333333333, ans=0.0 2023-10-14 17:28:00,794 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1764308.0, ans=0.2 2023-10-14 17:28:01,708 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1764308.0, ans=0.1 2023-10-14 17:28:11,600 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1764354.6666666667, ans=0.125 2023-10-14 17:28:11,969 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.14 vs. limit=15.0 2023-10-14 17:28:21,830 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1764401.3333333333, ans=0.1 2023-10-14 17:28:27,014 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1764401.3333333333, ans=0.0 2023-10-14 17:28:52,705 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.31 vs. 
limit=22.5 2023-10-14 17:29:04,005 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 17:29:04,107 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 17:29:11,308 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=7.94 vs. limit=15.0 2023-10-14 17:29:23,853 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.16 vs. limit=15.0 2023-10-14 17:29:37,726 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1764681.3333333333, ans=0.125 2023-10-14 17:29:44,913 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.605e+02 1.823e+02 1.983e+02 2.205e+02 3.512e+02, threshold=3.965e+02, percent-clipped=0.0 2023-10-14 17:29:46,093 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.03 vs. limit=22.5 2023-10-14 17:29:55,099 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1764728.0, ans=0.125 2023-10-14 17:30:12,564 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.37 vs. limit=22.5 2023-10-14 17:30:22,980 INFO [train.py:1031] (3/4) Epoch 28, batch 9500, loss[loss=0.1799, simple_loss=0.279, pruned_loss=0.04044, over 16900.00 frames. ], tot_loss[loss=0.1851, simple_loss=0.2779, pruned_loss=0.04615, over 32528890.59 frames. ], batch size: 87, lr: 1.22e-03, grad_scale: 16.0 2023-10-14 17:30:36,239 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1764914.6666666667, ans=0.125 2023-10-14 17:30:37,379 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1764914.6666666667, ans=0.1 2023-10-14 17:30:40,879 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.00 vs. limit=22.5 2023-10-14 17:30:47,643 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1764961.3333333333, ans=0.0 2023-10-14 17:30:54,274 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1764961.3333333333, ans=0.125 2023-10-14 17:31:06,631 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1765008.0, ans=0.0 2023-10-14 17:31:11,928 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1765054.6666666667, ans=0.125 2023-10-14 17:31:18,676 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1765054.6666666667, ans=0.0 2023-10-14 17:31:25,082 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.12 vs. 
limit=12.0 2023-10-14 17:31:37,821 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1765148.0, ans=0.1 2023-10-14 17:31:43,700 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.559e+02 1.868e+02 2.072e+02 2.289e+02 2.956e+02, threshold=4.143e+02, percent-clipped=0.0 2023-10-14 17:31:51,383 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1765194.6666666667, ans=0.0 2023-10-14 17:32:00,163 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1765241.3333333333, ans=0.0 2023-10-14 17:32:12,832 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 17:32:44,747 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1765428.0, ans=0.125 2023-10-14 17:32:52,188 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten.whitening_limit, batch_count=1765428.0, ans=22.5 2023-10-14 17:33:04,902 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1765474.6666666667, ans=0.125 2023-10-14 17:33:13,024 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1765521.3333333333, ans=0.125 2023-10-14 17:33:30,626 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1765568.0, ans=0.1 2023-10-14 17:33:44,017 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.55 vs. limit=15.0 2023-10-14 17:33:46,294 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.76 vs. limit=15.0 2023-10-14 17:33:48,951 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.454e+02 1.827e+02 1.965e+02 2.176e+02 3.278e+02, threshold=3.930e+02, percent-clipped=0.0 2023-10-14 17:33:57,278 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1765661.3333333333, ans=0.125 2023-10-14 17:34:16,134 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1765754.6666666667, ans=0.035 2023-10-14 17:34:16,427 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.23 vs. limit=15.0 2023-10-14 17:34:24,331 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1765801.3333333333, ans=0.0 2023-10-14 17:34:42,077 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.79 vs. limit=15.0 2023-10-14 17:34:55,124 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.06 vs. 
limit=15.0 2023-10-14 17:35:19,187 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1765988.0, ans=0.125 2023-10-14 17:35:29,917 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.12 vs. limit=22.5 2023-10-14 17:35:42,321 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1766081.3333333333, ans=0.125 2023-10-14 17:35:46,626 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.608e+02 1.876e+02 2.076e+02 2.280e+02 3.008e+02, threshold=4.152e+02, percent-clipped=0.0 2023-10-14 17:35:54,517 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1766128.0, ans=0.125 2023-10-14 17:36:01,760 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1766174.6666666667, ans=0.05 2023-10-14 17:36:06,685 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1766174.6666666667, ans=0.125 2023-10-14 17:36:09,611 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1766174.6666666667, ans=0.125 2023-10-14 17:36:12,205 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1766221.3333333333, ans=0.125 2023-10-14 17:36:17,287 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1766221.3333333333, ans=0.125 2023-10-14 17:36:23,242 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1766221.3333333333, ans=0.125 2023-10-14 17:36:26,593 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.75 vs. limit=12.0 2023-10-14 17:36:36,007 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.52 vs. limit=15.0 2023-10-14 17:36:59,295 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1766361.3333333333, ans=0.0 2023-10-14 17:37:04,064 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1766408.0, ans=0.0 2023-10-14 17:37:07,599 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.88 vs. 
limit=15.0 2023-10-14 17:37:26,925 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1766501.3333333333, ans=0.05 2023-10-14 17:37:41,671 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1766548.0, ans=0.125 2023-10-14 17:37:50,906 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=1766548.0, ans=0.5 2023-10-14 17:37:55,178 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.576e+02 1.814e+02 1.929e+02 2.085e+02 3.132e+02, threshold=3.857e+02, percent-clipped=0.0 2023-10-14 17:38:24,113 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1766688.0, ans=0.125 2023-10-14 17:38:28,712 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1766688.0, ans=0.0 2023-10-14 17:38:57,767 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1766828.0, ans=0.125 2023-10-14 17:39:35,221 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1767014.6666666667, ans=0.125 2023-10-14 17:39:36,096 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1767014.6666666667, ans=0.0 2023-10-14 17:39:41,325 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1767014.6666666667, ans=0.0 2023-10-14 17:39:48,453 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.519e+02 1.786e+02 1.935e+02 2.199e+02 2.905e+02, threshold=3.870e+02, percent-clipped=0.0 2023-10-14 17:39:52,588 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1767061.3333333333, ans=0.0 2023-10-14 17:40:07,074 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=1767108.0, ans=0.05 2023-10-14 17:40:19,072 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1767154.6666666667, ans=0.1
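The [train.py:1031] record immediately below reports a per-batch loss and a running tot_loss, each with a frame count. Two relations are visible in the logged numbers themselves: loss is consistent with 0.5 * simple_loss + pruned_loss (0.5 * 0.2837 + 0.05003 = 0.1919), and the fractional "over 32577267.45 frames" suggests a decayed, frame-weighted running sum rather than a plain total. The sketch below illustrates bookkeeping consistent with those numbers; the class name and decay constant are invented, and this is not icefall's actual tracker.

```python
# Minimal sketch of frame-weighted loss bookkeeping consistent with the
# [train.py:1031] records. RunningLoss and decay=0.999 are illustrative
# assumptions; only the weighting-by-frames behaviour is the point.
class RunningLoss:
    def __init__(self, decay: float = 0.999):
        self.decay = decay
        self.loss_sum = 0.0   # decayed sum of per-frame loss * frames
        self.frames = 0.0     # decayed (hence fractional) frame count

    def update(self, batch_loss: float, batch_frames: float) -> float:
        self.loss_sum = self.decay * self.loss_sum + batch_loss * batch_frames
        self.frames = self.decay * self.frames + batch_frames
        return self.loss_sum / self.frames   # per-frame "tot_loss"

def combined_loss(simple_loss: float, pruned_loss: float) -> float:
    # matches the logged records, e.g. 0.5*0.2837 + 0.05003 ~= 0.1919
    return 0.5 * simple_loss + pruned_loss

tracker = RunningLoss()
print(combined_loss(0.2837, 0.05003))   # ~0.1919, as in the record below
print(tracker.update(0.1919, 16776.0))  # first batch: tot_loss == loss
```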
2023-10-14 17:40:21,754 INFO [train.py:1031] (3/4) Epoch 28, batch 10000, loss[loss=0.1919, simple_loss=0.2837, pruned_loss=0.05003, over 16776.00 frames. ], tot_loss[loss=0.1846, simple_loss=0.2773, pruned_loss=0.04596, over 32577267.45 frames. ], batch size: 188, lr: 1.22e-03, grad_scale: 16.0 2023-10-14 17:40:24,970 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1767201.3333333333, ans=0.125 2023-10-14 17:40:40,552 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1767248.0, ans=0.09899494936611666 2023-10-14 17:40:55,117 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1767341.3333333333, ans=0.2 2023-10-14 17:40:57,870 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1767341.3333333333, ans=0.1 2023-10-14 17:41:13,831 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1767388.0, ans=0.1 2023-10-14 17:41:13,923 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1767388.0, ans=0.125 2023-10-14 17:41:28,578 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1767434.6666666667, ans=0.125 2023-10-14 17:41:28,909 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.01 vs. limit=15.0 2023-10-14 17:41:29,753 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1767481.3333333333, ans=0.95 2023-10-14 17:41:44,040 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.568e+02 1.874e+02 2.049e+02 2.321e+02 2.850e+02, threshold=4.099e+02, percent-clipped=0.0 2023-10-14 17:41:52,730 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1767528.0, ans=0.0 2023-10-14 17:42:04,319 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.max_positive, batch_count=1767574.6666666667, ans=0.95 2023-10-14 17:42:33,359 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.96 vs.
limit=15.0 2023-10-14 17:42:40,750 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1767714.6666666667, ans=0.0 2023-10-14 17:42:43,190 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1767714.6666666667, ans=0.125 2023-10-14 17:42:54,240 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1767761.3333333333, ans=0.0 2023-10-14 17:43:32,958 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1767901.3333333333, ans=10.0 2023-10-14 17:43:50,217 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.541e+02 1.870e+02 2.097e+02 2.339e+02 3.225e+02, threshold=4.193e+02, percent-clipped=0.0 2023-10-14 17:43:51,660 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1767994.6666666667, ans=0.0 2023-10-14 17:43:58,592 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1768041.3333333333, ans=0.125 2023-10-14 17:44:13,091 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1768088.0, ans=0.1 2023-10-14 17:45:25,318 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1768321.3333333333, ans=0.1 2023-10-14 17:45:37,668 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1768368.0, ans=0.0 2023-10-14 17:45:37,914 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=4.79 vs. limit=15.0 2023-10-14 17:45:39,101 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1768368.0, ans=0.125 2023-10-14 17:45:51,707 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1768414.6666666667, ans=0.125 2023-10-14 17:45:58,265 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.615e+02 1.880e+02 2.025e+02 2.235e+02 3.706e+02, threshold=4.049e+02, percent-clipped=0.0 2023-10-14 17:46:08,959 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.46 vs. limit=22.5 2023-10-14 17:46:24,049 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1768554.6666666667, ans=0.2 2023-10-14 17:46:26,132 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1768554.6666666667, ans=0.125 2023-10-14 17:46:55,311 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=1768648.0, ans=0.025 2023-10-14 17:46:55,524 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.87 vs. 
limit=15.0 2023-10-14 17:47:04,909 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1768694.6666666667, ans=0.125 2023-10-14 17:47:12,239 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1768741.3333333333, ans=0.0 2023-10-14 17:47:18,010 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=1768741.3333333333, ans=10.0 2023-10-14 17:47:38,799 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1768834.6666666667, ans=0.0 2023-10-14 17:47:44,045 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1768834.6666666667, ans=0.5 2023-10-14 17:47:48,073 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1768881.3333333333, ans=0.2 2023-10-14 17:47:49,085 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1768881.3333333333, ans=0.125 2023-10-14 17:47:49,092 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1768881.3333333333, ans=0.125 2023-10-14 17:47:58,119 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1768881.3333333333, ans=0.125 2023-10-14 17:48:02,632 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.469e+02 1.783e+02 1.969e+02 2.150e+02 2.986e+02, threshold=3.939e+02, percent-clipped=0.0 2023-10-14 17:48:04,923 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1768928.0, ans=0.015 2023-10-14 17:48:05,147 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1768928.0, ans=0.0 2023-10-14 17:48:06,201 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1768928.0, ans=0.125 2023-10-14 17:49:05,490 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1769161.3333333333, ans=0.0 2023-10-14 17:49:09,909 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.95 vs. 
limit=15.0 2023-10-14 17:49:53,554 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1769301.3333333333, ans=0.125 2023-10-14 17:50:03,576 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1769348.0, ans=0.125 2023-10-14 17:50:07,254 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1769394.6666666667, ans=0.07 2023-10-14 17:50:10,034 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.393e+02 1.860e+02 2.015e+02 2.256e+02 3.187e+02, threshold=4.030e+02, percent-clipped=0.0 2023-10-14 17:50:11,399 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1769394.6666666667, ans=0.125 2023-10-14 17:50:36,316 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1769488.0, ans=0.125 2023-10-14 17:50:43,865 INFO [train.py:1031] (3/4) Epoch 28, batch 10500, loss[loss=0.1624, simple_loss=0.2642, pruned_loss=0.03025, over 16832.00 frames. ], tot_loss[loss=0.1852, simple_loss=0.2779, pruned_loss=0.04625, over 32614905.20 frames. ], batch size: 175, lr: 1.22e-03, grad_scale: 32.0 2023-10-14 17:50:48,231 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1769534.6666666667, ans=0.0 2023-10-14 17:51:10,294 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1769628.0, ans=0.0 2023-10-14 17:51:10,788 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.58 vs. limit=15.0 2023-10-14 17:51:27,421 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1769674.6666666667, ans=0.09899494936611666 2023-10-14 17:51:31,702 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.68 vs. limit=12.0 2023-10-14 17:51:34,754 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.58 vs. limit=15.0 2023-10-14 17:51:39,312 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1769721.3333333333, ans=0.0 2023-10-14 17:51:39,368 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1769721.3333333333, ans=0.125 2023-10-14 17:51:51,994 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1769768.0, ans=0.0 2023-10-14 17:51:52,383 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.83 vs. limit=15.0 2023-10-14 17:51:55,252 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1769768.0, ans=0.0
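The [scaling.py:199] ScheduledFloat records that dominate this log report, for each named module constant, the value ("ans") in effect at the current batch_count. A minimal sketch of such a schedule follows, assuming piecewise-linear interpolation between (batch_count, value) breakpoints with clamping at both ends; the class is a simplified stand-in, and the breakpoints in the example are invented.

```python
# Sketch of a ScheduledFloat-style value: piecewise-linear in batch_count
# between breakpoints, clamped outside them. Breakpoints are made up;
# only the interpolate-and-clamp behaviour is the point.
import bisect

class ScheduledFloat:
    def __init__(self, *points):          # points: (batch_count, value)
        self.xs = [p[0] for p in points]  # breakpoint positions, increasing
        self.ys = [p[1] for p in points]  # values at those positions

    def value(self, batch_count: float) -> float:
        if batch_count <= self.xs[0]:
            return self.ys[0]             # clamp on the left
        if batch_count >= self.xs[-1]:
            return self.ys[-1]            # clamp on the right
        i = bisect.bisect_right(self.xs, batch_count)
        x0, x1 = self.xs[i - 1], self.xs[i]
        y0, y1 = self.ys[i - 1], self.ys[i]
        return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

# A skip-rate annealed from 0.1 to 0.0 over the first 20k batches would
# long since have reached its final value at batch_count ~ 1.77e6:
skip_rate = ScheduledFloat((0.0, 0.1), (20000.0, 0.0))
print(skip_rate.value(1769768.0))  # -> 0.0, matching "ans=0.0" entries
```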
2023-10-14 17:51:55,717 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.45 vs. limit=15.0 2023-10-14 17:52:03,654 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1769814.6666666667, ans=0.125 2023-10-14 17:52:07,787 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=9.81 vs. limit=12.0 2023-10-14 17:52:09,418 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1769814.6666666667, ans=0.0 2023-10-14 17:52:20,225 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.537e+02 1.838e+02 1.981e+02 2.104e+02 2.965e+02, threshold=3.961e+02, percent-clipped=0.0 2023-10-14 17:53:19,234 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1770048.0, ans=0.0 2023-10-14 17:53:21,978 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.22 vs. limit=15.0 2023-10-14 17:53:23,761 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1770048.0, ans=0.0 2023-10-14 17:54:02,399 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1770188.0, ans=0.2 2023-10-14 17:54:11,555 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1770234.6666666667, ans=0.1 2023-10-14 17:54:16,940 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1770281.3333333333, ans=0.0 2023-10-14 17:54:19,034 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1770281.3333333333, ans=0.0 2023-10-14 17:54:30,982 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.514e+02 1.785e+02 1.996e+02 2.167e+02 2.714e+02, threshold=3.992e+02, percent-clipped=0.0 2023-10-14 17:54:31,399 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1770328.0, ans=0.0 2023-10-14 17:54:43,129 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1770374.6666666667, ans=0.1 2023-10-14 17:54:59,428 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 17:55:04,393 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1770421.3333333333, ans=0.2 2023-10-14 17:55:16,558 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1770468.0, ans=0.125 2023-10-14 17:55:26,041 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1770514.6666666667, ans=0.125 2023-10-14 17:55:45,594 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1770608.0, ans=0.0 2023-10-14 17:56:06,510 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 17:56:08,979 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1770701.3333333333,
ans=0.0 2023-10-14 17:56:11,543 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.29 vs. limit=15.0 2023-10-14 17:56:15,899 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.80 vs. limit=22.5 2023-10-14 17:56:26,555 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_na.min_abs, batch_count=1770748.0, ans=0.02 2023-10-14 17:56:41,334 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.580e+02 1.860e+02 2.084e+02 2.369e+02 3.337e+02, threshold=4.167e+02, percent-clipped=0.0 2023-10-14 17:56:45,034 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=1770794.6666666667, ans=0.05 2023-10-14 17:57:09,392 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1770888.0, ans=0.125 2023-10-14 17:57:31,621 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1770981.3333333333, ans=0.0 2023-10-14 17:57:35,811 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1770981.3333333333, ans=0.1 2023-10-14 17:57:39,038 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1770981.3333333333, ans=0.0 2023-10-14 17:58:08,028 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1771121.3333333333, ans=0.025 2023-10-14 17:58:20,519 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1771168.0, ans=0.125 2023-10-14 17:58:45,598 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1771261.3333333333, ans=0.125 2023-10-14 17:58:50,017 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.61 vs. 
limit=22.5 2023-10-14 17:58:50,503 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.553e+02 1.828e+02 2.003e+02 2.303e+02 3.052e+02, threshold=4.006e+02, percent-clipped=0.0 2023-10-14 17:59:38,013 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1771448.0, ans=0.125 2023-10-14 17:59:53,768 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1771494.6666666667, ans=0.125 2023-10-14 18:00:03,162 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1771541.3333333333, ans=0.125 2023-10-14 18:00:22,197 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1771634.6666666667, ans=0.04949747468305833 2023-10-14 18:00:33,164 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1771681.3333333333, ans=0.0 2023-10-14 18:00:34,362 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1771681.3333333333, ans=0.125 2023-10-14 18:00:37,925 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1771681.3333333333, ans=0.0 2023-10-14 18:00:46,225 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1771728.0, ans=0.125 2023-10-14 18:00:50,010 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.539e+02 1.859e+02 2.052e+02 2.288e+02 2.934e+02, threshold=4.103e+02, percent-clipped=0.0 2023-10-14 18:00:56,978 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1771774.6666666667, ans=0.125 2023-10-14 18:00:57,143 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1771774.6666666667, ans=0.125 2023-10-14 18:01:22,089 INFO [train.py:1031] (3/4) Epoch 28, batch 11000, loss[loss=0.1878, simple_loss=0.2807, pruned_loss=0.04747, over 16891.00 frames. ], tot_loss[loss=0.1851, simple_loss=0.2777, pruned_loss=0.04619, over 32665805.40 frames. ], batch size: 87, lr: 1.22e-03, grad_scale: 16.0 2023-10-14 18:01:46,371 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1771961.3333333333, ans=0.125 2023-10-14 18:01:48,877 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1771961.3333333333, ans=0.0 2023-10-14 18:01:53,436 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1771961.3333333333, ans=0.0
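The [optim.py:471] records, such as the one at 17:58:50 above, print five grad-norm quantiles (min, 25%, median, 75%, max) together with a clipping threshold, and the logged numbers are consistent with threshold = Clipping_scale times the median (2.0 * 2.003e+02 = 4.006e+02). The sketch below reproduces those statistics; keeping recent per-batch norms in a plain buffer is an assumption for illustration, not icefall's actual optimizer code.

```python
# Sketch of the statistics in the [optim.py:471] records: quantiles of
# recent gradient norms, a threshold of clipping_scale * median, and the
# share of norms exceeding it ("percent-clipped"). Buffer mechanics are
# illustrative assumptions.
import torch

def grad_norm_stats(norms: torch.Tensor, clipping_scale: float = 2.0):
    # five quantiles: min, 25%, median, 75%, max -- as printed in the log
    q = torch.quantile(norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    threshold = clipping_scale * q[2]
    percent_clipped = 100.0 * (norms > threshold).float().mean()
    return q, threshold, percent_clipped

norms = 190.0 + 25.0 * torch.randn(500).abs()  # stand-in for recent norms
q, thr, pct = grad_norm_stats(norms)
print(q.tolist(), thr.item(), pct.item())
```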
2023-10-14 18:01:54,757 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.22 vs. limit=10.0 2023-10-14 18:02:02,477 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1772008.0, ans=0.125 2023-10-14 18:02:22,365 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1772054.6666666667, ans=0.0 2023-10-14 18:02:26,832 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=10.90 vs. limit=22.5 2023-10-14 18:02:55,906 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1772194.6666666667, ans=0.125 2023-10-14 18:02:56,528 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.603e+02 1.846e+02 2.026e+02 2.213e+02 2.856e+02, threshold=4.052e+02, percent-clipped=0.0 2023-10-14 18:03:29,800 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.90 vs. limit=12.0 2023-10-14 18:03:48,506 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1772381.3333333333, ans=0.125 2023-10-14 18:03:50,880 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1772381.3333333333, ans=0.125 2023-10-14 18:03:55,445 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0 2023-10-14 18:04:02,698 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1772428.0, ans=0.0 2023-10-14 18:04:04,733 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1772428.0, ans=0.1 2023-10-14 18:04:30,799 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1772521.3333333333, ans=0.0 2023-10-14 18:04:34,217 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1772521.3333333333, ans=0.125 2023-10-14 18:04:34,363 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1772521.3333333333, ans=0.0 2023-10-14 18:04:51,165 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1772568.0, ans=0.125 2023-10-14 18:04:53,680 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 18:04:56,842 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1772614.6666666667, ans=0.125 2023-10-14 18:05:14,218 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.582e+02 1.767e+02 1.964e+02 2.270e+02 3.947e+02, threshold=3.928e+02, percent-clipped=0.0 2023-10-14 18:05:20,313 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1772661.3333333333, ans=0.125 2023-10-14 18:05:53,180 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.74 vs.
limit=15.0 2023-10-14 18:05:57,018 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1772801.3333333333, ans=0.2 2023-10-14 18:06:01,376 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1772801.3333333333, ans=0.0 2023-10-14 18:06:44,668 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1772988.0, ans=0.125 2023-10-14 18:06:51,592 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 18:06:54,983 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1773034.6666666667, ans=0.1 2023-10-14 18:06:58,026 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1773034.6666666667, ans=0.0 2023-10-14 18:07:27,436 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1773128.0, ans=0.2 2023-10-14 18:07:28,008 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.447e+02 1.830e+02 2.022e+02 2.274e+02 3.400e+02, threshold=4.044e+02, percent-clipped=0.0 2023-10-14 18:07:34,077 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1773128.0, ans=0.2 2023-10-14 18:08:03,935 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1773268.0, ans=0.0 2023-10-14 18:08:04,406 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.88 vs. limit=15.0 2023-10-14 18:08:21,166 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.59 vs. limit=5.0 2023-10-14 18:08:26,542 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1773361.3333333333, ans=0.125 2023-10-14 18:08:27,438 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1773361.3333333333, ans=0.125 2023-10-14 18:08:51,125 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=3.75 vs. 
limit=10.0 2023-10-14 18:09:08,287 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1773454.6666666667, ans=0.125 2023-10-14 18:09:08,410 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1773454.6666666667, ans=0.125 2023-10-14 18:09:15,009 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1773501.3333333333, ans=0.125 2023-10-14 18:09:18,469 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1773501.3333333333, ans=0.1 2023-10-14 18:09:33,355 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1773548.0, ans=0.1 2023-10-14 18:09:38,079 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1773548.0, ans=0.1 2023-10-14 18:09:43,703 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.82 vs. limit=10.0 2023-10-14 18:09:48,480 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.628e+02 1.861e+02 2.016e+02 2.275e+02 3.565e+02, threshold=4.032e+02, percent-clipped=0.0 2023-10-14 18:09:52,342 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1773594.6666666667, ans=0.125 2023-10-14 18:10:01,516 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.95 vs. limit=10.0 2023-10-14 18:10:31,367 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.49 vs. limit=15.0 2023-10-14 18:10:35,678 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.56 vs. limit=22.5 2023-10-14 18:10:58,363 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1773828.0, ans=0.0 2023-10-14 18:11:01,771 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-14 18:12:25,794 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.524e+02 1.999e+02 2.173e+02 2.454e+02 4.125e+02, threshold=4.347e+02, percent-clipped=1.0 2023-10-14 18:12:57,917 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1774154.6666666667, ans=0.1 2023-10-14 18:12:59,575 INFO [train.py:1031] (3/4) Epoch 28, batch 11500, loss[loss=0.1995, simple_loss=0.3005, pruned_loss=0.04923, over 16677.00 frames. ], tot_loss[loss=0.1849, simple_loss=0.2775, pruned_loss=0.04614, over 32687441.94 frames. ], batch size: 202, lr: 1.22e-03, grad_scale: 32.0 2023-10-14 18:14:14,477 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.73 vs. limit=12.0 2023-10-14 18:14:24,762 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.27 vs. 
limit=22.5 2023-10-14 18:14:44,026 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1774528.0, ans=0.125 2023-10-14 18:14:47,626 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.570e+02 1.807e+02 2.027e+02 2.215e+02 3.004e+02, threshold=4.055e+02, percent-clipped=0.0 2023-10-14 18:14:54,858 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1774528.0, ans=0.2 2023-10-14 18:14:54,927 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1774528.0, ans=0.2 2023-10-14 18:15:09,204 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1774574.6666666667, ans=0.125 2023-10-14 18:15:10,030 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff3.min_abs, batch_count=1774574.6666666667, ans=0.2 2023-10-14 18:15:45,453 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1774714.6666666667, ans=0.04949747468305833 2023-10-14 18:16:13,145 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1774808.0, ans=0.1 2023-10-14 18:16:23,713 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1774808.0, ans=0.0 2023-10-14 18:16:36,697 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=3.138e-02 2023-10-14 18:17:02,100 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1774948.0, ans=0.1 2023-10-14 18:17:33,424 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.525e+02 1.760e+02 1.954e+02 2.146e+02 2.843e+02, threshold=3.907e+02, percent-clipped=0.0 2023-10-14 18:17:37,049 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1774994.6666666667, ans=0.1 2023-10-14 18:17:38,130 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1775041.3333333333, ans=0.125 2023-10-14 18:17:48,528 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1775041.3333333333, ans=0.125 2023-10-14 18:18:00,379 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1775088.0, ans=0.1 2023-10-14 18:18:02,543 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1775088.0, ans=0.1 2023-10-14 18:18:08,944 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1775134.6666666667, ans=0.125 2023-10-14 18:18:19,790 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1775134.6666666667, ans=0.125 2023-10-14 18:18:35,935 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1775228.0, ans=0.125 2023-10-14 18:18:48,098 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1775228.0, ans=0.125 2023-10-14 18:19:03,250 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1775321.3333333333, ans=0.2 2023-10-14 18:19:34,721 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1775368.0, ans=0.125 2023-10-14 18:19:34,740 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1775368.0, ans=0.04949747468305833 2023-10-14 18:19:34,807 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1775368.0, ans=0.125 2023-10-14 18:20:05,220 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.521e+02 1.845e+02 1.997e+02 2.188e+02 2.742e+02, threshold=3.995e+02, percent-clipped=0.0 2023-10-14 18:20:05,624 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1775461.3333333333, ans=0.1 2023-10-14 18:20:18,477 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1775508.0, ans=0.0 2023-10-14 18:20:23,293 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.00 vs. limit=15.0 2023-10-14 18:20:29,863 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1775508.0, ans=0.0 2023-10-14 18:20:52,536 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 18:21:28,245 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1775694.6666666667, ans=0.125 2023-10-14 18:21:47,465 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1775741.3333333333, ans=0.2 2023-10-14 18:21:51,633 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1775741.3333333333, ans=0.125 2023-10-14 18:22:03,511 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1775788.0, ans=0.125 2023-10-14 18:22:46,576 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1775928.0, ans=0.125 2023-10-14 18:22:49,298 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.573e+02 1.849e+02 2.030e+02 2.259e+02 3.116e+02, threshold=4.059e+02, percent-clipped=0.0 2023-10-14 18:23:03,460 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1775974.6666666667, ans=0.1 2023-10-14 18:23:08,084 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1775974.6666666667, ans=0.1 2023-10-14 18:23:18,620 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1776021.3333333333, ans=0.125 2023-10-14 18:23:21,093 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1776021.3333333333, ans=0.1 2023-10-14 18:23:28,904 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1776068.0, ans=0.2 2023-10-14 18:24:05,721 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1776161.3333333333, ans=0.0 2023-10-14 18:24:06,724 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1776161.3333333333, ans=0.125 2023-10-14 18:24:16,756 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1776208.0, ans=0.0 2023-10-14 18:24:27,442 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1776254.6666666667, ans=0.125 2023-10-14 18:24:41,917 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1776301.3333333333, ans=0.125 2023-10-14 18:24:48,872 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1776301.3333333333, ans=0.125 2023-10-14 18:24:53,341 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1776301.3333333333, ans=0.2 2023-10-14 18:25:13,596 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1776348.0, ans=0.0 2023-10-14 18:25:24,364 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.422e+02 1.844e+02 2.117e+02 2.402e+02 3.426e+02, threshold=4.235e+02, percent-clipped=0.0 2023-10-14 18:25:38,428 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1776441.3333333333, ans=0.1 2023-10-14 18:25:39,217 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1776441.3333333333, ans=0.125 2023-10-14 18:25:59,538 INFO [train.py:1031] (3/4) Epoch 28, batch 12000, loss[loss=0.2029, simple_loss=0.2969, pruned_loss=0.05442, over 16834.00 frames. ], tot_loss[loss=0.1845, simple_loss=0.2775, pruned_loss=0.0458, over 32725825.62 frames. ], batch size: 155, lr: 1.22e-03, grad_scale: 16.0 2023-10-14 18:26:24,741 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1776581.3333333333, ans=0.125 2023-10-14 18:26:49,649 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff3.min_abs, batch_count=1776674.6666666667, ans=0.2 2023-10-14 18:26:53,959 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1776674.6666666667, ans=0.1 2023-10-14 18:27:12,027 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1776721.3333333333, ans=0.125 2023-10-14 18:27:15,852 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1776768.0, ans=0.125 2023-10-14 18:27:36,268 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.98 vs. 
limit=12.0 2023-10-14 18:27:51,385 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1776861.3333333333, ans=0.125 2023-10-14 18:27:51,939 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.67 vs. limit=10.0 2023-10-14 18:27:52,150 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.517e+02 1.796e+02 1.974e+02 2.184e+02 3.243e+02, threshold=3.948e+02, percent-clipped=0.0 2023-10-14 18:27:55,762 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1776861.3333333333, ans=0.125 2023-10-14 18:28:20,039 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1776954.6666666667, ans=0.125 2023-10-14 18:28:23,557 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.95 vs. limit=6.0 2023-10-14 18:28:30,747 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.78 vs. limit=15.0 2023-10-14 18:28:43,976 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1777048.0, ans=0.125 2023-10-14 18:29:07,695 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1777094.6666666667, ans=0.125 2023-10-14 18:29:15,027 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1777094.6666666667, ans=0.125 2023-10-14 18:29:54,188 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1777234.6666666667, ans=0.125 2023-10-14 18:30:14,841 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1777328.0, ans=0.125 2023-10-14 18:30:21,096 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.503e+02 1.868e+02 2.112e+02 2.473e+02 3.637e+02, threshold=4.224e+02, percent-clipped=0.0 2023-10-14 18:31:52,433 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1777608.0, ans=0.0 2023-10-14 18:32:12,286 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1777701.3333333333, ans=0.125 2023-10-14 18:32:12,608 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.45 vs. limit=15.0 2023-10-14 18:32:29,374 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.58 vs. 
limit=10.0 2023-10-14 18:32:31,333 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1777748.0, ans=0.125 2023-10-14 18:32:32,323 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1777748.0, ans=0.0 2023-10-14 18:32:34,473 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1777748.0, ans=0.0 2023-10-14 18:32:49,011 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1777794.6666666667, ans=0.125 2023-10-14 18:32:56,315 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.629e+02 1.824e+02 1.979e+02 2.184e+02 2.986e+02, threshold=3.959e+02, percent-clipped=0.0 2023-10-14 18:33:01,767 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 18:33:09,520 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1777841.3333333333, ans=0.0 2023-10-14 18:33:11,740 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1777841.3333333333, ans=0.125 2023-10-14 18:33:19,859 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1777888.0, ans=0.1 2023-10-14 18:34:10,141 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten.whitening_limit, batch_count=1777981.3333333333, ans=15.0 2023-10-14 18:34:18,713 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1778028.0, ans=0.0 2023-10-14 18:34:45,997 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1778121.3333333333, ans=0.1 2023-10-14 18:34:58,244 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1778168.0, ans=0.125 2023-10-14 18:35:12,659 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.12 vs. limit=22.5 2023-10-14 18:35:13,765 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1778214.6666666667, ans=0.1 2023-10-14 18:35:13,939 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.76 vs. limit=22.5 2023-10-14 18:35:26,069 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1778261.3333333333, ans=0.0 2023-10-14 18:35:32,634 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.470e+02 1.864e+02 2.012e+02 2.236e+02 3.219e+02, threshold=4.023e+02, percent-clipped=0.0 2023-10-14 18:35:35,583 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1778261.3333333333, ans=0.125 2023-10-14 18:35:48,985 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.60 vs. 
limit=22.5 2023-10-14 18:35:52,336 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.01 vs. limit=10.0 2023-10-14 18:36:06,516 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1778401.3333333333, ans=0.2 2023-10-14 18:36:21,066 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.37 vs. limit=22.5 2023-10-14 18:36:28,314 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1778494.6666666667, ans=0.125 2023-10-14 18:36:36,555 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1778541.3333333333, ans=0.125 2023-10-14 18:37:01,125 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1778634.6666666667, ans=0.125 2023-10-14 18:37:20,647 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.42 vs. limit=10.0 2023-10-14 18:37:32,321 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.662e+02 1.919e+02 2.160e+02 2.359e+02 3.188e+02, threshold=4.320e+02, percent-clipped=0.0 2023-10-14 18:37:37,300 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1778774.6666666667, ans=0.0 2023-10-14 18:37:37,367 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1778774.6666666667, ans=0.125 2023-10-14 18:37:46,985 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.67 vs. limit=15.0 2023-10-14 18:37:49,578 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1778821.3333333333, ans=0.1 2023-10-14 18:37:59,729 INFO [train.py:1031] (3/4) Epoch 28, batch 12500, loss[loss=0.1998, simple_loss=0.2894, pruned_loss=0.05509, over 16061.00 frames. ], tot_loss[loss=0.1845, simple_loss=0.2772, pruned_loss=0.0459, over 32709882.60 frames. 
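[Editorial check, inferred from the printed numbers rather than stated by the log: in every tot_loss record here the components satisfy loss ~= 0.5 * simple_loss + pruned_loss, e.g. 0.5 * 0.2772 + 0.0459 = 0.1845 in the record just above. A quick verification over the Epoch 28 records:]

```python
# Inferred relation (not stated in the log): loss = 0.5 * simple_loss + pruned_loss.
records = [
    (0.1845, 0.2775, 0.0458),  # Epoch 28, batch 12000
    (0.1845, 0.2772, 0.0459),  # Epoch 28, batch 12500
]
for loss, simple, pruned in records:
    combined = 0.5 * simple + pruned
    assert abs(combined - loss) < 5e-4, (loss, combined)
    print(f"{loss:.4f} ~= 0.5*{simple:.4f} + {pruned:.4f} = {combined:.4f}")
```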
], batch size: 296, lr: 1.21e-03, grad_scale: 16.0 2023-10-14 18:38:04,690 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 18:38:19,978 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1778914.6666666667, ans=0.1 2023-10-14 18:38:43,552 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1779054.6666666667, ans=0.125 2023-10-14 18:38:46,394 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1779054.6666666667, ans=0.2 2023-10-14 18:38:48,013 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1779054.6666666667, ans=0.1 2023-10-14 18:39:00,878 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1779101.3333333333, ans=0.0 2023-10-14 18:39:23,078 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.616e+02 1.850e+02 2.015e+02 2.266e+02 3.062e+02, threshold=4.030e+02, percent-clipped=0.0 2023-10-14 18:39:42,126 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.87 vs. limit=22.5 2023-10-14 18:39:44,741 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1779288.0, ans=0.0 2023-10-14 18:40:06,311 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1779381.3333333333, ans=10.0 2023-10-14 18:40:15,982 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1779428.0, ans=0.2 2023-10-14 18:40:18,649 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1779428.0, ans=0.1 2023-10-14 18:40:52,040 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1779568.0, ans=0.0 2023-10-14 18:41:06,273 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1779614.6666666667, ans=0.1 2023-10-14 18:41:20,179 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.542e+02 1.844e+02 1.985e+02 2.171e+02 2.842e+02, threshold=3.971e+02, percent-clipped=0.0 2023-10-14 18:41:38,570 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1779754.6666666667, ans=0.125 2023-10-14 18:41:50,559 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1779801.3333333333, ans=0.125 2023-10-14 18:42:28,015 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1779941.3333333333, ans=0.125 2023-10-14 18:42:28,855 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1779941.3333333333, ans=0.125 2023-10-14 18:42:42,675 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.91 vs. 
limit=15.0 2023-10-14 18:42:48,367 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.14 vs. limit=6.0 2023-10-14 18:43:06,664 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1780081.3333333333, ans=0.2 2023-10-14 18:43:12,376 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1780128.0, ans=0.2 2023-10-14 18:43:15,526 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=5.86 vs. limit=15.0 2023-10-14 18:43:16,464 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.494e+02 1.819e+02 1.998e+02 2.182e+02 3.079e+02, threshold=3.996e+02, percent-clipped=0.0 2023-10-14 18:43:18,414 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1780128.0, ans=0.025 2023-10-14 18:43:25,129 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1780174.6666666667, ans=0.0 2023-10-14 18:44:02,490 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1780314.6666666667, ans=0.125 2023-10-14 18:44:03,508 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1780361.3333333333, ans=0.125 2023-10-14 18:44:16,705 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1780408.0, ans=0.2 2023-10-14 18:44:23,038 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.26 vs. 
limit=10.0 2023-10-14 18:44:27,685 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1780408.0, ans=0.0 2023-10-14 18:44:42,671 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1780501.3333333333, ans=0.1 2023-10-14 18:45:15,395 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.570e+02 1.826e+02 1.980e+02 2.187e+02 3.955e+02, threshold=3.960e+02, percent-clipped=0.0 2023-10-14 18:45:20,710 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1780641.3333333333, ans=0.0 2023-10-14 18:45:59,226 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1780781.3333333333, ans=0.0 2023-10-14 18:46:09,354 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1780828.0, ans=0.125 2023-10-14 18:46:10,700 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1780828.0, ans=0.125 2023-10-14 18:46:10,797 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1780828.0, ans=0.07 2023-10-14 18:46:13,776 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1780828.0, ans=0.2 2023-10-14 18:46:20,716 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1780874.6666666667, ans=0.0 2023-10-14 18:46:43,717 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1780968.0, ans=0.2 2023-10-14 18:46:43,725 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1780968.0, ans=0.1 2023-10-14 18:46:51,222 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.45 vs. limit=10.0 2023-10-14 18:46:53,393 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1781014.6666666667, ans=0.1 2023-10-14 18:47:08,328 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1781061.3333333333, ans=0.05 2023-10-14 18:47:11,842 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.475e+02 1.773e+02 1.943e+02 2.150e+02 2.610e+02, threshold=3.886e+02, percent-clipped=0.0 2023-10-14 18:47:21,288 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1781108.0, ans=0.125 2023-10-14 18:47:24,397 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.22 vs. limit=15.0 2023-10-14 18:47:39,961 INFO [train.py:1031] (3/4) Epoch 28, batch 13000, loss[loss=0.1884, simple_loss=0.2753, pruned_loss=0.05074, over 15923.00 frames. ], tot_loss[loss=0.1848, simple_loss=0.2777, pruned_loss=0.046, over 32726157.58 frames. 
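[In the optim.py:471 lines, the printed threshold consistently equals Clipping_scale times the middle grad-norm quartile, e.g. 2.0 * 1.943e+02 = 3.886e+02 in the record just above. A minimal sketch of that bookkeeping, assuming a window of recent per-step gradient norms; the helper name and windowing are illustrative, not icefall's actual optimizer code:]

```python
import torch

def grad_norm_stats(grad_norms: torch.Tensor, clipping_scale: float = 2.0):
    """Hypothetical sketch: quartiles of recent gradient norms, with the
    clipping threshold taken as clipping_scale * median -- matching the
    pattern threshold ~= 2.0 * (middle quartile) in the optim.py lines."""
    q = torch.quantile(grad_norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    threshold = clipping_scale * q[2]
    percent_clipped = (grad_norms > threshold).float().mean() * 100
    print(
        "grad-norm quartiles "
        + " ".join(f"{v:.3e}" for v in q.tolist())
        + f", threshold={threshold:.3e}, percent-clipped={percent_clipped:.1f}"
    )
    return threshold

# With the maximum quartile (~3.5e+02) below threshold (~4.0e+02), nothing
# is clipped -- consistent with percent-clipped=0.0 throughout this section.
```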
], batch size: 43, lr: 1.21e-03, grad_scale: 32.0 2023-10-14 18:47:48,256 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.99 vs. limit=15.0 2023-10-14 18:47:58,846 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1781248.0, ans=0.0 2023-10-14 18:48:12,569 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1781294.6666666667, ans=0.2 2023-10-14 18:48:12,614 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1781294.6666666667, ans=0.0 2023-10-14 18:49:15,442 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.76 vs. limit=6.0 2023-10-14 18:49:22,179 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1781528.0, ans=0.1 2023-10-14 18:49:28,371 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 18:49:31,548 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.595e+02 1.855e+02 2.068e+02 2.346e+02 3.226e+02, threshold=4.136e+02, percent-clipped=0.0 2023-10-14 18:49:40,107 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.99 vs. limit=22.5 2023-10-14 18:49:42,905 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1781574.6666666667, ans=0.2 2023-10-14 18:49:59,956 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.63 vs. limit=15.0 2023-10-14 18:50:03,813 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1781668.0, ans=0.0 2023-10-14 18:50:03,870 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1781668.0, ans=0.125 2023-10-14 18:50:23,398 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1781761.3333333333, ans=0.125 2023-10-14 18:50:46,494 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1781808.0, ans=0.1 2023-10-14 18:50:51,139 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1781854.6666666667, ans=0.0 2023-10-14 18:50:54,635 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.27 vs. limit=10.0 2023-10-14 18:51:18,648 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.26 vs. 
limit=10.0 2023-10-14 18:51:23,563 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1781948.0, ans=0.125 2023-10-14 18:51:37,383 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.513e+02 1.859e+02 2.059e+02 2.320e+02 3.388e+02, threshold=4.118e+02, percent-clipped=0.0 2023-10-14 18:52:12,558 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1782134.6666666667, ans=0.035 2023-10-14 18:52:22,465 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1782181.3333333333, ans=0.04949747468305833 2023-10-14 18:52:50,170 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1782274.6666666667, ans=10.0 2023-10-14 18:52:51,351 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1782274.6666666667, ans=0.1 2023-10-14 18:53:18,941 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_ff3.min_abs, batch_count=1782368.0, ans=0.2 2023-10-14 18:53:34,990 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1782414.6666666667, ans=0.125 2023-10-14 18:53:47,473 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.502e+02 1.790e+02 1.901e+02 2.120e+02 3.380e+02, threshold=3.803e+02, percent-clipped=0.0 2023-10-14 18:53:48,913 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 18:53:54,547 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1782508.0, ans=0.0 2023-10-14 18:54:41,425 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.64 vs. limit=22.5 2023-10-14 18:54:42,361 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.71 vs. limit=15.0 2023-10-14 18:54:43,207 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1782694.6666666667, ans=0.2 2023-10-14 18:54:58,932 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1782741.3333333333, ans=0.125 2023-10-14 18:55:12,584 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1782788.0, ans=0.125 2023-10-14 18:55:24,993 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.09 vs. limit=15.0 2023-10-14 18:55:38,330 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1782928.0, ans=0.1 2023-10-14 18:55:50,322 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.531e+02 1.833e+02 2.038e+02 2.248e+02 3.459e+02, threshold=4.075e+02, percent-clipped=0.0 2023-10-14 18:56:09,780 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.80 vs. 
limit=15.0 2023-10-14 18:56:30,487 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1783068.0, ans=0.125 2023-10-14 18:56:44,527 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1783161.3333333333, ans=0.125 2023-10-14 18:57:03,203 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1783208.0, ans=0.125 2023-10-14 18:57:03,366 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1783208.0, ans=0.125 2023-10-14 18:57:11,978 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1783254.6666666667, ans=0.125 2023-10-14 18:57:17,859 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.35 vs. limit=15.0 2023-10-14 18:57:23,219 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1783301.3333333333, ans=0.2 2023-10-14 18:57:41,626 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1783348.0, ans=0.125 2023-10-14 18:57:57,890 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.553e+02 1.838e+02 2.011e+02 2.192e+02 2.929e+02, threshold=4.023e+02, percent-clipped=0.0 2023-10-14 18:58:03,156 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1783441.3333333333, ans=0.125 2023-10-14 18:58:03,224 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1783441.3333333333, ans=0.1 2023-10-14 18:58:23,843 INFO [train.py:1031] (3/4) Epoch 28, batch 13500, loss[loss=0.1882, simple_loss=0.2516, pruned_loss=0.06235, over 12378.00 frames. ], tot_loss[loss=0.1847, simple_loss=0.2773, pruned_loss=0.046, over 32743030.78 frames. 
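[The scaling.py:199 lines dump module constants (dropout_p, balancer prob, skip_rate, scale_min, ...) whose current value `ans` is a function of batch_count. A minimal sketch of such a batch-count-keyed schedule, assuming piecewise-linear interpolation with the value held flat past the last breakpoint; only the name=..., batch_count=..., ans=... logging pattern comes from the log, the class below is illustrative:]

```python
class PiecewiseSchedule:
    """Sketch of a float scheduled on batch count: linear interpolation
    between (batch_count, value) breakpoints, clamped beyond the ends --
    which would explain why e.g. dropout_p logs as ans=0.1 and the skip
    rates as ans=0.0 this late in training (batch_count ~ 1.78e6)."""

    def __init__(self, *points):
        self.points = sorted(points)  # e.g. (0, 0.3), (20000, 0.1)

    def __call__(self, batch_count: float) -> float:
        pts = self.points
        if batch_count <= pts[0][0]:
            return pts[0][1]
        if batch_count >= pts[-1][0]:
            return pts[-1][1]
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if x0 <= batch_count <= x1:
                t = (batch_count - x0) / (x1 - x0)
                return y0 + t * (y1 - y0)

dropout_p = PiecewiseSchedule((0.0, 0.3), (20000.0, 0.1))  # breakpoints assumed
print(dropout_p(1783208.0))  # -> 0.1, matching the logged ans=0.1
```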
], batch size: 440, lr: 1.21e-03, grad_scale: 16.0 2023-10-14 18:58:25,123 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1783534.6666666667, ans=0.125 2023-10-14 18:58:28,718 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1783534.6666666667, ans=0.2 2023-10-14 18:58:40,979 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 18:58:46,051 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1783581.3333333333, ans=0.125 2023-10-14 18:59:13,439 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1783721.3333333333, ans=0.125 2023-10-14 18:59:14,444 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1783721.3333333333, ans=0.0 2023-10-14 18:59:18,987 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 18:59:54,917 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1783861.3333333333, ans=0.0 2023-10-14 19:00:00,010 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 19:00:03,136 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.436e+02 1.891e+02 2.072e+02 2.277e+02 3.149e+02, threshold=4.144e+02, percent-clipped=0.0 2023-10-14 19:00:09,250 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1783908.0, ans=0.1 2023-10-14 19:00:14,707 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.99 vs. limit=22.5 2023-10-14 19:00:29,701 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1784001.3333333333, ans=0.125 2023-10-14 19:00:37,526 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1784001.3333333333, ans=0.0 2023-10-14 19:00:42,436 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1784048.0, ans=0.125 2023-10-14 19:00:44,206 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1784048.0, ans=0.125 2023-10-14 19:01:10,627 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1784141.3333333333, ans=0.0 2023-10-14 19:01:12,651 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1784188.0, ans=0.125 2023-10-14 19:01:13,561 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1784188.0, ans=0.125 2023-10-14 19:01:21,190 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1784188.0, ans=0.125 2023-10-14 19:02:05,255 INFO [train.py:1031] (3/4) Epoch 29, batch 0, loss[loss=0.1519, simple_loss=0.2475, pruned_loss=0.02812, over 16873.00 frames. 
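[At the first batch of an epoch, tot_loss coincides with the single-batch loss (both 0.1519 over 16873.00 frames in the Epoch 29, batch 0 record continuing below), while mid-epoch it is quoted over a roughly constant ~3.27e7 frames. That is consistent with a frame-weighted, exponentially decayed running average; a sketch under that assumption, with the class name and decay value purely illustrative:]

```python
class RunningLoss:
    """Assumed convention, not train.py itself: tot_loss as a frame-weighted,
    exponentially decayed average. The decayed frame count saturates near a
    constant (~3.27e7 here), and at batch 0 tot_loss equals the batch loss."""

    def __init__(self, decay: float = 0.9995):
        self.decay = decay
        self.loss_sum = 0.0
        self.frames = 0.0

    def update(self, batch_loss: float, batch_frames: float):
        self.loss_sum = self.decay * self.loss_sum + batch_loss * batch_frames
        self.frames = self.decay * self.frames + batch_frames

    @property
    def tot_loss(self) -> float:
        return self.loss_sum / self.frames

tracker = RunningLoss()
tracker.update(0.1519, 16873.0)
print(tracker.tot_loss)  # 0.1519 at batch 0, as in the record below
```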
], tot_loss[loss=0.1519, simple_loss=0.2475, pruned_loss=0.02812, over 16873.00 frames. ], batch size: 104, lr: 1.19e-03, grad_scale: 32.0 2023-10-14 19:02:05,257 INFO [train.py:1054] (3/4) Computing validation loss 2023-10-14 19:02:11,281 INFO [zipformer.py:1853] (3/4) name=encoder.encoders.5.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([2.5606, 2.3376, 1.9225, 3.6456], device='cuda:3') 2023-10-14 19:02:11,916 INFO [zipformer.py:1853] (3/4) name=encoder.encoders.3.encoder.layers.3.self_attn_weights, attn_weights_entropy = tensor([2.2277, 3.6564, 3.1553, 3.4815, 2.8333, 2.6203, 3.7252, 3.0089], device='cuda:3') 2023-10-14 19:02:13,484 INFO [train.py:1063] (3/4) Epoch 29, validation: loss=0.2131, simple_loss=0.2995, pruned_loss=0.06338, over 1020973.00 frames. 2023-10-14 19:02:13,485 INFO [train.py:1064] (3/4) Maximum memory allocated so far is 16953MB 2023-10-14 19:02:15,321 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.86 vs. limit=15.0 2023-10-14 19:02:45,436 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.545e+02 1.901e+02 2.123e+02 2.345e+02 3.738e+02, threshold=4.247e+02, percent-clipped=0.0 2023-10-14 19:02:48,640 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.31 vs. limit=10.0 2023-10-14 19:03:29,771 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1784491.3333333333, ans=0.2 2023-10-14 19:03:48,575 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 19:03:56,379 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1784584.6666666667, ans=0.125 2023-10-14 19:04:19,608 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1784678.0, ans=0.0 2023-10-14 19:04:21,828 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=11.63 vs. limit=22.5 2023-10-14 19:04:58,092 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.387e+02 1.817e+02 1.909e+02 2.080e+02 2.817e+02, threshold=3.818e+02, percent-clipped=0.0 2023-10-14 19:05:29,595 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1784911.3333333333, ans=0.125 2023-10-14 19:05:53,170 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.06 vs. limit=12.0 2023-10-14 19:06:10,512 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.48 vs. 
limit=15.0 2023-10-14 19:06:20,042 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1785144.6666666667, ans=0.0 2023-10-14 19:06:23,994 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1785144.6666666667, ans=0.1 2023-10-14 19:07:01,161 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1785284.6666666667, ans=0.125 2023-10-14 19:07:03,241 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.473e+02 1.822e+02 2.063e+02 2.222e+02 3.116e+02, threshold=4.126e+02, percent-clipped=0.0 2023-10-14 19:07:21,347 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1785378.0, ans=0.07 2023-10-14 19:07:27,473 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1785378.0, ans=0.1 2023-10-14 19:08:15,566 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 19:08:29,301 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.80 vs. limit=6.0 2023-10-14 19:08:44,023 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.78 vs. limit=15.0 2023-10-14 19:08:47,778 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1785658.0, ans=0.1 2023-10-14 19:09:04,890 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=1785704.6666666667, ans=0.05 2023-10-14 19:09:08,387 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 19:09:09,427 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1785751.3333333333, ans=0.2 2023-10-14 19:09:18,901 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.477e+02 1.795e+02 1.962e+02 2.085e+02 4.412e+02, threshold=3.924e+02, percent-clipped=1.0 2023-10-14 19:10:03,415 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1785938.0, ans=0.0 2023-10-14 19:10:15,199 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1785984.6666666667, ans=0.125 2023-10-14 19:10:18,317 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1785984.6666666667, ans=0.125 2023-10-14 19:10:19,346 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1785984.6666666667, ans=0.1 2023-10-14 19:10:22,748 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1785984.6666666667, ans=0.125 2023-10-14 19:10:35,461 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1786031.3333333333, ans=0.0 2023-10-14 19:10:40,793 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1786078.0, ans=0.125 2023-10-14 19:10:53,676 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 19:10:54,753 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1786124.6666666667, ans=0.125 2023-10-14 19:11:05,573 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1786171.3333333333, ans=0.125 2023-10-14 19:11:05,707 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1786171.3333333333, ans=0.0 2023-10-14 19:11:07,297 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1786171.3333333333, ans=0.125 2023-10-14 19:11:13,138 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1786171.3333333333, ans=0.125 2023-10-14 19:11:17,848 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1786218.0, ans=0.1 2023-10-14 19:11:25,933 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.602e+02 1.935e+02 2.088e+02 2.377e+02 3.287e+02, threshold=4.176e+02, percent-clipped=0.0 2023-10-14 19:12:03,393 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.44 vs. limit=15.0 2023-10-14 19:12:13,386 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1786358.0, ans=0.125 2023-10-14 19:12:25,980 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1786404.6666666667, ans=0.0 2023-10-14 19:12:44,563 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1786498.0, ans=0.0 2023-10-14 19:12:49,305 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1786498.0, ans=0.0 2023-10-14 19:13:08,916 INFO [train.py:1031] (3/4) Epoch 29, batch 500, loss[loss=0.1889, simple_loss=0.2739, pruned_loss=0.05196, over 16942.00 frames. ], tot_loss[loss=0.1854, simple_loss=0.278, pruned_loss=0.04643, over 7270791.03 frames. ], batch size: 77, lr: 1.19e-03, grad_scale: 32.0 2023-10-14 19:13:32,384 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1786638.0, ans=0.1 2023-10-14 19:13:33,285 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1786684.6666666667, ans=0.0 2023-10-14 19:13:41,152 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.548e+02 1.826e+02 2.008e+02 2.215e+02 3.537e+02, threshold=4.016e+02, percent-clipped=0.0 2023-10-14 19:13:42,605 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1786684.6666666667, ans=0.0 2023-10-14 19:13:50,123 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.06 vs. 
limit=15.0 2023-10-14 19:14:16,237 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 19:14:23,429 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1786871.3333333333, ans=0.0 2023-10-14 19:14:58,972 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1786964.6666666667, ans=0.0 2023-10-14 19:15:05,838 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1787011.3333333333, ans=0.1 2023-10-14 19:15:20,407 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.65 vs. limit=10.0 2023-10-14 19:15:30,405 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1787104.6666666667, ans=0.1 2023-10-14 19:15:47,527 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1787151.3333333333, ans=0.1 2023-10-14 19:15:48,083 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.524e+02 1.877e+02 2.016e+02 2.278e+02 3.167e+02, threshold=4.031e+02, percent-clipped=0.0 2023-10-14 19:16:00,033 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1787198.0, ans=0.1 2023-10-14 19:16:30,994 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1787338.0, ans=0.2 2023-10-14 19:16:31,015 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_na.min_abs, batch_count=1787338.0, ans=0.02 2023-10-14 19:16:44,139 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1787384.6666666667, ans=0.0 2023-10-14 19:17:05,348 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=1787431.3333333333, ans=0.5 2023-10-14 19:17:26,753 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1787524.6666666667, ans=0.125 2023-10-14 19:17:26,829 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1787524.6666666667, ans=0.125 2023-10-14 19:17:34,346 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1787571.3333333333, ans=0.0 2023-10-14 19:17:53,319 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.545e+02 1.911e+02 2.014e+02 2.243e+02 3.292e+02, threshold=4.028e+02, percent-clipped=0.0 2023-10-14 19:17:53,795 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 19:17:54,145 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.96 vs. limit=15.0 2023-10-14 19:18:04,840 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.46 vs. 
limit=15.0 2023-10-14 19:18:55,003 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.98 vs. limit=15.0 2023-10-14 19:19:13,670 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.15 vs. limit=15.0 2023-10-14 19:19:47,942 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1787991.3333333333, ans=0.125 2023-10-14 19:20:20,263 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.607e+02 1.945e+02 2.097e+02 2.352e+02 3.649e+02, threshold=4.194e+02, percent-clipped=0.0 2023-10-14 19:20:40,922 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1788131.3333333333, ans=0.0 2023-10-14 19:20:58,608 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1788224.6666666667, ans=0.125 2023-10-14 19:21:13,316 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1788271.3333333333, ans=0.1 2023-10-14 19:21:29,327 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1788318.0, ans=0.125 2023-10-14 19:21:50,773 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1788411.3333333333, ans=0.0 2023-10-14 19:21:59,063 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1788411.3333333333, ans=0.125 2023-10-14 19:22:10,341 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.10 vs. 
limit=15.0 2023-10-14 19:22:37,949 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.634e+02 1.825e+02 2.024e+02 2.199e+02 2.831e+02, threshold=4.048e+02, percent-clipped=0.0 2023-10-14 19:22:48,182 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1788598.0, ans=0.125 2023-10-14 19:23:01,408 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1788644.6666666667, ans=0.0 2023-10-14 19:23:39,678 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1788784.6666666667, ans=0.125 2023-10-14 19:23:41,664 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1788784.6666666667, ans=0.0 2023-10-14 19:23:42,577 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1788784.6666666667, ans=0.05 2023-10-14 19:23:48,703 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1788784.6666666667, ans=0.1 2023-10-14 19:23:48,713 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1788784.6666666667, ans=0.1 2023-10-14 19:23:53,452 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1788831.3333333333, ans=0.0 2023-10-14 19:23:55,355 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1788831.3333333333, ans=0.1 2023-10-14 19:24:06,194 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1788878.0, ans=0.125 2023-10-14 19:24:19,183 INFO [train.py:1031] (3/4) Epoch 29, batch 1000, loss[loss=0.1875, simple_loss=0.2807, pruned_loss=0.04714, over 16932.00 frames. ], tot_loss[loss=0.186, simple_loss=0.2787, pruned_loss=0.04662, over 12934114.75 frames. ], batch size: 110, lr: 1.19e-03, grad_scale: 32.0 2023-10-14 19:24:27,446 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.45 vs. limit=22.5 2023-10-14 19:24:45,395 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.44 vs. limit=12.0 2023-10-14 19:24:49,606 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.544e+02 1.765e+02 1.924e+02 2.151e+02 2.640e+02, threshold=3.849e+02, percent-clipped=0.0 2023-10-14 19:24:51,742 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1789018.0, ans=0.1 2023-10-14 19:24:57,535 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.64 vs. limit=15.0 2023-10-14 19:25:03,911 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1789064.6666666667, ans=0.0 2023-10-14 19:25:27,991 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.30 vs. 
limit=15.0 2023-10-14 19:25:57,802 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1789298.0, ans=0.125 2023-10-14 19:26:05,157 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1789298.0, ans=0.2 2023-10-14 19:26:11,718 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1789344.6666666667, ans=0.0 2023-10-14 19:26:17,081 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1789344.6666666667, ans=0.125 2023-10-14 19:26:20,003 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1789344.6666666667, ans=0.015 2023-10-14 19:26:44,605 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1789438.0, ans=0.125 2023-10-14 19:26:48,684 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1789438.0, ans=0.1 2023-10-14 19:26:57,241 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1789484.6666666667, ans=0.125 2023-10-14 19:26:59,353 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1789484.6666666667, ans=0.125 2023-10-14 19:26:59,846 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.528e+02 1.851e+02 2.013e+02 2.294e+02 3.227e+02, threshold=4.026e+02, percent-clipped=0.0 2023-10-14 19:27:07,625 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1789531.3333333333, ans=0.125 2023-10-14 19:27:18,281 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1789578.0, ans=0.0 2023-10-14 19:27:40,993 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1789624.6666666667, ans=0.125 2023-10-14 19:27:42,080 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1789624.6666666667, ans=0.0 2023-10-14 19:28:11,034 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.44 vs. limit=15.0 2023-10-14 19:28:12,804 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.04 vs. 
limit=12.0 2023-10-14 19:29:25,275 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.430e+02 1.745e+02 1.905e+02 2.132e+02 2.871e+02, threshold=3.810e+02, percent-clipped=0.0 2023-10-14 19:29:31,463 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1789998.0, ans=0.2 2023-10-14 19:29:42,816 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1790044.6666666667, ans=0.1 2023-10-14 19:29:48,826 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1790044.6666666667, ans=0.125 2023-10-14 19:29:51,040 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1790044.6666666667, ans=0.125 2023-10-14 19:31:07,326 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1790324.6666666667, ans=0.1 2023-10-14 19:31:36,529 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.511e+02 1.766e+02 1.912e+02 2.084e+02 3.000e+02, threshold=3.824e+02, percent-clipped=0.0 2023-10-14 19:32:02,419 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1790511.3333333333, ans=0.1 2023-10-14 19:32:15,375 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_na.min_abs, batch_count=1790604.6666666667, ans=0.02 2023-10-14 19:32:21,547 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1790604.6666666667, ans=0.09899494936611666 2023-10-14 19:32:27,956 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1790651.3333333333, ans=0.125 2023-10-14 19:32:39,871 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1790698.0, ans=0.125 2023-10-14 19:32:42,115 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1790698.0, ans=0.125 2023-10-14 19:32:49,696 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1790698.0, ans=0.0 2023-10-14 19:33:11,643 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.25 vs. limit=15.0 2023-10-14 19:33:21,384 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1790838.0, ans=0.1 2023-10-14 19:33:40,329 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.395e+02 1.783e+02 1.989e+02 2.163e+02 2.964e+02, threshold=3.978e+02, percent-clipped=0.0 2023-10-14 19:34:09,529 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1790978.0, ans=0.125 2023-10-14 19:34:14,080 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1791024.6666666667, ans=0.0 2023-10-14 19:35:27,078 INFO [train.py:1031] (3/4) Epoch 29, batch 1500, loss[loss=0.1751, simple_loss=0.2653, pruned_loss=0.0425, over 16047.00 frames. ], tot_loss[loss=0.1844, simple_loss=0.277, pruned_loss=0.04585, over 17341232.53 frames. 
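[Each scaling.py:979 line compares a per-module whitening metric against a scheduled limit ("metric=X vs. limit=Y"). A hedged sketch of one plausible such metric, normalized so a perfectly white (isotropic) channel covariance scores 1.0 and larger values indicate a more lopsided spectrum; the exact formula used by scaling.py is an assumption here, only the grouping and the metric-vs-limit comparison come from the log:]

```python
import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> torch.Tensor:
    """Assumed whiteness measure for activations x of shape (frames, channels):
    split channels into groups, form each group's covariance C, and report
    dim * trace(C @ C) / trace(C)**2 averaged over groups. By Cauchy-Schwarz
    this is >= 1.0, with equality for an isotropic (fully white) covariance."""
    frames, channels = x.shape
    x = x.reshape(frames, num_groups, channels // num_groups)
    x = x - x.mean(dim=0, keepdim=True)
    metrics = []
    for g in range(num_groups):
        xg = x[:, g, :]
        cov = xg.T @ xg / frames
        dim = cov.shape[0]
        metrics.append((cov @ cov).trace() * dim / cov.trace() ** 2)
    return torch.stack(metrics).mean()

# A whitening module would inject a corrective gradient only when this metric
# exceeds its scheduled limit; e.g. metric=8.30 vs. limit=15.0 above -> no-op.
```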
], batch size: 43, lr: 1.19e-03, grad_scale: 16.0 2023-10-14 19:35:36,104 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1791258.0, ans=0.1 2023-10-14 19:35:55,253 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 19:36:14,328 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.551e+02 1.832e+02 2.009e+02 2.326e+02 3.322e+02, threshold=4.019e+02, percent-clipped=0.0 2023-10-14 19:36:36,997 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1791444.6666666667, ans=0.1 2023-10-14 19:36:41,903 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1791444.6666666667, ans=0.125 2023-10-14 19:36:42,748 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1791444.6666666667, ans=0.0 2023-10-14 19:36:50,196 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.42 vs. limit=12.0 2023-10-14 19:36:56,396 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1791491.3333333333, ans=0.0 2023-10-14 19:37:18,080 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.88 vs. limit=15.0 2023-10-14 19:37:32,434 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1791631.3333333333, ans=0.0 2023-10-14 19:37:39,946 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1791631.3333333333, ans=0.125 2023-10-14 19:37:41,345 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=9.39 vs. 
limit=22.5 2023-10-14 19:37:53,255 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1791678.0, ans=0.0 2023-10-14 19:37:55,669 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1791678.0, ans=0.125 2023-10-14 19:37:59,589 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1791678.0, ans=0.0 2023-10-14 19:38:03,654 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1791724.6666666667, ans=0.1 2023-10-14 19:38:08,307 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=1791724.6666666667, ans=0.05 2023-10-14 19:38:50,647 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.582e+02 1.903e+02 2.166e+02 2.383e+02 3.382e+02, threshold=4.333e+02, percent-clipped=0.0 2023-10-14 19:39:17,411 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1791911.3333333333, ans=0.125 2023-10-14 19:39:20,730 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1791911.3333333333, ans=0.1 2023-10-14 19:40:02,250 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=1792004.6666666667, ans=0.05 2023-10-14 19:40:11,725 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=11.01 vs. limit=22.5 2023-10-14 19:40:34,660 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1792098.0, ans=0.0 2023-10-14 19:41:10,624 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1792238.0, ans=0.125 2023-10-14 19:41:18,185 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.97 vs. limit=15.0 2023-10-14 19:41:29,034 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1792284.6666666667, ans=0.125 2023-10-14 19:41:38,204 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.589e+02 1.817e+02 1.967e+02 2.158e+02 2.658e+02, threshold=3.933e+02, percent-clipped=0.0 2023-10-14 19:41:52,372 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1792331.3333333333, ans=0.125 2023-10-14 19:41:59,307 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.17 vs. limit=22.5 2023-10-14 19:42:02,329 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.73 vs. 
limit=15.0 2023-10-14 19:42:06,480 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1792378.0, ans=0.125 2023-10-14 19:42:24,999 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1792424.6666666667, ans=0.125 2023-10-14 19:42:35,401 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.88 vs. limit=22.5 2023-10-14 19:42:44,026 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1792518.0, ans=0.2 2023-10-14 19:42:51,432 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 19:43:02,692 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1792518.0, ans=0.125 2023-10-14 19:43:13,745 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=1792564.6666666667, ans=0.025 2023-10-14 19:43:16,953 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1792564.6666666667, ans=0.125 2023-10-14 19:43:35,988 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1792611.3333333333, ans=0.125 2023-10-14 19:43:36,284 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.94 vs. limit=15.0 2023-10-14 19:43:38,703 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1792611.3333333333, ans=0.125 2023-10-14 19:43:46,769 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=1792658.0, ans=15.0 2023-10-14 19:43:48,305 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1792658.0, ans=0.1 2023-10-14 19:43:49,487 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.46 vs. limit=10.0 2023-10-14 19:43:52,757 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1792658.0, ans=0.125 2023-10-14 19:44:45,462 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.560e+02 1.852e+02 2.057e+02 2.251e+02 3.007e+02, threshold=4.114e+02, percent-clipped=0.0 2023-10-14 19:44:54,839 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1792798.0, ans=0.1 2023-10-14 19:45:18,894 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1792844.6666666667, ans=0.125 2023-10-14 19:45:27,675 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.25 vs. 
limit=10.0 2023-10-14 19:45:51,717 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1792891.3333333333, ans=0.125 2023-10-14 19:48:12,644 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1793218.0, ans=0.1 2023-10-14 19:48:14,612 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.46 vs. limit=22.5 2023-10-14 19:48:18,652 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.512e+02 1.855e+02 2.035e+02 2.252e+02 3.047e+02, threshold=4.071e+02, percent-clipped=0.0 2023-10-14 19:49:32,021 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.19 vs. limit=15.0 2023-10-14 19:49:44,348 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.27 vs. limit=15.0 2023-10-14 19:49:59,660 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1793451.3333333333, ans=0.2 2023-10-14 19:50:12,868 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1793451.3333333333, ans=0.025 2023-10-14 19:50:27,831 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.25 vs. limit=22.5 2023-10-14 19:50:28,416 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1793498.0, ans=0.0 2023-10-14 19:50:41,160 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.31 vs. limit=10.0 2023-10-14 19:51:14,765 INFO [train.py:1031] (3/4) Epoch 29, batch 2000, loss[loss=0.1833, simple_loss=0.2876, pruned_loss=0.0395, over 16890.00 frames. ], tot_loss[loss=0.185, simple_loss=0.2777, pruned_loss=0.04616, over 20742014.93 frames. ], batch size: 165, lr: 1.19e-03, grad_scale: 32.0 2023-10-14 19:51:18,227 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 19:51:50,182 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-14 19:52:10,214 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.60 vs. limit=22.5 2023-10-14 19:52:14,692 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-14 19:52:22,312 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.584e+02 1.833e+02 2.026e+02 2.265e+02 2.973e+02, threshold=4.051e+02, percent-clipped=0.0 2023-10-14 19:52:51,944 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1793778.0, ans=0.0 2023-10-14 19:53:10,988 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.20 vs. 
limit=15.0 2023-10-14 19:53:18,835 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1793778.0, ans=0.0 2023-10-14 19:53:21,778 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1793824.6666666667, ans=0.125 2023-10-14 19:53:35,337 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1793824.6666666667, ans=0.2 2023-10-14 19:54:13,623 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.75 vs. limit=15.0 2023-10-14 19:54:20,028 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1793918.0, ans=0.125 2023-10-14 19:54:43,902 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1793964.6666666667, ans=0.125 2023-10-14 19:54:47,541 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1794011.3333333333, ans=0.09899494936611666 2023-10-14 19:55:01,101 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1794011.3333333333, ans=0.125 2023-10-14 19:55:01,154 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1794011.3333333333, ans=0.125 2023-10-14 19:55:02,617 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.98 vs. limit=10.0 2023-10-14 19:55:05,721 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1794011.3333333333, ans=10.0 2023-10-14 19:55:13,825 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1794058.0, ans=0.1 2023-10-14 19:55:41,578 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1794104.6666666667, ans=0.1 2023-10-14 19:56:36,610 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.483e+02 1.753e+02 2.006e+02 2.175e+02 3.076e+02, threshold=4.012e+02, percent-clipped=0.0 2023-10-14 19:56:42,545 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1794198.0, ans=0.125 2023-10-14 19:56:52,137 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.31 vs. limit=15.0 2023-10-14 19:56:53,192 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.99 vs. 
limit=15.0 2023-10-14 19:57:05,359 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1794244.6666666667, ans=0.125 2023-10-14 19:57:33,098 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1794291.3333333333, ans=0.0 2023-10-14 19:57:52,010 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1794338.0, ans=0.125 2023-10-14 19:57:52,097 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1794338.0, ans=0.125 2023-10-14 19:58:11,507 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.11 vs. limit=15.0 2023-10-14 19:58:20,340 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1794384.6666666667, ans=0.125 2023-10-14 19:58:59,276 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.02 vs. limit=15.0 2023-10-14 19:59:10,894 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1794618.0, ans=0.125 2023-10-14 19:59:18,159 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1794618.0, ans=0.0 2023-10-14 19:59:18,957 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.519e+02 1.884e+02 2.106e+02 2.348e+02 3.386e+02, threshold=4.211e+02, percent-clipped=0.0 2023-10-14 19:59:25,247 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1794664.6666666667, ans=0.2 2023-10-14 19:59:30,020 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1794711.3333333333, ans=0.1 2023-10-14 19:59:36,600 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.09 vs. 
limit=12.0 2023-10-14 19:59:47,882 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1794758.0, ans=0.2 2023-10-14 19:59:58,278 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1794804.6666666667, ans=0.5 2023-10-14 20:00:09,171 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1794851.3333333333, ans=0.0 2023-10-14 20:00:31,483 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1794944.6666666667, ans=0.125 2023-10-14 20:00:52,238 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1795038.0, ans=0.0 2023-10-14 20:00:58,187 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1795084.6666666667, ans=0.0 2023-10-14 20:00:59,852 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1795084.6666666667, ans=0.125 2023-10-14 20:01:05,722 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1795084.6666666667, ans=0.2 2023-10-14 20:01:08,178 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.641e+02 1.861e+02 2.071e+02 2.265e+02 3.523e+02, threshold=4.142e+02, percent-clipped=0.0 2023-10-14 20:01:24,259 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1795178.0, ans=0.0 2023-10-14 20:01:29,331 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 20:01:37,625 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1795224.6666666667, ans=0.0 2023-10-14 20:01:50,667 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1795271.3333333333, ans=0.1 2023-10-14 20:02:26,920 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1795458.0, ans=0.2 2023-10-14 20:02:32,392 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1795458.0, ans=0.125 2023-10-14 20:02:37,495 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.42 vs. limit=6.0 2023-10-14 20:02:51,575 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=13.45 vs. 
limit=15.0 2023-10-14 20:02:58,110 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.608e+02 1.926e+02 2.110e+02 2.307e+02 3.229e+02, threshold=4.219e+02, percent-clipped=0.0 2023-10-14 20:02:59,290 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1795598.0, ans=0.09899494936611666 2023-10-14 20:03:03,544 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1795598.0, ans=0.0 2023-10-14 20:03:03,588 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1795598.0, ans=0.2 2023-10-14 20:03:05,423 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1795598.0, ans=0.1 2023-10-14 20:03:26,945 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1795691.3333333333, ans=0.0 2023-10-14 20:03:33,194 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.60 vs. limit=15.0 2023-10-14 20:03:33,791 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1795738.0, ans=0.0 2023-10-14 20:03:54,380 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1795831.3333333333, ans=0.0 2023-10-14 20:04:11,972 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1795878.0, ans=0.125 2023-10-14 20:04:14,562 INFO [train.py:1031] (3/4) Epoch 29, batch 2500, loss[loss=0.1823, simple_loss=0.282, pruned_loss=0.04133, over 16863.00 frames. ], tot_loss[loss=0.1855, simple_loss=0.2782, pruned_loss=0.04642, over 23398139.68 frames. ], batch size: 104, lr: 1.19e-03, grad_scale: 32.0 2023-10-14 20:04:44,747 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.618e+02 1.853e+02 1.981e+02 2.229e+02 2.879e+02, threshold=3.962e+02, percent-clipped=0.0 2023-10-14 20:05:30,620 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1796251.3333333333, ans=0.0 2023-10-14 20:05:44,123 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1796298.0, ans=0.125 2023-10-14 20:05:49,200 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1796344.6666666667, ans=0.125 2023-10-14 20:05:51,368 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.88 vs. 
limit=12.0 2023-10-14 20:05:54,019 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 20:06:18,918 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1796438.0, ans=0.125 2023-10-14 20:06:31,600 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1796484.6666666667, ans=0.09899494936611666 2023-10-14 20:06:33,438 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.541e+02 1.894e+02 2.084e+02 2.345e+02 3.546e+02, threshold=4.169e+02, percent-clipped=0.0 2023-10-14 20:06:35,694 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1796531.3333333333, ans=0.125 2023-10-14 20:06:39,274 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1796531.3333333333, ans=0.125 2023-10-14 20:06:55,835 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1796624.6666666667, ans=0.125 2023-10-14 20:07:07,043 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1796671.3333333333, ans=0.5 2023-10-14 20:07:49,448 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 20:08:06,002 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1796904.6666666667, ans=0.125 2023-10-14 20:08:18,292 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.39 vs. limit=15.0 2023-10-14 20:08:25,836 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=24.16 vs. limit=22.5 2023-10-14 20:08:26,214 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.566e+02 1.844e+02 1.986e+02 2.209e+02 3.111e+02, threshold=3.971e+02, percent-clipped=0.0 2023-10-14 20:08:55,091 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1797091.3333333333, ans=0.125 2023-10-14 20:09:01,525 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1797091.3333333333, ans=0.125 2023-10-14 20:09:56,821 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1797324.6666666667, ans=0.0 2023-10-14 20:09:58,483 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.69 vs. 
limit=15.0 2023-10-14 20:10:24,754 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.468e+02 1.825e+02 2.008e+02 2.155e+02 3.095e+02, threshold=4.016e+02, percent-clipped=0.0 2023-10-14 20:10:27,900 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1797464.6666666667, ans=0.125 2023-10-14 20:10:44,344 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1797511.3333333333, ans=0.0 2023-10-14 20:10:53,032 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1797558.0, ans=0.2 2023-10-14 20:11:13,040 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1797604.6666666667, ans=0.0 2023-10-14 20:11:24,364 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1797651.3333333333, ans=0.125 2023-10-14 20:11:39,213 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.93 vs. limit=15.0 2023-10-14 20:11:56,798 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1797791.3333333333, ans=0.1 2023-10-14 20:12:15,822 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.92 vs. limit=15.0 2023-10-14 20:12:25,055 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1797931.3333333333, ans=0.0 2023-10-14 20:12:25,665 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.524e+02 1.888e+02 2.037e+02 2.237e+02 3.185e+02, threshold=4.073e+02, percent-clipped=0.0 2023-10-14 20:12:36,307 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1797978.0, ans=0.125 2023-10-14 20:13:03,815 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.25 vs. limit=10.0 2023-10-14 20:13:21,733 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1798164.6666666667, ans=0.0 2023-10-14 20:13:37,331 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1798211.3333333333, ans=0.125 2023-10-14 20:13:41,654 INFO [train.py:1031] (3/4) Epoch 29, batch 3000, loss[loss=0.1858, simple_loss=0.28, pruned_loss=0.04577, over 16854.00 frames. ], tot_loss[loss=0.185, simple_loss=0.2773, pruned_loss=0.04636, over 25479150.34 frames. 
], batch size: 146, lr: 1.19e-03, grad_scale: 16.0 2023-10-14 20:14:01,116 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1798304.6666666667, ans=0.0 2023-10-14 20:14:05,944 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1798351.3333333333, ans=10.0 2023-10-14 20:14:15,684 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.517e+02 1.809e+02 1.925e+02 2.139e+02 2.753e+02, threshold=3.849e+02, percent-clipped=0.0 2023-10-14 20:15:11,713 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1798631.3333333333, ans=0.125 2023-10-14 20:15:13,465 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1798631.3333333333, ans=0.1 2023-10-14 20:15:17,332 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1798631.3333333333, ans=0.125 2023-10-14 20:15:19,801 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=1798678.0, ans=0.07 2023-10-14 20:15:23,664 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.26 vs. limit=12.0 2023-10-14 20:15:39,354 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1798724.6666666667, ans=0.125 2023-10-14 20:15:46,455 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1798724.6666666667, ans=0.125 2023-10-14 20:15:52,250 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1798771.3333333333, ans=0.125 2023-10-14 20:15:54,642 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1798771.3333333333, ans=0.125 2023-10-14 20:16:06,306 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1798818.0, ans=0.125 2023-10-14 20:16:10,691 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.548e+02 1.845e+02 2.043e+02 2.309e+02 3.390e+02, threshold=4.087e+02, percent-clipped=0.0 2023-10-14 20:16:26,713 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1798911.3333333333, ans=0.125 2023-10-14 20:16:34,364 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 20:16:45,545 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1799004.6666666667, ans=0.125 2023-10-14 20:16:50,256 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1799004.6666666667, ans=0.125 2023-10-14 20:16:53,668 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1799051.3333333333, ans=0.07 2023-10-14 20:16:53,713 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1799051.3333333333, ans=0.125 
2023-10-14 20:17:06,541 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1799098.0, ans=0.0 2023-10-14 20:17:12,143 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1799098.0, ans=0.0 2023-10-14 20:17:32,667 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.15 vs. limit=15.0 2023-10-14 20:17:54,771 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1799284.6666666667, ans=0.0 2023-10-14 20:18:05,672 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1799331.3333333333, ans=0.125 2023-10-14 20:18:07,235 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.541e+02 1.861e+02 2.022e+02 2.187e+02 2.766e+02, threshold=4.044e+02, percent-clipped=0.0 2023-10-14 20:18:27,186 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.72 vs. limit=15.0 2023-10-14 20:18:52,790 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1799471.3333333333, ans=0.0 2023-10-14 20:19:03,321 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 20:19:26,110 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1799658.0, ans=0.5 2023-10-14 20:19:34,550 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.00 vs. limit=15.0 2023-10-14 20:19:42,633 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.85 vs. limit=15.0 2023-10-14 20:19:46,387 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1799704.6666666667, ans=0.0 2023-10-14 20:19:53,209 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=1799751.3333333333, ans=22.5 2023-10-14 20:19:54,799 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.64 vs. 
limit=22.5 2023-10-14 20:19:57,525 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1799751.3333333333, ans=0.07 2023-10-14 20:20:01,455 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.600e+02 1.869e+02 2.018e+02 2.263e+02 3.119e+02, threshold=4.035e+02, percent-clipped=0.0 2023-10-14 20:20:05,560 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1799798.0, ans=0.1 2023-10-14 20:20:10,293 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1799844.6666666667, ans=0.0 2023-10-14 20:20:13,954 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1799844.6666666667, ans=0.125 2023-10-14 20:20:13,999 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=1799844.6666666667, ans=10.0 2023-10-14 20:20:18,774 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1799844.6666666667, ans=0.1 2023-10-14 20:20:21,413 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1799891.3333333333, ans=0.0 2023-10-14 20:20:31,115 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 20:21:05,722 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1800031.3333333333, ans=0.0 2023-10-14 20:21:05,825 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1800031.3333333333, ans=0.125 2023-10-14 20:21:08,016 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.89 vs. limit=6.0 2023-10-14 20:21:19,964 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.70 vs. limit=10.0 2023-10-14 20:21:33,803 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1800171.3333333333, ans=0.2 2023-10-14 20:21:41,229 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=1800171.3333333333, ans=0.05 2023-10-14 20:21:56,756 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.599e+02 1.848e+02 2.086e+02 2.308e+02 3.288e+02, threshold=4.173e+02, percent-clipped=0.0 2023-10-14 20:21:57,907 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1800264.6666666667, ans=0.2 2023-10-14 20:22:17,397 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.43 vs. 
limit=15.0 2023-10-14 20:22:21,359 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1800358.0, ans=0.0 2023-10-14 20:22:21,429 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1800358.0, ans=0.125 2023-10-14 20:22:52,847 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1800498.0, ans=0.125 2023-10-14 20:22:53,893 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1800498.0, ans=0.0 2023-10-14 20:23:12,017 INFO [train.py:1031] (3/4) Epoch 29, batch 3500, loss[loss=0.1848, simple_loss=0.2779, pruned_loss=0.04586, over 16578.00 frames. ], tot_loss[loss=0.1852, simple_loss=0.2773, pruned_loss=0.04653, over 27095940.59 frames. ], batch size: 61, lr: 1.19e-03, grad_scale: 8.0 2023-10-14 20:23:21,656 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1800591.3333333333, ans=0.125 2023-10-14 20:23:22,940 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.64 vs. limit=15.0 2023-10-14 20:23:37,948 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1800684.6666666667, ans=0.2 2023-10-14 20:23:47,071 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.663e+02 1.863e+02 2.043e+02 2.264e+02 3.629e+02, threshold=4.086e+02, percent-clipped=0.0 2023-10-14 20:24:00,963 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1800778.0, ans=0.0 2023-10-14 20:24:05,918 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.27 vs. limit=6.0 2023-10-14 20:24:06,715 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1800824.6666666667, ans=0.0 2023-10-14 20:24:10,337 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.84 vs. limit=10.0 2023-10-14 20:24:17,716 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1800871.3333333333, ans=0.0 2023-10-14 20:24:57,485 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1801011.3333333333, ans=0.125 2023-10-14 20:25:06,273 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1801011.3333333333, ans=0.07 2023-10-14 20:25:12,972 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1801058.0, ans=10.0 2023-10-14 20:25:23,145 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.00 vs. 
limit=10.0 2023-10-14 20:25:32,864 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1801151.3333333333, ans=0.125 2023-10-14 20:25:46,554 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.562e+02 1.888e+02 2.064e+02 2.407e+02 3.450e+02, threshold=4.128e+02, percent-clipped=0.0 2023-10-14 20:25:48,351 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1801198.0, ans=0.0 2023-10-14 20:26:08,907 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1801291.3333333333, ans=0.2 2023-10-14 20:26:45,198 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1801431.3333333333, ans=0.0 2023-10-14 20:26:52,798 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1801478.0, ans=0.2 2023-10-14 20:27:04,167 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1801524.6666666667, ans=0.0 2023-10-14 20:27:11,619 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1801571.3333333333, ans=0.0 2023-10-14 20:27:41,796 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.538e+02 1.780e+02 1.942e+02 2.129e+02 3.439e+02, threshold=3.884e+02, percent-clipped=0.0 2023-10-14 20:27:49,052 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1801711.3333333333, ans=0.125 2023-10-14 20:27:54,600 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1801711.3333333333, ans=0.125 2023-10-14 20:28:02,938 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1801758.0, ans=0.125 2023-10-14 20:28:19,190 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1801804.6666666667, ans=0.0 2023-10-14 20:28:21,767 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.33 vs. limit=10.0 2023-10-14 20:28:26,093 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1801851.3333333333, ans=0.0 2023-10-14 20:28:30,591 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1801851.3333333333, ans=0.0 2023-10-14 20:28:44,326 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.98 vs. 
limit=15.0 2023-10-14 20:29:18,314 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1802038.0, ans=0.125 2023-10-14 20:29:26,546 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1802084.6666666667, ans=0.0 2023-10-14 20:29:33,216 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1802131.3333333333, ans=0.0 2023-10-14 20:29:33,246 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1802131.3333333333, ans=0.2 2023-10-14 20:29:34,738 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.542e+02 1.799e+02 1.973e+02 2.142e+02 3.160e+02, threshold=3.947e+02, percent-clipped=0.0 2023-10-14 20:29:35,982 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1802131.3333333333, ans=0.125 2023-10-14 20:29:37,152 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.18 vs. limit=15.0 2023-10-14 20:29:41,526 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.54 vs. limit=15.0 2023-10-14 20:29:42,106 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1802178.0, ans=0.125 2023-10-14 20:30:33,300 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1802364.6666666667, ans=0.1 2023-10-14 20:30:57,823 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 20:31:08,816 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1802551.3333333333, ans=0.125 2023-10-14 20:31:22,843 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.610e+02 1.775e+02 1.963e+02 2.284e+02 3.123e+02, threshold=3.926e+02, percent-clipped=0.0 2023-10-14 20:31:27,556 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.33 vs. limit=15.0 2023-10-14 20:31:29,384 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1802644.6666666667, ans=0.1 2023-10-14 20:31:29,593 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=5.59 vs. limit=15.0 2023-10-14 20:31:39,479 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1802644.6666666667, ans=0.0 2023-10-14 20:32:01,846 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1802738.0, ans=0.0 2023-10-14 20:32:07,515 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.41 vs. 
limit=12.0 2023-10-14 20:32:18,635 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1802831.3333333333, ans=0.95 2023-10-14 20:32:18,867 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.78 vs. limit=15.0 2023-10-14 20:32:20,674 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1802831.3333333333, ans=0.0 2023-10-14 20:32:24,502 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1802878.0, ans=0.0 2023-10-14 20:32:37,526 INFO [train.py:1031] (3/4) Epoch 29, batch 4000, loss[loss=0.1892, simple_loss=0.2745, pruned_loss=0.05191, over 16672.00 frames. ], tot_loss[loss=0.1852, simple_loss=0.2772, pruned_loss=0.04655, over 28368552.00 frames. ], batch size: 56, lr: 1.18e-03, grad_scale: 32.0 2023-10-14 20:32:37,923 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1802924.6666666667, ans=0.125 2023-10-14 20:32:52,007 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1802971.3333333333, ans=0.125 2023-10-14 20:33:15,546 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.536e+02 1.884e+02 2.058e+02 2.269e+02 3.422e+02, threshold=4.116e+02, percent-clipped=0.0 2023-10-14 20:33:25,460 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 20:33:30,793 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1803111.3333333333, ans=0.0 2023-10-14 20:33:34,584 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1803158.0, ans=0.125 2023-10-14 20:33:40,245 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.70 vs. limit=10.0 2023-10-14 20:34:09,261 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1803298.0, ans=0.125 2023-10-14 20:34:14,784 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1803298.0, ans=0.1 2023-10-14 20:34:17,589 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1803298.0, ans=0.1 2023-10-14 20:34:18,768 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1803344.6666666667, ans=0.2 2023-10-14 20:34:49,823 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.71 vs. limit=15.0 2023-10-14 20:35:09,391 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.501e+02 1.932e+02 2.098e+02 2.393e+02 3.008e+02, threshold=4.196e+02, percent-clipped=0.0 2023-10-14 20:35:16,294 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.43 vs. 
limit=6.0 2023-10-14 20:35:20,377 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.42 vs. limit=12.0 2023-10-14 20:35:27,216 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1803578.0, ans=0.1 2023-10-14 20:35:29,145 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1803624.6666666667, ans=0.125 2023-10-14 20:35:30,091 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1803624.6666666667, ans=0.125 2023-10-14 20:35:38,670 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1803624.6666666667, ans=0.125 2023-10-14 20:36:01,805 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1803718.0, ans=0.1 2023-10-14 20:36:12,891 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1803718.0, ans=0.125 2023-10-14 20:36:41,969 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1803858.0, ans=0.125 2023-10-14 20:36:52,573 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1803904.6666666667, ans=0.04949747468305833 2023-10-14 20:37:06,990 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1803951.3333333333, ans=0.125 2023-10-14 20:37:18,005 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1803998.0, ans=0.125 2023-10-14 20:37:18,588 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.546e+02 1.845e+02 2.013e+02 2.174e+02 3.241e+02, threshold=4.025e+02, percent-clipped=0.0 2023-10-14 20:37:23,882 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.32 vs. limit=10.0 2023-10-14 20:37:35,849 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1804091.3333333333, ans=0.125 2023-10-14 20:37:41,268 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1804091.3333333333, ans=0.1 2023-10-14 20:37:55,220 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.90 vs. limit=22.5 2023-10-14 20:38:06,657 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1804231.3333333333, ans=0.125 2023-10-14 20:38:13,991 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1804231.3333333333, ans=0.125 2023-10-14 20:38:14,162 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.07 vs. 
limit=6.0 2023-10-14 20:38:18,944 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1804278.0, ans=0.125 2023-10-14 20:38:27,462 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1804324.6666666667, ans=0.0 2023-10-14 20:38:39,922 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1804371.3333333333, ans=0.1 2023-10-14 20:38:59,002 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.67 vs. limit=15.0 2023-10-14 20:39:06,187 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.482e+02 1.831e+02 1.976e+02 2.287e+02 2.865e+02, threshold=3.951e+02, percent-clipped=0.0 2023-10-14 20:39:49,666 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.61 vs. limit=22.5 2023-10-14 20:40:14,515 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1804744.6666666667, ans=0.0 2023-10-14 20:40:14,517 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1804744.6666666667, ans=0.0 2023-10-14 20:40:16,183 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1804744.6666666667, ans=0.0 2023-10-14 20:41:05,252 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.651e+02 1.972e+02 2.127e+02 2.320e+02 3.128e+02, threshold=4.253e+02, percent-clipped=0.0 2023-10-14 20:41:11,436 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1804931.3333333333, ans=0.2 2023-10-14 20:41:18,949 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1804978.0, ans=0.2 2023-10-14 20:41:38,425 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1805071.3333333333, ans=0.05 2023-10-14 20:41:43,127 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.24 vs. limit=15.0 2023-10-14 20:41:58,343 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=1805118.0, ans=10.0 2023-10-14 20:42:01,310 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.99 vs. limit=15.0 2023-10-14 20:42:18,080 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1805211.3333333333, ans=0.2 2023-10-14 20:42:21,735 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1805211.3333333333, ans=0.0 2023-10-14 20:42:24,005 INFO [train.py:1031] (3/4) Epoch 29, batch 4500, loss[loss=0.178, simple_loss=0.2771, pruned_loss=0.03948, over 16860.00 frames. ], tot_loss[loss=0.185, simple_loss=0.2774, pruned_loss=0.04627, over 29370572.45 frames. 
], batch size: 175, lr: 1.18e-03, grad_scale: 32.0 2023-10-14 20:43:01,112 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.503e+02 1.825e+02 1.986e+02 2.219e+02 2.817e+02, threshold=3.972e+02, percent-clipped=0.0 2023-10-14 20:43:42,743 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1805584.6666666667, ans=0.1 2023-10-14 20:43:49,877 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1805631.3333333333, ans=0.2 2023-10-14 20:43:51,838 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1805631.3333333333, ans=0.125 2023-10-14 20:44:00,459 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.79 vs. limit=15.0 2023-10-14 20:44:14,044 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1805724.6666666667, ans=0.035 2023-10-14 20:44:16,425 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1805724.6666666667, ans=0.1 2023-10-14 20:44:21,813 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=1805771.3333333333, ans=0.025 2023-10-14 20:44:27,420 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1805771.3333333333, ans=0.0 2023-10-14 20:44:30,489 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1805818.0, ans=0.025 2023-10-14 20:44:33,226 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.74 vs. limit=15.0 2023-10-14 20:44:44,837 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.590e+02 1.816e+02 1.994e+02 2.220e+02 3.314e+02, threshold=3.987e+02, percent-clipped=0.0 2023-10-14 20:44:50,440 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.81 vs. limit=6.0 2023-10-14 20:44:53,951 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.94 vs. limit=15.0 2023-10-14 20:45:08,413 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.36 vs. 
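In the Clipping_scale records, the five grad-norm quartiles are the min/25%/50%/75%/max of recently observed gradient norms, and the logged threshold is consistently Clipping_scale times the median (e.g. 2.0 x 1.986e+02 = 3.972e+02 in the record above), with percent-clipped the share of recent batches whose norm exceeded it. A hedged sketch of that bookkeeping follows; the class name and window size are assumptions, not the optim.py code.

```python
from collections import deque

import torch

class MedianGradClipper:
    """Median-based gradient clipping, as the Clipping_scale lines suggest.

    Keeps a window of recent total grad norms; the clip threshold is
    clipping_scale * median(window), and percent_clipped tracks how
    often a batch was actually clipped. Illustrative sketch only.
    """

    def __init__(self, clipping_scale: float = 2.0, window: int = 1000):
        self.scale = clipping_scale
        self.norms = deque(maxlen=window)
        self.seen = 0
        self.clipped = 0

    def clip_(self, parameters) -> float:
        params = [p for p in parameters if p.grad is not None]
        # Total grad norm across all parameters (sqrt of sum of squares).
        norm = torch.norm(torch.stack([p.grad.norm() for p in params])).item()
        self.norms.append(norm)
        self.seen += 1
        threshold = self.scale * sorted(self.norms)[len(self.norms) // 2]
        if norm > threshold:
            self.clipped += 1
            for p in params:
                p.grad.mul_(threshold / norm)
        return norm

    def percent_clipped(self) -> float:
        return 100.0 * self.clipped / max(1, self.seen)
```

In use, clip_(model.parameters()) would run once per batch, after backward() and before optimizer.step().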
limit=15.0 2023-10-14 20:45:23,799 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1806051.3333333333, ans=0.2 2023-10-14 20:45:28,287 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1806051.3333333333, ans=0.0 2023-10-14 20:45:55,821 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-14 20:46:31,174 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.574e+02 1.845e+02 2.040e+02 2.261e+02 3.021e+02, threshold=4.080e+02, percent-clipped=0.0 2023-10-14 20:46:43,775 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.34 vs. limit=15.0 2023-10-14 20:46:51,700 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.05 vs. limit=15.0 2023-10-14 20:46:54,832 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1806424.6666666667, ans=0.0 2023-10-14 20:47:07,170 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.64 vs. limit=10.0 2023-10-14 20:47:20,362 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1806564.6666666667, ans=0.1 2023-10-14 20:47:23,386 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.08 vs. limit=10.0 2023-10-14 20:47:27,662 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1806611.3333333333, ans=0.1 2023-10-14 20:48:21,701 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.495e+02 1.910e+02 2.032e+02 2.255e+02 3.187e+02, threshold=4.064e+02, percent-clipped=0.0 2023-10-14 20:48:22,154 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1806798.0, ans=0.125 2023-10-14 20:48:30,631 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.91 vs. limit=15.0 2023-10-14 20:48:38,747 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.59 vs. 
limit=22.5 2023-10-14 20:48:50,302 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1806938.0, ans=0.2 2023-10-14 20:48:57,199 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1806938.0, ans=0.125 2023-10-14 20:49:59,861 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 20:50:05,056 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1807218.0, ans=0.125 2023-10-14 20:50:12,528 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1807264.6666666667, ans=0.125 2023-10-14 20:50:16,924 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1807264.6666666667, ans=0.2 2023-10-14 20:50:17,353 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.573e+02 1.848e+02 2.133e+02 2.322e+02 2.866e+02, threshold=4.266e+02, percent-clipped=0.0 2023-10-14 20:50:25,746 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.42 vs. limit=15.0 2023-10-14 20:50:34,907 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1807358.0, ans=0.1 2023-10-14 20:50:49,220 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.15 vs. limit=15.0 2023-10-14 20:51:04,619 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 20:51:09,010 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1807498.0, ans=0.2 2023-10-14 20:51:17,163 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.45 vs. limit=15.0 2023-10-14 20:51:25,892 INFO [train.py:1031] (3/4) Epoch 29, batch 5000, loss[loss=0.1752, simple_loss=0.2449, pruned_loss=0.0528, over 12738.00 frames. ], tot_loss[loss=0.1849, simple_loss=0.2772, pruned_loss=0.04633, over 30145605.33 frames. 
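Each Whitening line compares a per-module statistic against a scheduled limit: on this reading, the metric measures how far the covariance of that module's features is from a multiple of the identity (1.0 means fully whitened, larger means more anisotropic), and the constraint only pushes back when the metric exceeds the limit; the WithLoss lines with loss-sum=0.000e+00 likewise report auxiliary penalties that contributed nothing over the interval. One way to compute such a metric, as a sketch rather than the exact scaling.py formula:

```python
import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    """Anisotropy of the feature covariance; 1.0 means fully white.

    x: (num_frames, num_channels). Channels are split into num_groups
    contiguous groups; per group we estimate the covariance C and take
    mean(eig(C)**2) / mean(eig(C))**2, which equals 1.0 exactly when C
    is a multiple of the identity and grows as channels correlate.
    """
    num_frames, num_channels = x.shape
    assert num_channels % num_groups == 0
    x = x - x.mean(dim=0, keepdim=True)
    x = x.reshape(num_frames, num_groups, num_channels // num_groups)
    metrics = []
    for g in range(num_groups):
        xg = x[:, g, :]
        cov = xg.T @ xg / num_frames          # (c, c) covariance estimate
        eigs = torch.linalg.eigvalsh(cov)     # real eigenvalues, ascending
        metrics.append((eigs ** 2).mean() / eigs.mean() ** 2)
    return torch.stack(metrics).mean().item()

print(whitening_metric(torch.randn(10000, 64)))  # ~1.0 for white noise
```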
], batch size: 440, lr: 1.18e-03, grad_scale: 32.0 2023-10-14 20:52:07,805 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.501e+02 1.847e+02 2.026e+02 2.209e+02 3.172e+02, threshold=4.053e+02, percent-clipped=0.0 2023-10-14 20:52:22,472 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1807778.0, ans=0.0 2023-10-14 20:52:28,676 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1807824.6666666667, ans=0.0 2023-10-14 20:52:53,073 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1807918.0, ans=0.125 2023-10-14 20:52:54,065 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1807918.0, ans=0.125 2023-10-14 20:53:29,954 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1808058.0, ans=0.0 2023-10-14 20:53:42,533 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1808104.6666666667, ans=0.125 2023-10-14 20:53:44,381 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1808151.3333333333, ans=0.2 2023-10-14 20:53:52,849 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.52 vs. limit=15.0 2023-10-14 20:53:58,826 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1808198.0, ans=0.125 2023-10-14 20:54:00,536 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.547e+02 1.819e+02 1.968e+02 2.171e+02 3.032e+02, threshold=3.935e+02, percent-clipped=0.0 2023-10-14 20:54:00,895 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1808198.0, ans=0.0 2023-10-14 20:54:04,259 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1808198.0, ans=0.125 2023-10-14 20:54:07,040 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1808244.6666666667, ans=0.125 2023-10-14 20:54:08,071 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1808244.6666666667, ans=0.125 2023-10-14 20:54:29,510 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1808338.0, ans=0.0 2023-10-14 20:54:31,713 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.19 vs. limit=15.0 2023-10-14 20:54:34,905 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1808338.0, ans=0.1 2023-10-14 20:54:38,115 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=3.68 vs. 
limit=15.0 2023-10-14 20:54:42,480 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1808384.6666666667, ans=0.125 2023-10-14 20:54:56,827 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1808431.3333333333, ans=0.0 2023-10-14 20:55:00,323 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1808478.0, ans=0.125 2023-10-14 20:55:12,625 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1808524.6666666667, ans=0.2 2023-10-14 20:55:19,976 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1808571.3333333333, ans=0.1 2023-10-14 20:55:34,497 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1808618.0, ans=0.125 2023-10-14 20:55:38,514 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1808618.0, ans=0.0 2023-10-14 20:55:39,541 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1808618.0, ans=0.125 2023-10-14 20:55:48,012 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.579e+02 1.841e+02 2.072e+02 2.300e+02 3.120e+02, threshold=4.145e+02, percent-clipped=0.0 2023-10-14 20:55:55,088 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1808711.3333333333, ans=0.0 2023-10-14 20:55:58,349 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1808711.3333333333, ans=0.07 2023-10-14 20:56:05,078 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1808758.0, ans=0.0 2023-10-14 20:56:08,595 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1808758.0, ans=0.0 2023-10-14 20:56:18,247 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1808804.6666666667, ans=0.125 2023-10-14 20:56:28,395 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1808851.3333333333, ans=0.125 2023-10-14 20:56:29,884 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1808851.3333333333, ans=0.2 2023-10-14 20:56:39,503 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.42 vs. 
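The balancer fields above (prob, min_positive, max_positive, min_abs, max_abs) belong to modules that keep per-channel activation statistics inside a target range, and that run only on a random fraction prob of batches, prob itself being scheduled (hence the ans=0.125 values). Below is a toy check of the statistics such a module constrains; the function name and thresholds are assumptions for illustration, and the real module corrects violations through the backward pass rather than merely reporting them.

```python
import torch

def balancer_violations(
    x: torch.Tensor,
    min_positive: float = 0.05,
    max_positive: float = 0.95,
    max_abs: float = 10.0,
) -> dict:
    """Check the per-channel statistics a balancer constrains.

    x: (num_frames, num_channels). Returns masks of channels whose
    fraction of positive values, or mean absolute value, falls outside
    the configured range.
    """
    frac_positive = (x > 0).float().mean(dim=0)
    mean_abs = x.abs().mean(dim=0)
    return {
        "too_few_positive": frac_positive < min_positive,
        "too_many_positive": frac_positive > max_positive,
        "too_large": mean_abs > max_abs,
    }

stats = balancer_violations(torch.randn(1000, 256))
print({k: int(v.sum()) for k, v in stats.items()})
```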
limit=15.0 2023-10-14 20:56:47,391 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1808898.0, ans=0.015 2023-10-14 20:56:58,469 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1808944.6666666667, ans=0.0 2023-10-14 20:57:22,029 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1809038.0, ans=0.125 2023-10-14 20:57:27,326 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1809084.6666666667, ans=0.0 2023-10-14 20:57:35,037 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1809131.3333333333, ans=0.125 2023-10-14 20:57:45,839 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.550e+02 1.802e+02 1.981e+02 2.214e+02 2.648e+02, threshold=3.962e+02, percent-clipped=0.0 2023-10-14 20:57:47,426 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.32 vs. limit=22.5 2023-10-14 20:57:57,556 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.14 vs. limit=22.5 2023-10-14 20:58:07,348 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=1809224.6666666667, ans=0.025 2023-10-14 20:58:23,609 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.71 vs. limit=15.0 2023-10-14 20:58:25,289 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1809318.0, ans=0.125 2023-10-14 20:58:35,222 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1809364.6666666667, ans=0.125 2023-10-14 20:58:39,937 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1809364.6666666667, ans=0.125 2023-10-14 20:58:47,812 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1809411.3333333333, ans=0.125 2023-10-14 20:58:55,022 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1809458.0, ans=0.0 2023-10-14 20:58:56,537 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1809458.0, ans=0.125 2023-10-14 20:59:02,533 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.01 vs. 
limit=15.0 2023-10-14 20:59:10,741 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1809504.6666666667, ans=0.125 2023-10-14 20:59:18,389 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1809551.3333333333, ans=0.1 2023-10-14 20:59:32,730 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.551e+02 1.747e+02 1.887e+02 2.056e+02 3.223e+02, threshold=3.775e+02, percent-clipped=0.0 2023-10-14 20:59:45,431 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1809644.6666666667, ans=0.125 2023-10-14 20:59:46,364 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1809644.6666666667, ans=0.125 2023-10-14 21:00:08,137 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1809738.0, ans=0.0 2023-10-14 21:00:08,149 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1809738.0, ans=0.125 2023-10-14 21:00:13,894 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1809784.6666666667, ans=0.2 2023-10-14 21:00:35,221 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1809878.0, ans=0.125 2023-10-14 21:00:36,277 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 21:00:37,301 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1809878.0, ans=0.125 2023-10-14 21:00:40,259 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.51 vs. limit=10.0 2023-10-14 21:00:40,750 INFO [train.py:1031] (3/4) Epoch 29, batch 5500, loss[loss=0.187, simple_loss=0.2769, pruned_loss=0.04855, over 16032.00 frames. ], tot_loss[loss=0.1845, simple_loss=0.2769, pruned_loss=0.04608, over 30756464.56 frames. ], batch size: 296, lr: 1.18e-03, grad_scale: 16.0 2023-10-14 21:00:58,591 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1809971.3333333333, ans=0.0 2023-10-14 21:01:09,984 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1810018.0, ans=0.0 2023-10-14 21:01:17,123 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.60 vs. limit=15.0 2023-10-14 21:01:17,499 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.533e+02 1.834e+02 1.953e+02 2.151e+02 2.674e+02, threshold=3.906e+02, percent-clipped=0.0 2023-10-14 21:01:26,580 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1810111.3333333333, ans=0.125 2023-10-14 21:01:58,567 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 21:02:05,703 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.81 vs. 
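A consistency check on the loss records: throughout this section the logged totals satisfy tot = 0.5 x simple_loss + pruned_loss (the 0.5 weight is inferred from the numbers themselves, not stated in the log), and tot_loss appears to be a frame-weighted running average, which is why its frame count grows from summary to summary. A worked check against the Epoch 29, batch 5500 summary above:

```python
# Hedged reconstruction of how the logged numbers relate; the 0.5
# weight on simple_loss is inferred from the values, not from the log.
simple_loss, pruned_loss = 0.2769, 0.04608   # Epoch 29, batch 5500 tot_loss
tot = 0.5 * simple_loss + pruned_loss
print(round(tot, 4))  # 0.1845, matching the logged tot_loss
```

The same relation holds for the per-batch figures, e.g. 0.5 x 0.2769 + 0.04855 = 0.187 for the batch 5500 sample loss.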
limit=15.0 2023-10-14 21:02:08,683 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1810298.0, ans=0.025 2023-10-14 21:02:26,681 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1810391.3333333333, ans=0.0 2023-10-14 21:02:30,675 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=3.77 vs. limit=12.0 2023-10-14 21:02:34,100 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1810391.3333333333, ans=0.0 2023-10-14 21:02:39,453 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1810438.0, ans=0.05 2023-10-14 21:02:39,517 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1810438.0, ans=0.0 2023-10-14 21:02:40,614 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1810438.0, ans=0.0 2023-10-14 21:02:46,983 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1810484.6666666667, ans=0.125 2023-10-14 21:02:59,712 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1810531.3333333333, ans=0.0 2023-10-14 21:03:03,097 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.635e+02 1.810e+02 1.992e+02 2.250e+02 3.650e+02, threshold=3.984e+02, percent-clipped=0.0 2023-10-14 21:03:13,074 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1810578.0, ans=0.125 2023-10-14 21:03:21,384 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.61 vs. limit=22.5 2023-10-14 21:03:42,782 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1810718.0, ans=0.125 2023-10-14 21:03:53,286 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1810764.6666666667, ans=0.125 2023-10-14 21:04:12,362 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1810811.3333333333, ans=0.1 2023-10-14 21:04:33,015 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1810904.6666666667, ans=0.0 2023-10-14 21:04:57,833 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.602e+02 1.873e+02 2.032e+02 2.299e+02 3.087e+02, threshold=4.064e+02, percent-clipped=0.0 2023-10-14 21:04:59,196 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1810998.0, ans=0.125 2023-10-14 21:05:02,991 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.50 vs. limit=15.0 2023-10-14 21:05:28,096 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=12.33 vs. 
limit=22.5 2023-10-14 21:05:30,165 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.79 vs. limit=22.5 2023-10-14 21:05:31,044 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.13 vs. limit=15.0 2023-10-14 21:05:34,676 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.60 vs. limit=15.0 2023-10-14 21:05:51,116 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1811231.3333333333, ans=0.2 2023-10-14 21:05:55,519 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1811231.3333333333, ans=0.1 2023-10-14 21:05:58,445 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1811278.0, ans=0.125 2023-10-14 21:06:02,364 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1811278.0, ans=0.0 2023-10-14 21:06:03,351 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1811278.0, ans=0.0 2023-10-14 21:06:12,115 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.59 vs. limit=6.0 2023-10-14 21:06:16,970 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1811324.6666666667, ans=0.0 2023-10-14 21:06:24,353 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1811371.3333333333, ans=0.125 2023-10-14 21:06:40,055 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1811418.0, ans=0.035 2023-10-14 21:06:49,165 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.495e+02 1.801e+02 1.999e+02 2.250e+02 3.629e+02, threshold=3.999e+02, percent-clipped=0.0 2023-10-14 21:07:08,530 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1811558.0, ans=0.2 2023-10-14 21:07:21,082 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1811604.6666666667, ans=0.125 2023-10-14 21:07:29,738 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1811651.3333333333, ans=0.0 2023-10-14 21:07:33,094 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.90 vs. 
limit=12.0 2023-10-14 21:07:40,689 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1811698.0, ans=0.125 2023-10-14 21:07:43,455 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1811698.0, ans=0.125 2023-10-14 21:07:46,243 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1811698.0, ans=0.0 2023-10-14 21:08:00,818 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1811791.3333333333, ans=0.0 2023-10-14 21:08:08,342 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.91 vs. limit=15.0 2023-10-14 21:08:19,307 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1811838.0, ans=0.5 2023-10-14 21:08:19,629 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.48 vs. limit=22.5 2023-10-14 21:08:23,330 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1811884.6666666667, ans=0.1 2023-10-14 21:08:36,171 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1811931.3333333333, ans=0.0 2023-10-14 21:08:41,370 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.68 vs. limit=12.0 2023-10-14 21:08:42,100 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.490e+02 1.866e+02 2.032e+02 2.352e+02 4.379e+02, threshold=4.064e+02, percent-clipped=1.0 2023-10-14 21:08:53,279 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 21:09:03,520 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.79 vs. limit=15.0 2023-10-14 21:09:04,424 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.06 vs. limit=15.0 2023-10-14 21:09:28,788 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1812164.6666666667, ans=0.0 2023-10-14 21:09:43,541 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1812211.3333333333, ans=0.2 2023-10-14 21:09:49,401 INFO [train.py:1031] (3/4) Epoch 29, batch 6000, loss[loss=0.1823, simple_loss=0.2763, pruned_loss=0.04411, over 16591.00 frames. ], tot_loss[loss=0.1848, simple_loss=0.2772, pruned_loss=0.04621, over 31234852.96 frames. 
], batch size: 56, lr: 1.18e-03, grad_scale: 32.0 2023-10-14 21:09:52,883 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1812258.0, ans=0.0 2023-10-14 21:10:29,677 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.585e+02 1.877e+02 1.996e+02 2.206e+02 2.771e+02, threshold=3.992e+02, percent-clipped=0.0 2023-10-14 21:10:39,476 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 21:10:39,588 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1812444.6666666667, ans=0.125 2023-10-14 21:11:04,553 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1812584.6666666667, ans=0.2 2023-10-14 21:11:08,139 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1812584.6666666667, ans=0.125 2023-10-14 21:11:10,008 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-14 21:11:15,397 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.80 vs. limit=15.0 2023-10-14 21:11:49,587 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.max_abs, batch_count=1812771.3333333333, ans=10.0 2023-10-14 21:12:15,899 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.571e+02 1.819e+02 1.970e+02 2.194e+02 2.752e+02, threshold=3.939e+02, percent-clipped=0.0 2023-10-14 21:12:27,989 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.62 vs. limit=22.5 2023-10-14 21:12:47,853 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1813004.6666666667, ans=0.1 2023-10-14 21:12:50,082 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.63 vs. limit=15.0 2023-10-14 21:12:50,170 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.09 vs. limit=10.0 2023-10-14 21:13:18,937 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 21:13:21,909 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=4.98 vs. limit=15.0 2023-10-14 21:13:29,289 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.40 vs. 
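The assorted skip_rate and layerdrop_rate entries above are stochastic-depth style probabilities, also on schedules: with the given probability a submodule's residual contribution is dropped for that batch, and by this late stage of training many have annealed to 0.0 while a few (e.g. bypass.skip_rate) remain small but nonzero. The bypass scale_min entries appear to floor a learned interpolation weight; the sketch below covers only the stochastic skip, and StochasticBypass is an assumed name, not the zipformer module.

```python
import torch
from torch import nn

class StochasticBypass(nn.Module):
    """Skip a submodule's residual contribution with probability skip_rate.

    Generic stand-in for the *_skip_rate values in the log: during
    training the module's output is dropped (bypassed) on a random
    subset of batches; at eval time it is always applied.
    """

    def __init__(self, module: nn.Module, skip_rate: float = 0.07):
        super().__init__()
        self.module = module
        self.skip_rate = skip_rate  # in icefall this would be scheduled

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training and torch.rand(()) < self.skip_rate:
            return x                  # bypass: identity for this batch
        return x + self.module(x)     # normal residual path

layer = StochasticBypass(nn.Linear(256, 256), skip_rate=0.07)
y = layer(torch.randn(10, 256))       # randomly skipped while training
```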
limit=15.0 2023-10-14 21:13:52,317 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1813284.6666666667, ans=0.0 2023-10-14 21:13:59,121 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1813331.3333333333, ans=0.0 2023-10-14 21:14:04,088 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.598e+02 1.968e+02 2.138e+02 2.510e+02 4.092e+02, threshold=4.276e+02, percent-clipped=1.0 2023-10-14 21:14:15,822 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1813378.0, ans=0.0 2023-10-14 21:14:18,896 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1813424.6666666667, ans=0.2 2023-10-14 21:14:26,611 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1813424.6666666667, ans=0.125 2023-10-14 21:14:42,554 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=6.14 vs. limit=15.0 2023-10-14 21:14:54,175 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1813564.6666666667, ans=0.125 2023-10-14 21:14:55,564 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1813564.6666666667, ans=0.0 2023-10-14 21:15:00,403 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1813564.6666666667, ans=0.125 2023-10-14 21:15:04,883 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1813611.3333333333, ans=0.07 2023-10-14 21:15:15,969 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1813658.0, ans=0.125 2023-10-14 21:15:29,833 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.42 vs. limit=15.0 2023-10-14 21:15:44,480 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1813751.3333333333, ans=0.0 2023-10-14 21:15:56,971 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.656e+02 1.852e+02 2.035e+02 2.232e+02 3.095e+02, threshold=4.070e+02, percent-clipped=0.0 2023-10-14 21:16:10,776 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.36 vs. 
limit=15.0 2023-10-14 21:16:14,962 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1813891.3333333333, ans=0.2 2023-10-14 21:16:25,568 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1813938.0, ans=0.0 2023-10-14 21:16:54,102 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1814031.3333333333, ans=0.0 2023-10-14 21:16:54,196 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1814031.3333333333, ans=0.0 2023-10-14 21:16:58,007 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1814031.3333333333, ans=0.1 2023-10-14 21:17:15,002 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1814124.6666666667, ans=0.0 2023-10-14 21:17:20,454 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.74 vs. limit=15.0 2023-10-14 21:17:35,425 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=15.04 vs. limit=22.5 2023-10-14 21:17:41,258 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.42 vs. limit=15.0 2023-10-14 21:17:42,823 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1814218.0, ans=0.125 2023-10-14 21:17:57,085 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.595e+02 1.851e+02 2.053e+02 2.310e+02 3.189e+02, threshold=4.107e+02, percent-clipped=0.0 2023-10-14 21:18:16,475 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1814358.0, ans=0.0 2023-10-14 21:18:23,026 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.45 vs. limit=15.0 2023-10-14 21:18:38,048 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1814451.3333333333, ans=0.0 2023-10-14 21:18:45,911 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.93 vs. limit=15.0 2023-10-14 21:18:57,365 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1814544.6666666667, ans=0.125 2023-10-14 21:18:58,539 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.50 vs. limit=6.0 2023-10-14 21:19:02,466 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1814544.6666666667, ans=0.125 2023-10-14 21:19:08,621 INFO [train.py:1031] (3/4) Epoch 29, batch 6500, loss[loss=0.207, simple_loss=0.2977, pruned_loss=0.0581, over 16954.00 frames. ], tot_loss[loss=0.1851, simple_loss=0.2776, pruned_loss=0.04635, over 31570856.67 frames. 
], batch size: 138, lr: 1.18e-03, grad_scale: 32.0 2023-10-14 21:19:13,962 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1814591.3333333333, ans=0.1 2023-10-14 21:19:15,174 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1814591.3333333333, ans=0.1 2023-10-14 21:19:42,869 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=1814684.6666666667, ans=15.0 2023-10-14 21:20:01,162 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.465e+02 1.902e+02 2.090e+02 2.254e+02 4.148e+02, threshold=4.179e+02, percent-clipped=1.0 2023-10-14 21:20:18,479 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1814824.6666666667, ans=0.0 2023-10-14 21:20:29,976 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1814871.3333333333, ans=0.125 2023-10-14 21:20:35,539 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1814871.3333333333, ans=0.2 2023-10-14 21:20:46,719 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1814918.0, ans=0.0 2023-10-14 21:20:57,926 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.28 vs. limit=15.0 2023-10-14 21:21:10,671 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1815058.0, ans=0.0 2023-10-14 21:21:49,330 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1815198.0, ans=0.0 2023-10-14 21:21:51,838 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.545e+02 1.881e+02 2.032e+02 2.279e+02 3.221e+02, threshold=4.065e+02, percent-clipped=0.0 2023-10-14 21:22:03,931 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1815291.3333333333, ans=0.2 2023-10-14 21:22:53,611 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1815478.0, ans=0.125 2023-10-14 21:23:11,675 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.26 vs. limit=22.5 2023-10-14 21:23:26,103 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=1815618.0, ans=0.95 2023-10-14 21:23:35,228 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.78 vs. 
limit=15.0 2023-10-14 21:23:41,715 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.521e+02 1.810e+02 1.936e+02 2.218e+02 2.692e+02, threshold=3.873e+02, percent-clipped=0.0 2023-10-14 21:23:53,343 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1815758.0, ans=0.1 2023-10-14 21:24:29,664 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 21:24:34,127 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1815898.0, ans=0.125 2023-10-14 21:24:46,779 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1815944.6666666667, ans=0.125 2023-10-14 21:25:11,635 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1816038.0, ans=0.125 2023-10-14 21:25:44,389 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1816131.3333333333, ans=0.125 2023-10-14 21:25:47,173 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.555e+02 1.791e+02 1.994e+02 2.227e+02 4.307e+02, threshold=3.988e+02, percent-clipped=1.0 2023-10-14 21:26:26,379 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1816318.0, ans=0.1 2023-10-14 21:26:45,070 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1816364.6666666667, ans=0.125 2023-10-14 21:27:08,049 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1816504.6666666667, ans=0.1 2023-10-14 21:27:15,973 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1816504.6666666667, ans=0.1 2023-10-14 21:27:24,712 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 21:27:26,859 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.56 vs. 
limit=15.0 2023-10-14 21:27:39,194 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.542e+02 1.845e+02 2.016e+02 2.288e+02 2.929e+02, threshold=4.033e+02, percent-clipped=0.0 2023-10-14 21:27:49,263 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1816691.3333333333, ans=0.125 2023-10-14 21:27:57,935 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=1816691.3333333333, ans=22.5 2023-10-14 21:28:04,515 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1816738.0, ans=0.125 2023-10-14 21:28:05,389 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1816738.0, ans=0.125 2023-10-14 21:28:05,780 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=1816738.0, ans=15.0 2023-10-14 21:28:06,403 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1816738.0, ans=0.125 2023-10-14 21:28:42,083 INFO [train.py:1031] (3/4) Epoch 29, batch 7000, loss[loss=0.2049, simple_loss=0.2989, pruned_loss=0.05547, over 16887.00 frames. ], tot_loss[loss=0.1852, simple_loss=0.278, pruned_loss=0.04626, over 31859995.01 frames. ], batch size: 155, lr: 1.18e-03, grad_scale: 16.0 2023-10-14 21:28:42,468 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1816924.6666666667, ans=0.0 2023-10-14 21:28:43,340 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1816924.6666666667, ans=0.0 2023-10-14 21:28:54,917 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1816971.3333333333, ans=0.125 2023-10-14 21:29:04,933 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.73 vs. limit=12.0 2023-10-14 21:29:27,520 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.611e+02 1.869e+02 2.054e+02 2.300e+02 3.218e+02, threshold=4.108e+02, percent-clipped=0.0 2023-10-14 21:29:43,390 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1817158.0, ans=0.0 2023-10-14 21:29:56,161 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1817204.6666666667, ans=0.125 2023-10-14 21:30:01,584 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1817251.3333333333, ans=0.125 2023-10-14 21:30:11,934 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1817298.0, ans=0.0 2023-10-14 21:30:13,270 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.20 vs. 
limit=12.0 2023-10-14 21:30:22,961 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1817344.6666666667, ans=0.0 2023-10-14 21:30:42,834 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1817438.0, ans=0.125 2023-10-14 21:30:58,996 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1817484.6666666667, ans=0.0 2023-10-14 21:31:11,822 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1817531.3333333333, ans=0.125 2023-10-14 21:31:15,862 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.632e+02 1.872e+02 2.048e+02 2.366e+02 3.478e+02, threshold=4.095e+02, percent-clipped=0.0 2023-10-14 21:31:19,257 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1817578.0, ans=0.125 2023-10-14 21:31:29,774 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1817624.6666666667, ans=0.0 2023-10-14 21:31:35,364 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1817624.6666666667, ans=0.0 2023-10-14 21:31:37,256 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1817671.3333333333, ans=0.0 2023-10-14 21:31:38,104 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1817671.3333333333, ans=0.125 2023-10-14 21:31:56,775 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1817764.6666666667, ans=0.1 2023-10-14 21:32:12,128 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.14 vs. limit=22.5 2023-10-14 21:32:41,410 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1817904.6666666667, ans=0.0 2023-10-14 21:32:41,418 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1817904.6666666667, ans=0.125 2023-10-14 21:32:43,541 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.96 vs. limit=15.0 2023-10-14 21:32:57,033 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1817951.3333333333, ans=0.025 2023-10-14 21:33:01,924 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.06 vs. 
limit=12.0 2023-10-14 21:33:05,678 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1817998.0, ans=0.2 2023-10-14 21:33:17,368 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.593e+02 1.779e+02 1.932e+02 2.101e+02 2.782e+02, threshold=3.864e+02, percent-clipped=0.0 2023-10-14 21:33:19,249 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1818044.6666666667, ans=0.2 2023-10-14 21:33:40,812 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.98 vs. limit=15.0 2023-10-14 21:33:42,567 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=3.56 vs. limit=15.0 2023-10-14 21:33:48,086 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1818138.0, ans=0.0 2023-10-14 21:33:53,554 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.69 vs. limit=10.0 2023-10-14 21:33:59,690 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1818184.6666666667, ans=0.1 2023-10-14 21:34:10,148 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1818231.3333333333, ans=0.07 2023-10-14 21:34:30,212 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1818324.6666666667, ans=0.07 2023-10-14 21:34:32,583 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1818324.6666666667, ans=0.125 2023-10-14 21:34:33,756 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.10 vs. limit=15.0 2023-10-14 21:35:12,050 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.524e+02 1.797e+02 1.929e+02 2.183e+02 2.836e+02, threshold=3.858e+02, percent-clipped=0.0 2023-10-14 21:35:20,106 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1818511.3333333333, ans=0.125 2023-10-14 21:35:21,134 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1818511.3333333333, ans=0.0 2023-10-14 21:35:39,645 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1818604.6666666667, ans=0.0 2023-10-14 21:35:41,028 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.73 vs. 
limit=15.0 2023-10-14 21:35:43,521 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1818604.6666666667, ans=0.95 2023-10-14 21:35:52,556 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1818651.3333333333, ans=0.07 2023-10-14 21:35:59,757 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1818698.0, ans=0.125 2023-10-14 21:36:05,662 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1818698.0, ans=0.015 2023-10-14 21:36:23,238 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.45 vs. limit=15.0 2023-10-14 21:36:38,272 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1818838.0, ans=0.2 2023-10-14 21:36:40,181 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.78 vs. limit=15.0 2023-10-14 21:36:42,820 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=10.06 vs. limit=22.5 2023-10-14 21:37:01,881 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1818931.3333333333, ans=0.5 2023-10-14 21:37:04,910 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.531e+02 1.878e+02 2.109e+02 2.518e+02 3.727e+02, threshold=4.218e+02, percent-clipped=0.0 2023-10-14 21:37:20,032 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1819024.6666666667, ans=0.1 2023-10-14 21:37:20,826 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1819024.6666666667, ans=0.2 2023-10-14 21:37:27,308 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1819071.3333333333, ans=0.07 2023-10-14 21:37:27,559 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.62 vs. limit=15.0 2023-10-14 21:37:46,860 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1819164.6666666667, ans=0.0 2023-10-14 21:38:09,181 INFO [train.py:1031] (3/4) Epoch 29, batch 7500, loss[loss=0.1803, simple_loss=0.2562, pruned_loss=0.05218, over 12345.00 frames. ], tot_loss[loss=0.1851, simple_loss=0.2777, pruned_loss=0.04625, over 32027465.73 frames. ], batch size: 440, lr: 1.18e-03, grad_scale: 16.0 2023-10-14 21:38:19,468 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.19 vs. 
limit=15.0 2023-10-14 21:38:23,780 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1819304.6666666667, ans=0.125 2023-10-14 21:38:29,807 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1819351.3333333333, ans=0.0 2023-10-14 21:38:39,502 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1819351.3333333333, ans=0.125 2023-10-14 21:38:44,828 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1819398.0, ans=0.125 2023-10-14 21:38:53,424 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.485e+02 1.878e+02 2.070e+02 2.311e+02 3.269e+02, threshold=4.140e+02, percent-clipped=0.0 2023-10-14 21:38:59,441 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1819444.6666666667, ans=0.125 2023-10-14 21:39:04,359 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1819444.6666666667, ans=0.0 2023-10-14 21:39:14,349 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1819491.3333333333, ans=0.125 2023-10-14 21:39:22,608 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=5.15 vs. limit=15.0 2023-10-14 21:39:46,188 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.94 vs. limit=12.0 2023-10-14 21:39:48,197 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1819631.3333333333, ans=0.125 2023-10-14 21:40:22,784 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1819818.0, ans=0.07 2023-10-14 21:40:37,820 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.37 vs. limit=15.0 2023-10-14 21:40:43,090 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1819864.6666666667, ans=0.0 2023-10-14 21:40:44,835 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1819864.6666666667, ans=0.07 2023-10-14 21:40:44,901 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1819864.6666666667, ans=0.1 2023-10-14 21:40:54,627 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.75 vs. limit=10.0 2023-10-14 21:40:56,818 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.14 vs. 
limit=15.0 2023-10-14 21:40:57,171 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.488e+02 1.819e+02 1.912e+02 2.113e+02 3.176e+02, threshold=3.823e+02, percent-clipped=0.0 2023-10-14 21:40:59,122 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1819911.3333333333, ans=0.2 2023-10-14 21:41:02,196 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.22 vs. limit=15.0 2023-10-14 21:41:06,849 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1819911.3333333333, ans=0.0 2023-10-14 21:41:09,893 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 21:41:21,614 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=14.28 vs. limit=22.5 2023-10-14 21:41:31,275 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1820051.3333333333, ans=0.0 2023-10-14 21:41:32,273 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1820051.3333333333, ans=0.1 2023-10-14 21:41:45,929 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.27 vs. limit=15.0 2023-10-14 21:42:00,208 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 21:42:03,071 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 21:42:06,908 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.00 vs. limit=15.0 2023-10-14 21:42:13,051 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1820191.3333333333, ans=0.125 2023-10-14 21:42:18,899 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1820238.0, ans=0.0 2023-10-14 21:42:20,164 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1820238.0, ans=0.2 2023-10-14 21:42:23,316 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1820238.0, ans=0.125 2023-10-14 21:42:23,403 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1820238.0, ans=0.1 2023-10-14 21:42:27,519 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1820284.6666666667, ans=0.0 2023-10-14 21:42:36,080 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1820284.6666666667, ans=0.1 2023-10-14 21:42:42,317 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.05 vs. 
limit=15.0 2023-10-14 21:42:49,019 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.503e+02 1.819e+02 2.007e+02 2.226e+02 3.209e+02, threshold=4.014e+02, percent-clipped=0.0 2023-10-14 21:43:12,532 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1820471.3333333333, ans=0.0 2023-10-14 21:43:19,664 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1820518.0, ans=0.2 2023-10-14 21:43:27,380 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1820518.0, ans=0.0 2023-10-14 21:44:01,518 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1820704.6666666667, ans=0.125 2023-10-14 21:44:02,800 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.65 vs. limit=6.0 2023-10-14 21:44:03,377 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=1820704.6666666667, ans=10.0 2023-10-14 21:44:03,433 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1820704.6666666667, ans=0.125 2023-10-14 21:44:07,209 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1820704.6666666667, ans=0.2 2023-10-14 21:44:29,939 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1820798.0, ans=0.5 2023-10-14 21:44:43,055 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1820844.6666666667, ans=0.125 2023-10-14 21:44:43,584 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.626e+02 1.923e+02 2.075e+02 2.342e+02 3.631e+02, threshold=4.150e+02, percent-clipped=0.0 2023-10-14 21:44:44,906 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1820844.6666666667, ans=0.125 2023-10-14 21:44:46,069 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1820844.6666666667, ans=0.0 2023-10-14 21:44:57,565 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1820891.3333333333, ans=0.05 2023-10-14 21:45:03,276 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.80 vs. 
limit=22.5 2023-10-14 21:45:07,005 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1820938.0, ans=0.125 2023-10-14 21:45:13,208 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1820938.0, ans=0.04949747468305833 2023-10-14 21:45:30,383 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1821031.3333333333, ans=0.2 2023-10-14 21:45:44,798 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1821078.0, ans=0.0 2023-10-14 21:46:08,254 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1821171.3333333333, ans=0.0 2023-10-14 21:46:22,551 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1821218.0, ans=0.0 2023-10-14 21:46:33,966 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1821264.6666666667, ans=0.125 2023-10-14 21:46:37,803 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.530e+02 1.775e+02 1.901e+02 2.129e+02 2.591e+02, threshold=3.801e+02, percent-clipped=0.0 2023-10-14 21:46:56,936 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1821358.0, ans=0.0 2023-10-14 21:47:02,417 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=15.52 vs. limit=22.5 2023-10-14 21:47:05,957 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1821404.6666666667, ans=0.1 2023-10-14 21:47:16,553 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1821451.3333333333, ans=0.1 2023-10-14 21:47:19,394 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1821451.3333333333, ans=0.0 2023-10-14 21:47:35,839 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1821544.6666666667, ans=0.0 2023-10-14 21:47:40,780 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1821544.6666666667, ans=0.035 2023-10-14 21:47:44,171 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1821544.6666666667, ans=0.125 2023-10-14 21:47:45,765 INFO [train.py:1031] (3/4) Epoch 29, batch 8000, loss[loss=0.1751, simple_loss=0.2734, pruned_loss=0.03843, over 16914.00 frames. ], tot_loss[loss=0.1845, simple_loss=0.2773, pruned_loss=0.0459, over 32208885.43 frames. 
], batch size: 72, lr: 1.18e-03, grad_scale: 16.0 2023-10-14 21:47:49,444 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1821591.3333333333, ans=0.125 2023-10-14 21:47:53,511 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1821591.3333333333, ans=0.125 2023-10-14 21:47:56,216 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1821638.0, ans=0.04949747468305833 2023-10-14 21:48:23,251 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1821731.3333333333, ans=0.125 2023-10-14 21:48:26,019 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1821731.3333333333, ans=0.125 2023-10-14 21:48:29,934 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.513e+02 1.760e+02 1.932e+02 2.230e+02 3.383e+02, threshold=3.864e+02, percent-clipped=0.0 2023-10-14 21:48:39,910 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1821824.6666666667, ans=0.125 2023-10-14 21:48:59,037 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.25 vs. limit=15.0 2023-10-14 21:49:03,106 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1821918.0, ans=0.0 2023-10-14 21:49:17,649 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.64 vs. limit=15.0 2023-10-14 21:49:30,427 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1822011.3333333333, ans=0.125 2023-10-14 21:49:31,239 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1822058.0, ans=0.125 2023-10-14 21:49:38,558 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1822058.0, ans=0.125 2023-10-14 21:49:54,124 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1822151.3333333333, ans=0.125 2023-10-14 21:50:16,677 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.610e+02 1.810e+02 1.954e+02 2.172e+02 2.963e+02, threshold=3.908e+02, percent-clipped=0.0 2023-10-14 21:50:39,697 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.41 vs. 
limit=15.0 2023-10-14 21:50:49,430 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1822338.0, ans=0.125 2023-10-14 21:50:58,006 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1822338.0, ans=0.07 2023-10-14 21:51:07,583 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1822384.6666666667, ans=0.125 2023-10-14 21:51:07,613 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1822384.6666666667, ans=0.0 2023-10-14 21:51:14,687 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1822431.3333333333, ans=0.125 2023-10-14 21:51:27,894 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1822478.0, ans=0.04949747468305833 2023-10-14 21:51:32,914 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1822478.0, ans=0.125 2023-10-14 21:51:44,246 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1822524.6666666667, ans=0.0 2023-10-14 21:51:51,569 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1822571.3333333333, ans=0.1 2023-10-14 21:52:04,626 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1822618.0, ans=0.05 2023-10-14 21:52:06,736 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1822618.0, ans=0.1 2023-10-14 21:52:12,608 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.16 vs. limit=22.5 2023-10-14 21:52:23,019 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.560e+02 1.833e+02 1.986e+02 2.159e+02 3.578e+02, threshold=3.973e+02, percent-clipped=0.0 2023-10-14 21:52:25,802 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1822711.3333333333, ans=0.125 2023-10-14 21:52:53,601 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1822851.3333333333, ans=0.0 2023-10-14 21:52:58,058 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 21:53:00,083 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1822851.3333333333, ans=0.125 2023-10-14 21:53:02,205 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1822851.3333333333, ans=0.0 2023-10-14 21:53:39,322 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1823038.0, ans=0.125 2023-10-14 21:53:49,148 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1823038.0, ans=0.0 2023-10-14 21:53:51,184 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.70 vs. 
limit=12.0 2023-10-14 21:54:02,596 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1823131.3333333333, ans=0.125 2023-10-14 21:54:10,669 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1823131.3333333333, ans=0.125 2023-10-14 21:54:17,318 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.500e+02 1.910e+02 2.081e+02 2.413e+02 3.419e+02, threshold=4.162e+02, percent-clipped=0.0 2023-10-14 21:54:32,099 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.40 vs. limit=15.0 2023-10-14 21:54:46,840 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1823271.3333333333, ans=0.125 2023-10-14 21:55:23,066 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1823458.0, ans=0.1 2023-10-14 21:55:49,126 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1823551.3333333333, ans=0.125 2023-10-14 21:55:56,499 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.21 vs. limit=22.5 2023-10-14 21:56:03,574 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1823598.0, ans=0.05 2023-10-14 21:56:13,116 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.538e+02 1.876e+02 2.016e+02 2.193e+02 2.736e+02, threshold=4.032e+02, percent-clipped=0.0 2023-10-14 21:56:16,150 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1823644.6666666667, ans=0.125 2023-10-14 21:56:32,875 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1823738.0, ans=0.125 2023-10-14 21:56:39,582 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1823738.0, ans=0.1 2023-10-14 21:56:45,027 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1823784.6666666667, ans=0.0 2023-10-14 21:56:52,670 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1823831.3333333333, ans=0.025 2023-10-14 21:57:06,391 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1823878.0, ans=0.2 2023-10-14 21:57:12,712 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1823878.0, ans=0.0 2023-10-14 21:57:18,853 INFO [train.py:1031] (3/4) Epoch 29, batch 8500, loss[loss=0.1797, simple_loss=0.2761, pruned_loss=0.04171, over 16868.00 frames. ], tot_loss[loss=0.1847, simple_loss=0.2777, pruned_loss=0.0459, over 32355369.89 frames. 
], batch size: 72, lr: 1.18e-03, grad_scale: 32.0 2023-10-14 21:57:23,068 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1823924.6666666667, ans=0.125 2023-10-14 21:57:53,893 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=1824064.6666666667, ans=0.025 2023-10-14 21:57:58,224 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1824064.6666666667, ans=0.0 2023-10-14 21:58:04,683 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.627e+02 1.898e+02 2.013e+02 2.180e+02 2.810e+02, threshold=4.026e+02, percent-clipped=0.0 2023-10-14 21:58:33,296 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.12 vs. limit=15.0 2023-10-14 21:59:05,584 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1824344.6666666667, ans=0.125 2023-10-14 21:59:18,186 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1824391.3333333333, ans=0.5 2023-10-14 21:59:25,761 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.24 vs. limit=15.0 2023-10-14 21:59:33,045 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.07 vs. limit=22.5 2023-10-14 21:59:44,202 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1824484.6666666667, ans=0.0 2023-10-14 21:59:46,425 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1824484.6666666667, ans=0.09899494936611666 2023-10-14 22:00:03,899 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.75 vs. 
limit=15.0 2023-10-14 22:00:04,225 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.474e+02 1.851e+02 2.037e+02 2.320e+02 3.265e+02, threshold=4.075e+02, percent-clipped=0.0 2023-10-14 22:00:16,654 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_na.min_abs, batch_count=1824624.6666666667, ans=0.02 2023-10-14 22:00:21,551 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1824624.6666666667, ans=0.125 2023-10-14 22:00:27,488 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1824671.3333333333, ans=10.0 2023-10-14 22:00:45,842 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1824718.0, ans=0.1 2023-10-14 22:01:09,405 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1824811.3333333333, ans=0.1 2023-10-14 22:01:16,399 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1824858.0, ans=0.0 2023-10-14 22:01:17,881 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1824858.0, ans=0.1 2023-10-14 22:01:36,935 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff2.min_abs, batch_count=1824904.6666666667, ans=0.1 2023-10-14 22:01:59,231 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1824998.0, ans=0.2 2023-10-14 22:02:04,899 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.411e+02 1.803e+02 1.976e+02 2.193e+02 2.931e+02, threshold=3.953e+02, percent-clipped=0.0 2023-10-14 22:02:16,460 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1825091.3333333333, ans=0.125 2023-10-14 22:02:16,660 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.18 vs. limit=15.0 2023-10-14 22:02:20,291 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1825091.3333333333, ans=0.0 2023-10-14 22:02:20,306 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1825091.3333333333, ans=0.1 2023-10-14 22:02:33,820 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1825138.0, ans=0.0 2023-10-14 22:02:52,947 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1825231.3333333333, ans=0.0 2023-10-14 22:03:20,248 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1825324.6666666667, ans=0.125 2023-10-14 22:03:20,397 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.97 vs. 
limit=22.5 2023-10-14 22:03:34,669 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1825418.0, ans=0.0 2023-10-14 22:03:51,680 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1825464.6666666667, ans=0.1 2023-10-14 22:03:55,388 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1825511.3333333333, ans=0.125 2023-10-14 22:03:57,385 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.478e+02 1.767e+02 2.019e+02 2.182e+02 2.934e+02, threshold=4.037e+02, percent-clipped=0.0 2023-10-14 22:03:57,688 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1825511.3333333333, ans=0.125 2023-10-14 22:04:02,329 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1825511.3333333333, ans=0.2 2023-10-14 22:04:39,718 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1825698.0, ans=0.125 2023-10-14 22:04:40,781 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1825698.0, ans=0.0 2023-10-14 22:05:04,629 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1825791.3333333333, ans=0.0 2023-10-14 22:05:06,735 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1825791.3333333333, ans=0.2 2023-10-14 22:05:08,262 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1825791.3333333333, ans=0.125 2023-10-14 22:05:09,231 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1825791.3333333333, ans=0.09899494936611666 2023-10-14 22:05:46,317 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.642e+02 1.912e+02 2.083e+02 2.341e+02 3.545e+02, threshold=4.166e+02, percent-clipped=0.0 2023-10-14 22:06:01,369 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1826024.6666666667, ans=0.1 2023-10-14 22:06:13,028 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.90 vs. limit=12.0 2023-10-14 22:06:27,751 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1826164.6666666667, ans=0.1 2023-10-14 22:06:46,640 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1826211.3333333333, ans=0.0 2023-10-14 22:06:49,822 INFO [train.py:1031] (3/4) Epoch 29, batch 9000, loss[loss=0.1948, simple_loss=0.293, pruned_loss=0.04832, over 16653.00 frames. ], tot_loss[loss=0.1845, simple_loss=0.2772, pruned_loss=0.0459, over 32447754.93 frames. ], batch size: 202, lr: 1.18e-03, grad_scale: 32.0 2023-10-14 22:07:03,602 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.51 vs. 
limit=15.0 2023-10-14 22:07:16,365 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1826351.3333333333, ans=0.1 2023-10-14 22:07:33,634 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.491e+02 1.819e+02 2.009e+02 2.214e+02 3.178e+02, threshold=4.018e+02, percent-clipped=0.0 2023-10-14 22:07:44,047 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1826491.3333333333, ans=0.1 2023-10-14 22:07:47,825 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1826491.3333333333, ans=0.2 2023-10-14 22:07:49,778 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1826491.3333333333, ans=0.125 2023-10-14 22:07:53,759 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1826538.0, ans=0.1 2023-10-14 22:08:04,446 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1826584.6666666667, ans=0.0 2023-10-14 22:08:06,081 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1826584.6666666667, ans=0.125 2023-10-14 22:08:11,373 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1826584.6666666667, ans=0.0 2023-10-14 22:08:16,806 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1826631.3333333333, ans=0.125 2023-10-14 22:08:26,733 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 22:08:40,890 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.45 vs. limit=15.0 2023-10-14 22:08:47,746 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1826771.3333333333, ans=0.125 2023-10-14 22:08:52,132 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1826771.3333333333, ans=0.1 2023-10-14 22:08:53,621 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1826771.3333333333, ans=0.0 2023-10-14 22:08:55,778 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.65 vs. limit=15.0 2023-10-14 22:09:06,484 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1826864.6666666667, ans=0.0 2023-10-14 22:09:07,413 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1826864.6666666667, ans=0.125 2023-10-14 22:09:07,454 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1826864.6666666667, ans=0.125 2023-10-14 22:09:07,788 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.06 vs. 
limit=22.5 2023-10-14 22:09:13,489 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1826864.6666666667, ans=0.125 2023-10-14 22:09:19,687 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.805e+02 2.014e+02 2.223e+02 2.672e+02, threshold=4.029e+02, percent-clipped=0.0 2023-10-14 22:09:38,708 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=14.95 vs. limit=22.5 2023-10-14 22:09:51,052 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.68 vs. limit=15.0 2023-10-14 22:09:52,676 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1827051.3333333333, ans=0.0 2023-10-14 22:09:57,513 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.73 vs. limit=15.0 2023-10-14 22:10:02,049 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=1827098.0, ans=15.0 2023-10-14 22:10:02,140 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.08 vs. limit=15.0 2023-10-14 22:10:06,201 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1827098.0, ans=0.125 2023-10-14 22:10:12,605 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.18 vs. limit=10.0 2023-10-14 22:10:13,592 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=6.65 vs. limit=15.0 2023-10-14 22:10:15,254 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1827144.6666666667, ans=0.125 2023-10-14 22:10:26,143 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1827191.3333333333, ans=0.125 2023-10-14 22:10:51,413 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1827331.3333333333, ans=0.1 2023-10-14 22:10:52,317 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1827331.3333333333, ans=0.125 2023-10-14 22:11:04,310 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.556e+02 1.920e+02 2.063e+02 2.269e+02 4.021e+02, threshold=4.126e+02, percent-clipped=0.0 2023-10-14 22:11:14,947 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1827424.6666666667, ans=0.125 2023-10-14 22:11:22,406 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.76 vs. 
limit=15.0 2023-10-14 22:11:26,193 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 22:11:39,229 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.02 vs. limit=15.0 2023-10-14 22:12:07,171 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1827658.0, ans=0.125 2023-10-14 22:12:11,961 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=1827658.0, ans=15.0 2023-10-14 22:12:24,909 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1827704.6666666667, ans=0.05 2023-10-14 22:12:40,601 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1827798.0, ans=0.125 2023-10-14 22:12:58,736 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.560e+02 1.986e+02 2.202e+02 2.442e+02 3.242e+02, threshold=4.403e+02, percent-clipped=0.0 2023-10-14 22:13:27,095 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.47 vs. limit=15.0 2023-10-14 22:13:39,487 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1827984.6666666667, ans=0.0 2023-10-14 22:13:57,501 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1828078.0, ans=0.125 2023-10-14 22:14:09,354 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1828124.6666666667, ans=0.1 2023-10-14 22:14:13,715 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1828124.6666666667, ans=0.2 2023-10-14 22:14:15,049 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1828124.6666666667, ans=0.125 2023-10-14 22:14:19,439 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1828171.3333333333, ans=0.0 2023-10-14 22:14:25,154 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1828171.3333333333, ans=0.0 2023-10-14 22:14:31,655 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1828218.0, ans=0.0 2023-10-14 22:14:33,491 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.44 vs. 
limit=10.0 2023-10-14 22:14:42,315 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1828264.6666666667, ans=0.0 2023-10-14 22:14:43,281 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1828264.6666666667, ans=0.0 2023-10-14 22:14:43,336 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1828264.6666666667, ans=0.0 2023-10-14 22:14:58,717 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.525e+02 1.889e+02 2.072e+02 2.295e+02 3.905e+02, threshold=4.144e+02, percent-clipped=0.0 2023-10-14 22:14:58,981 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1828311.3333333333, ans=0.95 2023-10-14 22:14:59,929 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1828311.3333333333, ans=0.0 2023-10-14 22:15:13,862 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1828358.0, ans=0.125 2023-10-14 22:15:15,410 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.62 vs. limit=22.5 2023-10-14 22:15:33,861 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1828451.3333333333, ans=0.0 2023-10-14 22:15:41,754 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1828498.0, ans=0.125 2023-10-14 22:15:54,145 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1828544.6666666667, ans=0.125 2023-10-14 22:16:02,843 INFO [train.py:1031] (3/4) Epoch 29, batch 9500, loss[loss=0.1945, simple_loss=0.2895, pruned_loss=0.04977, over 16874.00 frames. ], tot_loss[loss=0.1851, simple_loss=0.278, pruned_loss=0.04606, over 32541374.31 frames. ], batch size: 110, lr: 1.18e-03, grad_scale: 16.0 2023-10-14 22:16:27,320 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1828684.6666666667, ans=0.04949747468305833 2023-10-14 22:16:33,080 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 22:16:40,984 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1828731.3333333333, ans=0.125 2023-10-14 22:16:52,728 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.587e+02 1.854e+02 2.043e+02 2.264e+02 3.097e+02, threshold=4.086e+02, percent-clipped=0.0 2023-10-14 22:16:53,465 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff3.min_abs, batch_count=1828778.0, ans=0.2 2023-10-14 22:16:54,719 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.51 vs. 
limit=15.0 2023-10-14 22:16:58,644 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.whiten.whitening_limit, batch_count=1828824.6666666667, ans=12.0 2023-10-14 22:17:27,105 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.45 vs. limit=6.0 2023-10-14 22:17:37,762 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1828964.6666666667, ans=0.125 2023-10-14 22:17:56,352 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1829058.0, ans=0.0 2023-10-14 22:18:10,104 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1829104.6666666667, ans=0.1 2023-10-14 22:18:18,964 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.09 vs. limit=22.5 2023-10-14 22:18:22,132 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1829151.3333333333, ans=0.125 2023-10-14 22:18:23,104 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1829151.3333333333, ans=0.0 2023-10-14 22:18:30,965 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1829198.0, ans=0.0 2023-10-14 22:18:48,020 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.460e+02 1.848e+02 2.037e+02 2.256e+02 4.855e+02, threshold=4.073e+02, percent-clipped=1.0 2023-10-14 22:18:54,642 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1829291.3333333333, ans=0.1 2023-10-14 22:19:22,636 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1829384.6666666667, ans=0.125 2023-10-14 22:19:30,537 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1829431.3333333333, ans=0.125 2023-10-14 22:19:33,046 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1829431.3333333333, ans=0.125 2023-10-14 22:19:39,892 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1829478.0, ans=0.0 2023-10-14 22:19:54,200 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.26 vs. 
limit=6.0 2023-10-14 22:19:55,448 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1829524.6666666667, ans=0.0 2023-10-14 22:20:39,159 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.459e+02 1.822e+02 2.041e+02 2.250e+02 3.166e+02, threshold=4.082e+02, percent-clipped=0.0 2023-10-14 22:21:32,263 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn2.whiten.whitening_limit, batch_count=1829944.6666666667, ans=22.5 2023-10-14 22:21:35,916 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1829944.6666666667, ans=0.0 2023-10-14 22:22:02,669 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.75 vs. limit=15.0 2023-10-14 22:22:16,761 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1830131.3333333333, ans=0.035 2023-10-14 22:22:24,154 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1830131.3333333333, ans=0.1 2023-10-14 22:22:24,189 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1830131.3333333333, ans=0.125 2023-10-14 22:22:31,102 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.538e+02 1.818e+02 2.024e+02 2.308e+02 3.372e+02, threshold=4.048e+02, percent-clipped=0.0 2023-10-14 22:22:34,218 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1830178.0, ans=0.125 2023-10-14 22:22:46,057 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.65 vs. 
limit=15.0 2023-10-14 22:22:49,930 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1830271.3333333333, ans=0.1 2023-10-14 22:23:04,084 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1830318.0, ans=0.0 2023-10-14 22:23:08,132 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1830364.6666666667, ans=0.2 2023-10-14 22:23:10,680 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1830364.6666666667, ans=0.1 2023-10-14 22:23:15,034 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1830364.6666666667, ans=0.1 2023-10-14 22:23:33,696 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1830458.0, ans=0.125 2023-10-14 22:23:34,546 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1830458.0, ans=0.125 2023-10-14 22:23:37,413 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1830458.0, ans=0.0 2023-10-14 22:23:38,228 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1830458.0, ans=0.125 2023-10-14 22:23:39,772 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1830458.0, ans=0.0 2023-10-14 22:23:46,385 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1830504.6666666667, ans=0.0 2023-10-14 22:23:46,448 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1830504.6666666667, ans=0.125 2023-10-14 22:23:47,268 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1830504.6666666667, ans=0.125 2023-10-14 22:23:58,514 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1830551.3333333333, ans=0.1 2023-10-14 22:23:58,713 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1830551.3333333333, ans=0.0 2023-10-14 22:24:03,172 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1830598.0, ans=0.0 2023-10-14 22:24:19,696 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.535e+02 1.851e+02 1.996e+02 2.210e+02 3.664e+02, threshold=3.992e+02, percent-clipped=0.0 2023-10-14 22:24:32,927 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1830691.3333333333, ans=0.125 2023-10-14 22:24:40,791 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1830738.0, ans=0.1 2023-10-14 22:24:43,177 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.45 vs. 
limit=6.0 2023-10-14 22:24:57,272 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1830831.3333333333, ans=0.125 2023-10-14 22:24:59,922 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.26 vs. limit=6.0 2023-10-14 22:25:17,801 INFO [train.py:1031] (3/4) Epoch 29, batch 10000, loss[loss=0.1919, simple_loss=0.2823, pruned_loss=0.05074, over 16989.00 frames. ], tot_loss[loss=0.1843, simple_loss=0.2771, pruned_loss=0.04576, over 32582201.36 frames. ], batch size: 123, lr: 1.18e-03, grad_scale: 32.0 2023-10-14 22:25:30,944 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=5.52 vs. limit=15.0 2023-10-14 22:25:35,529 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1830971.3333333333, ans=0.0 2023-10-14 22:25:37,362 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1830971.3333333333, ans=0.0 2023-10-14 22:25:44,692 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1831018.0, ans=0.125 2023-10-14 22:25:46,799 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.63 vs. limit=15.0 2023-10-14 22:25:52,408 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1831064.6666666667, ans=0.1 2023-10-14 22:25:53,647 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1831064.6666666667, ans=0.0 2023-10-14 22:26:00,652 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1831111.3333333333, ans=0.1 2023-10-14 22:26:06,098 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.606e+02 1.874e+02 2.050e+02 2.392e+02 3.302e+02, threshold=4.100e+02, percent-clipped=0.0 2023-10-14 22:26:17,081 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1831158.0, ans=0.0 2023-10-14 22:26:52,898 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1831298.0, ans=0.1 2023-10-14 22:27:14,001 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1831391.3333333333, ans=0.125 2023-10-14 22:27:21,673 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1831438.0, ans=0.0 2023-10-14 22:27:28,505 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.11 vs. limit=15.0 2023-10-14 22:27:42,776 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.12 vs. 
limit=6.0 2023-10-14 22:27:57,512 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.528e+02 1.891e+02 2.046e+02 2.257e+02 2.931e+02, threshold=4.093e+02, percent-clipped=0.0 2023-10-14 22:27:58,530 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1831578.0, ans=0.0 2023-10-14 22:27:58,605 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1831578.0, ans=0.0 2023-10-14 22:28:26,156 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.66 vs. limit=15.0 2023-10-14 22:28:52,631 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1831811.3333333333, ans=0.0 2023-10-14 22:28:56,966 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1831811.3333333333, ans=0.125 2023-10-14 22:29:02,271 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.17 vs. limit=15.0 2023-10-14 22:29:21,982 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1831904.6666666667, ans=0.125 2023-10-14 22:29:24,033 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1831904.6666666667, ans=0.125 2023-10-14 22:29:35,865 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1831998.0, ans=0.125 2023-10-14 22:29:42,394 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.62 vs. limit=15.0 2023-10-14 22:29:50,710 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1832044.6666666667, ans=0.125 2023-10-14 22:29:50,992 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.33 vs. 
limit=22.5 2023-10-14 22:29:51,359 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.566e+02 1.868e+02 2.041e+02 2.283e+02 3.467e+02, threshold=4.083e+02, percent-clipped=0.0 2023-10-14 22:30:13,934 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1832138.0, ans=0.1 2023-10-14 22:30:33,445 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1832231.3333333333, ans=0.0 2023-10-14 22:31:00,164 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1832324.6666666667, ans=0.0 2023-10-14 22:31:00,303 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1832324.6666666667, ans=0.0 2023-10-14 22:31:07,311 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1832371.3333333333, ans=0.07 2023-10-14 22:31:14,832 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1832371.3333333333, ans=0.0 2023-10-14 22:31:22,447 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=3.36 vs. limit=15.0 2023-10-14 22:31:46,203 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.478e+02 1.824e+02 1.973e+02 2.191e+02 2.952e+02, threshold=3.947e+02, percent-clipped=0.0 2023-10-14 22:31:53,964 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1832558.0, ans=0.125 2023-10-14 22:31:58,527 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1832558.0, ans=0.125 2023-10-14 22:32:03,591 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1832558.0, ans=0.125 2023-10-14 22:32:08,984 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.28 vs. limit=15.0 2023-10-14 22:32:10,581 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1832604.6666666667, ans=0.125 2023-10-14 22:32:20,502 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1832651.3333333333, ans=0.025 2023-10-14 22:33:05,130 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1832838.0, ans=0.125 2023-10-14 22:33:21,007 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1832884.6666666667, ans=0.125 2023-10-14 22:33:32,890 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.93 vs. 
limit=22.5 2023-10-14 22:33:40,838 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1832978.0, ans=0.125 2023-10-14 22:33:45,707 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1832978.0, ans=0.125 2023-10-14 22:33:46,416 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.573e+02 1.801e+02 1.923e+02 2.113e+02 2.761e+02, threshold=3.845e+02, percent-clipped=0.0 2023-10-14 22:33:53,470 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1833024.6666666667, ans=0.09899494936611666 2023-10-14 22:33:57,265 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.11 vs. limit=15.0 2023-10-14 22:33:59,945 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1833024.6666666667, ans=0.0 2023-10-14 22:34:15,686 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 22:34:23,336 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 22:34:29,947 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.30 vs. limit=15.0 2023-10-14 22:34:45,943 INFO [train.py:1031] (3/4) Epoch 29, batch 10500, loss[loss=0.1732, simple_loss=0.2744, pruned_loss=0.03602, over 16711.00 frames. ], tot_loss[loss=0.1847, simple_loss=0.2776, pruned_loss=0.04588, over 32624321.76 frames. ], batch size: 81, lr: 1.17e-03, grad_scale: 32.0 2023-10-14 22:34:49,769 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1833258.0, ans=0.0 2023-10-14 22:35:08,254 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1833351.3333333333, ans=0.0 2023-10-14 22:35:17,971 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1833398.0, ans=10.0 2023-10-14 22:35:21,966 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1833398.0, ans=0.1 2023-10-14 22:35:31,279 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.450e+02 1.796e+02 1.987e+02 2.179e+02 2.839e+02, threshold=3.973e+02, percent-clipped=0.0 2023-10-14 22:35:35,050 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.78 vs. limit=6.0 2023-10-14 22:35:56,424 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1833538.0, ans=0.0 2023-10-14 22:36:01,057 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.04 vs. 
limit=15.0 2023-10-14 22:36:12,584 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1833584.6666666667, ans=0.125 2023-10-14 22:36:21,045 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1833631.3333333333, ans=0.125 2023-10-14 22:36:34,104 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1833678.0, ans=0.1 2023-10-14 22:36:39,687 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 22:36:39,746 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1833724.6666666667, ans=0.125 2023-10-14 22:36:48,538 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1833724.6666666667, ans=0.125 2023-10-14 22:37:26,021 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.75 vs. limit=6.0 2023-10-14 22:37:30,825 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.588e+02 1.875e+02 2.010e+02 2.189e+02 3.350e+02, threshold=4.021e+02, percent-clipped=0.0 2023-10-14 22:37:31,271 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1833911.3333333333, ans=0.2 2023-10-14 22:37:35,074 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.66 vs. limit=15.0 2023-10-14 22:37:42,130 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 22:37:42,974 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1833958.0, ans=0.125 2023-10-14 22:37:45,309 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.27 vs. limit=15.0 2023-10-14 22:37:48,875 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.70 vs. limit=10.0 2023-10-14 22:38:07,813 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.26 vs. limit=15.0 2023-10-14 22:38:21,908 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 22:38:26,446 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1834144.6666666667, ans=0.1 2023-10-14 22:38:29,742 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1834144.6666666667, ans=0.125 2023-10-14 22:38:38,744 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.16 vs. 
limit=15.0 2023-10-14 22:39:01,622 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1834284.6666666667, ans=0.125 2023-10-14 22:39:16,849 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.47 vs. limit=22.5 2023-10-14 22:39:25,232 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.567e+02 1.878e+02 1.982e+02 2.239e+02 3.297e+02, threshold=3.964e+02, percent-clipped=0.0 2023-10-14 22:39:34,207 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1834424.6666666667, ans=0.125 2023-10-14 22:39:57,766 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_na.min_abs, batch_count=1834518.0, ans=0.02 2023-10-14 22:39:59,590 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1834518.0, ans=0.125 2023-10-14 22:40:09,865 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1834564.6666666667, ans=0.125 2023-10-14 22:40:10,841 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1834564.6666666667, ans=0.07 2023-10-14 22:40:13,756 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1834564.6666666667, ans=0.1 2023-10-14 22:40:18,827 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1834611.3333333333, ans=0.0 2023-10-14 22:40:38,630 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 22:40:42,525 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1834704.6666666667, ans=0.0 2023-10-14 22:40:46,235 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1834704.6666666667, ans=0.95 2023-10-14 22:40:47,171 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1834704.6666666667, ans=0.1 2023-10-14 22:40:51,737 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1834751.3333333333, ans=0.0 2023-10-14 22:40:56,960 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1834751.3333333333, ans=0.0 2023-10-14 22:41:16,133 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.594e+02 1.981e+02 2.115e+02 2.438e+02 2.977e+02, threshold=4.230e+02, percent-clipped=0.0 2023-10-14 22:41:44,220 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1834984.6666666667, ans=0.125 2023-10-14 22:41:48,628 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1834984.6666666667, ans=0.0 2023-10-14 22:42:04,317 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1835031.3333333333, ans=0.1 2023-10-14 22:42:28,980 INFO [scaling.py:199] (3/4) 
ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1835171.3333333333, ans=0.2 2023-10-14 22:42:39,210 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1835171.3333333333, ans=0.125 2023-10-14 22:42:46,392 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1835218.0, ans=0.125 2023-10-14 22:42:47,547 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1835218.0, ans=0.125 2023-10-14 22:42:57,442 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff2.min_abs, batch_count=1835264.6666666667, ans=0.1 2023-10-14 22:43:10,129 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.497e+02 1.771e+02 1.861e+02 2.051e+02 3.057e+02, threshold=3.722e+02, percent-clipped=0.0 2023-10-14 22:43:23,549 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1835358.0, ans=0.0 2023-10-14 22:43:24,917 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1835404.6666666667, ans=0.0 2023-10-14 22:43:29,777 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1835404.6666666667, ans=0.125 2023-10-14 22:43:42,801 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1835451.3333333333, ans=0.125 2023-10-14 22:43:42,830 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1835451.3333333333, ans=0.1 2023-10-14 22:43:44,029 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1835451.3333333333, ans=0.2 2023-10-14 22:43:53,199 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1835498.0, ans=0.0 2023-10-14 22:44:10,093 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.16 vs. limit=15.0 2023-10-14 22:44:10,691 INFO [train.py:1031] (3/4) Epoch 29, batch 11000, loss[loss=0.1985, simple_loss=0.2941, pruned_loss=0.05139, over 16890.00 frames. ], tot_loss[loss=0.1847, simple_loss=0.2776, pruned_loss=0.0459, over 32637171.29 frames. ], batch size: 165, lr: 1.17e-03, grad_scale: 32.0 2023-10-14 22:44:11,007 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1835591.3333333333, ans=0.05 2023-10-14 22:44:27,003 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.38 vs. 
limit=15.0 2023-10-14 22:44:31,699 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1835684.6666666667, ans=0.2 2023-10-14 22:44:35,445 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1835684.6666666667, ans=0.1 2023-10-14 22:44:41,207 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.91 vs. limit=10.0 2023-10-14 22:44:57,933 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.89 vs. limit=15.0 2023-10-14 22:45:02,315 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.588e+02 1.933e+02 2.132e+02 2.407e+02 3.591e+02, threshold=4.264e+02, percent-clipped=0.0 2023-10-14 22:45:09,990 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.90 vs. limit=10.0 2023-10-14 22:45:14,592 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1835824.6666666667, ans=0.035 2023-10-14 22:45:14,689 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1835824.6666666667, ans=0.1 2023-10-14 22:45:24,891 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1835871.3333333333, ans=0.125 2023-10-14 22:45:33,088 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.63 vs. limit=15.0 2023-10-14 22:45:40,857 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1835918.0, ans=0.025 2023-10-14 22:45:54,323 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1836011.3333333333, ans=0.5 2023-10-14 22:46:03,242 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.50 vs. limit=5.0 2023-10-14 22:46:05,848 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1836011.3333333333, ans=0.125 2023-10-14 22:46:11,941 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1836058.0, ans=0.0 2023-10-14 22:46:15,852 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1836058.0, ans=0.125 2023-10-14 22:46:39,865 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1836151.3333333333, ans=0.0 2023-10-14 22:46:56,144 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=13.27 vs. 
limit=15.0 2023-10-14 22:47:08,045 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.528e+02 1.723e+02 1.894e+02 2.074e+02 3.359e+02, threshold=3.788e+02, percent-clipped=0.0 2023-10-14 22:47:30,607 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1836338.0, ans=0.0 2023-10-14 22:47:45,357 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1836431.3333333333, ans=0.1 2023-10-14 22:48:04,984 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1836478.0, ans=0.1 2023-10-14 22:48:22,498 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1836571.3333333333, ans=0.2 2023-10-14 22:48:26,127 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1836571.3333333333, ans=0.125 2023-10-14 22:48:27,069 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1836618.0, ans=0.2 2023-10-14 22:48:30,871 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1836618.0, ans=0.0 2023-10-14 22:48:31,868 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.84 vs. limit=22.5 2023-10-14 22:48:51,270 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1836711.3333333333, ans=0.0 2023-10-14 22:48:56,708 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.465e+02 1.861e+02 2.064e+02 2.307e+02 3.526e+02, threshold=4.128e+02, percent-clipped=0.0 2023-10-14 22:49:08,510 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1836758.0, ans=0.125 2023-10-14 22:49:33,457 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1836851.3333333333, ans=0.0 2023-10-14 22:49:41,466 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1836898.0, ans=0.025 2023-10-14 22:49:48,609 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1836898.0, ans=0.125 2023-10-14 22:49:48,954 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn1.whiten.whitening_limit, batch_count=1836898.0, ans=22.5 2023-10-14 22:49:49,813 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1836898.0, ans=0.125 2023-10-14 22:50:03,006 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.49 vs. 
limit=10.0 2023-10-14 22:50:06,774 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1836991.3333333333, ans=0.07 2023-10-14 22:50:26,203 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1837084.6666666667, ans=0.125 2023-10-14 22:50:29,516 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.32 vs. limit=15.0 2023-10-14 22:50:31,124 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1837084.6666666667, ans=0.0 2023-10-14 22:50:36,665 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1837131.3333333333, ans=0.0 2023-10-14 22:50:53,731 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.445e+02 1.929e+02 2.185e+02 2.419e+02 3.198e+02, threshold=4.369e+02, percent-clipped=0.0 2023-10-14 22:51:06,078 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1837224.6666666667, ans=0.125 2023-10-14 22:51:25,093 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1837318.0, ans=0.1 2023-10-14 22:51:32,962 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1837364.6666666667, ans=0.0 2023-10-14 22:51:52,368 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1837458.0, ans=0.125 2023-10-14 22:51:57,079 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1837458.0, ans=0.0 2023-10-14 22:52:08,426 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1837504.6666666667, ans=0.1 2023-10-14 22:52:23,460 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1837551.3333333333, ans=0.0 2023-10-14 22:52:28,787 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1837598.0, ans=0.1 2023-10-14 22:52:31,979 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1837598.0, ans=0.125 2023-10-14 22:52:37,613 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1837644.6666666667, ans=0.125 2023-10-14 22:52:37,820 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1837644.6666666667, ans=0.125 2023-10-14 22:52:45,321 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.05 vs. 
limit=15.0 2023-10-14 22:52:45,694 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.592e+02 1.877e+02 2.021e+02 2.337e+02 3.056e+02, threshold=4.043e+02, percent-clipped=0.0 2023-10-14 22:53:06,436 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=1837738.0, ans=0.5 2023-10-14 22:53:39,010 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.40 vs. limit=15.0 2023-10-14 22:53:39,687 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1837878.0, ans=0.125 2023-10-14 22:53:42,762 INFO [train.py:1031] (3/4) Epoch 29, batch 11500, loss[loss=0.1992, simple_loss=0.2968, pruned_loss=0.0508, over 16852.00 frames. ], tot_loss[loss=0.1842, simple_loss=0.2772, pruned_loss=0.04565, over 32673395.00 frames. ], batch size: 87, lr: 1.17e-03, grad_scale: 16.0 2023-10-14 22:53:45,104 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.79 vs. limit=15.0 2023-10-14 22:53:58,520 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1837971.3333333333, ans=0.1 2023-10-14 22:54:34,618 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.483e+02 1.861e+02 2.058e+02 2.262e+02 3.439e+02, threshold=4.116e+02, percent-clipped=0.0 2023-10-14 22:54:36,465 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1838111.3333333333, ans=0.0 2023-10-14 22:54:42,325 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=15.47 vs. limit=22.5 2023-10-14 22:54:44,213 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1838158.0, ans=0.2 2023-10-14 22:54:51,217 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1838204.6666666667, ans=0.5 2023-10-14 22:55:16,412 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1838298.0, ans=0.125 2023-10-14 22:55:18,164 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1838298.0, ans=0.2 2023-10-14 22:55:20,144 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.24 vs. limit=12.0 2023-10-14 22:55:43,588 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.24 vs. 
limit=12.0 2023-10-14 22:55:45,648 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1838391.3333333333, ans=0.125 2023-10-14 22:56:30,731 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.581e+02 1.798e+02 1.920e+02 2.174e+02 3.078e+02, threshold=3.840e+02, percent-clipped=0.0 2023-10-14 22:56:30,924 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1838578.0, ans=0.1 2023-10-14 22:56:35,097 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.90 vs. limit=15.0 2023-10-14 22:56:42,707 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1838624.6666666667, ans=0.0 2023-10-14 22:56:42,770 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1838624.6666666667, ans=0.1 2023-10-14 22:56:44,775 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1838671.3333333333, ans=0.0 2023-10-14 22:57:00,646 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.74 vs. limit=15.0 2023-10-14 22:57:07,219 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1838764.6666666667, ans=0.2 2023-10-14 22:57:23,745 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=1838811.3333333333, ans=10.0 2023-10-14 22:57:23,758 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1838811.3333333333, ans=0.125 2023-10-14 22:57:33,739 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1838858.0, ans=0.125 2023-10-14 22:57:34,632 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1838858.0, ans=0.125 2023-10-14 22:57:43,393 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1838904.6666666667, ans=0.125 2023-10-14 22:57:46,404 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1838951.3333333333, ans=0.125 2023-10-14 22:57:51,378 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1838951.3333333333, ans=0.0 2023-10-14 22:58:02,017 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1838998.0, ans=0.125 2023-10-14 22:58:14,556 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.97 vs. 
limit=6.0 2023-10-14 22:58:18,457 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.520e+02 1.843e+02 2.032e+02 2.316e+02 3.006e+02, threshold=4.064e+02, percent-clipped=0.0 2023-10-14 22:58:29,919 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1839091.3333333333, ans=0.125 2023-10-14 22:58:38,522 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1839138.0, ans=0.125 2023-10-14 22:58:54,941 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1839184.6666666667, ans=0.125 2023-10-14 22:58:56,618 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.80 vs. limit=6.0 2023-10-14 22:59:00,405 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.03 vs. limit=22.5 2023-10-14 22:59:23,412 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1839278.0, ans=0.125 2023-10-14 22:59:24,337 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1839278.0, ans=0.1 2023-10-14 22:59:32,099 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1839324.6666666667, ans=0.0 2023-10-14 23:00:08,780 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1839464.6666666667, ans=0.125 2023-10-14 23:00:10,555 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1839464.6666666667, ans=0.1 2023-10-14 23:00:22,174 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1839511.3333333333, ans=0.04949747468305833 2023-10-14 23:00:25,485 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.452e+02 1.823e+02 1.942e+02 2.123e+02 2.607e+02, threshold=3.883e+02, percent-clipped=0.0 2023-10-14 23:00:58,197 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1839651.3333333333, ans=0.1 2023-10-14 23:01:01,817 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1839698.0, ans=0.1 2023-10-14 23:01:03,694 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1839698.0, ans=0.125 2023-10-14 23:01:09,412 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1839698.0, ans=0.125 2023-10-14 23:01:15,632 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1839744.6666666667, ans=0.0 2023-10-14 23:01:43,297 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.87 vs. 
limit=6.0 2023-10-14 23:01:45,164 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=10.55 vs. limit=22.5 2023-10-14 23:01:49,544 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.02 vs. limit=10.0 2023-10-14 23:01:58,302 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1839931.3333333333, ans=0.0 2023-10-14 23:02:06,782 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1839931.3333333333, ans=0.125 2023-10-14 23:02:18,074 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.467e+02 1.815e+02 1.986e+02 2.186e+02 3.302e+02, threshold=3.972e+02, percent-clipped=0.0 2023-10-14 23:02:38,446 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.48 vs. limit=10.0 2023-10-14 23:02:44,149 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1840118.0, ans=0.125 2023-10-14 23:02:50,815 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1840118.0, ans=0.125 2023-10-14 23:02:54,268 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1840164.6666666667, ans=0.125 2023-10-14 23:03:01,809 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.63 vs. limit=15.0 2023-10-14 23:03:01,813 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.64 vs. limit=15.0 2023-10-14 23:03:14,257 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.54 vs. limit=10.0 2023-10-14 23:03:14,466 INFO [train.py:1031] (3/4) Epoch 29, batch 12000, loss[loss=0.2198, simple_loss=0.2909, pruned_loss=0.07429, over 15751.00 frames. ], tot_loss[loss=0.184, simple_loss=0.2773, pruned_loss=0.04537, over 32747038.16 frames. ], batch size: 350, lr: 1.17e-03, grad_scale: 32.0 2023-10-14 23:03:14,730 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1840258.0, ans=0.125 2023-10-14 23:03:18,249 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1840258.0, ans=0.0 2023-10-14 23:03:36,368 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1840304.6666666667, ans=0.125 2023-10-14 23:03:49,385 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.58 vs. 
limit=10.0 2023-10-14 23:04:07,216 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.701e+02 1.935e+02 2.157e+02 2.475e+02 3.360e+02, threshold=4.313e+02, percent-clipped=0.0 2023-10-14 23:04:07,886 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1840444.6666666667, ans=0.1 2023-10-14 23:04:43,263 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1840631.3333333333, ans=0.2 2023-10-14 23:04:44,859 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1840631.3333333333, ans=0.2 2023-10-14 23:05:06,238 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1840724.6666666667, ans=0.1 2023-10-14 23:05:15,693 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.33 vs. limit=15.0 2023-10-14 23:05:28,401 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1840818.0, ans=0.0 2023-10-14 23:05:34,994 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1840818.0, ans=0.09899494936611666 2023-10-14 23:05:55,421 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.481e+02 1.765e+02 1.960e+02 2.105e+02 3.597e+02, threshold=3.920e+02, percent-clipped=0.0 2023-10-14 23:06:06,158 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 23:06:16,715 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1841051.3333333333, ans=0.125 2023-10-14 23:06:17,715 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1841051.3333333333, ans=0.09899494936611666 2023-10-14 23:06:30,981 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1841098.0, ans=0.125 2023-10-14 23:06:37,301 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.15 vs. 
limit=15.0 2023-10-14 23:07:32,457 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1841378.0, ans=0.125 2023-10-14 23:07:35,356 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1841378.0, ans=0.05 2023-10-14 23:07:41,691 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.615e+02 1.871e+02 2.032e+02 2.252e+02 4.868e+02, threshold=4.064e+02, percent-clipped=1.0 2023-10-14 23:07:45,825 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 23:07:49,154 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=1841424.6666666667, ans=0.05 2023-10-14 23:08:04,240 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1841518.0, ans=0.2 2023-10-14 23:08:14,851 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1841564.6666666667, ans=0.125 2023-10-14 23:08:21,174 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.18 vs. limit=10.0 2023-10-14 23:08:36,515 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 23:08:42,312 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=9.65 vs. limit=22.5 2023-10-14 23:08:43,027 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1841658.0, ans=0.1 2023-10-14 23:09:05,426 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1841751.3333333333, ans=0.09899494936611666 2023-10-14 23:09:05,770 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.98 vs. limit=22.5 2023-10-14 23:09:14,523 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1841798.0, ans=0.0 2023-10-14 23:09:31,925 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.559e+02 1.877e+02 2.083e+02 2.351e+02 3.267e+02, threshold=4.166e+02, percent-clipped=0.0 2023-10-14 23:09:32,362 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1841844.6666666667, ans=0.0 2023-10-14 23:10:01,612 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.86 vs. 
limit=6.0 2023-10-14 23:10:07,940 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1841984.6666666667, ans=0.07 2023-10-14 23:10:13,410 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1842031.3333333333, ans=0.125 2023-10-14 23:10:19,461 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1842031.3333333333, ans=0.5 2023-10-14 23:10:34,792 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1842124.6666666667, ans=0.1 2023-10-14 23:10:36,598 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1842124.6666666667, ans=0.0 2023-10-14 23:10:55,324 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.91 vs. limit=6.0 2023-10-14 23:11:12,077 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.68 vs. limit=15.0 2023-10-14 23:11:25,805 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.603e+02 1.953e+02 2.145e+02 2.374e+02 3.106e+02, threshold=4.291e+02, percent-clipped=0.0 2023-10-14 23:11:36,258 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.76 vs. limit=15.0 2023-10-14 23:11:48,825 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1842404.6666666667, ans=0.1 2023-10-14 23:11:51,247 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1842451.3333333333, ans=0.2 2023-10-14 23:11:52,174 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1842451.3333333333, ans=0.0 2023-10-14 23:11:52,285 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1842451.3333333333, ans=0.125 2023-10-14 23:12:23,805 INFO [train.py:1031] (3/4) Epoch 29, batch 12500, loss[loss=0.1693, simple_loss=0.2473, pruned_loss=0.04567, over 12725.00 frames. ], tot_loss[loss=0.1837, simple_loss=0.2767, pruned_loss=0.04539, over 32710233.12 frames. 
], batch size: 440, lr: 1.17e-03, grad_scale: 32.0 2023-10-14 23:12:30,215 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1842591.3333333333, ans=0.2 2023-10-14 23:12:31,213 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1842591.3333333333, ans=0.125 2023-10-14 23:12:41,454 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1842638.0, ans=0.0 2023-10-14 23:12:46,226 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1842684.6666666667, ans=0.125 2023-10-14 23:12:47,907 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1842684.6666666667, ans=0.1 2023-10-14 23:13:02,046 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1842731.3333333333, ans=0.125 2023-10-14 23:13:14,126 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.575e+02 1.908e+02 2.054e+02 2.262e+02 3.234e+02, threshold=4.108e+02, percent-clipped=0.0 2023-10-14 23:13:15,297 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1842824.6666666667, ans=0.2 2023-10-14 23:13:18,537 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1842824.6666666667, ans=0.0 2023-10-14 23:13:19,393 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1842824.6666666667, ans=0.1 2023-10-14 23:13:24,450 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1842824.6666666667, ans=0.125 2023-10-14 23:13:27,411 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.82 vs. limit=6.0 2023-10-14 23:13:35,188 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1842918.0, ans=0.0 2023-10-14 23:13:35,515 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.73 vs. limit=10.0 2023-10-14 23:13:37,769 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1842918.0, ans=0.125 2023-10-14 23:13:54,695 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.52 vs. limit=15.0 2023-10-14 23:13:58,240 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1843011.3333333333, ans=0.2 2023-10-14 23:14:11,489 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1843058.0, ans=0.125 2023-10-14 23:14:57,786 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.62 vs. 
limit=10.0 2023-10-14 23:14:58,284 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1843244.6666666667, ans=0.0 2023-10-14 23:15:02,623 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.497e+02 1.816e+02 1.996e+02 2.202e+02 2.932e+02, threshold=3.992e+02, percent-clipped=0.0 2023-10-14 23:15:21,413 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1843338.0, ans=0.0 2023-10-14 23:15:28,339 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1843384.6666666667, ans=0.0 2023-10-14 23:15:41,800 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.29 vs. limit=6.0 2023-10-14 23:15:42,268 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1843431.3333333333, ans=0.0 2023-10-14 23:15:49,197 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1843478.0, ans=0.1 2023-10-14 23:15:57,464 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1843478.0, ans=0.0 2023-10-14 23:15:58,232 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1843478.0, ans=0.125 2023-10-14 23:15:59,201 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1843524.6666666667, ans=0.07 2023-10-14 23:16:11,335 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1843571.3333333333, ans=0.125 2023-10-14 23:16:19,181 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1843618.0, ans=0.125 2023-10-14 23:16:20,179 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1843618.0, ans=0.1 2023-10-14 23:16:50,328 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.537e+02 1.786e+02 1.942e+02 2.228e+02 3.260e+02, threshold=3.884e+02, percent-clipped=0.0 2023-10-14 23:16:52,676 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.94 vs. limit=6.0 2023-10-14 23:16:55,346 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=1843758.0, ans=0.05 2023-10-14 23:17:20,449 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1843851.3333333333, ans=0.0 2023-10-14 23:18:01,347 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1844038.0, ans=0.0 2023-10-14 23:18:02,874 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1844038.0, ans=0.125 2023-10-14 23:18:05,721 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=14.71 vs. 
limit=22.5 2023-10-14 23:18:34,259 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.00 vs. limit=15.0 2023-10-14 23:18:36,898 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.657e+02 1.887e+02 2.050e+02 2.182e+02 3.045e+02, threshold=4.099e+02, percent-clipped=0.0 2023-10-14 23:18:43,003 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.21 vs. limit=15.0 2023-10-14 23:19:16,892 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1844364.6666666667, ans=0.07 2023-10-14 23:19:49,774 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1844504.6666666667, ans=0.125 2023-10-14 23:19:55,880 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1844551.3333333333, ans=0.125 2023-10-14 23:19:58,949 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.34 vs. limit=15.0 2023-10-14 23:20:06,788 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1844598.0, ans=0.2 2023-10-14 23:20:13,344 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1844598.0, ans=0.09899494936611666 2023-10-14 23:20:15,192 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1844598.0, ans=0.0 2023-10-14 23:20:17,251 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.31 vs. limit=15.0 2023-10-14 23:20:27,791 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.555e+02 1.834e+02 1.966e+02 2.169e+02 2.783e+02, threshold=3.933e+02, percent-clipped=0.0 2023-10-14 23:20:49,126 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.38 vs. limit=15.0 2023-10-14 23:21:20,159 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1844878.0, ans=0.1 2023-10-14 23:21:21,798 INFO [train.py:1031] (3/4) Epoch 29, batch 13000, loss[loss=0.1882, simple_loss=0.2808, pruned_loss=0.0478, over 16608.00 frames. ], tot_loss[loss=0.1842, simple_loss=0.2773, pruned_loss=0.04559, over 32719584.92 frames. ], batch size: 219, lr: 1.17e-03, grad_scale: 32.0 2023-10-14 23:21:29,040 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.19 vs. limit=6.0 2023-10-14 23:21:29,076 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.42 vs. limit=15.0 2023-10-14 23:21:32,596 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1844971.3333333333, ans=0.0 2023-10-14 23:21:55,439 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.09 vs. 
limit=22.5 2023-10-14 23:22:00,209 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1845064.6666666667, ans=0.1 2023-10-14 23:22:01,194 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1845064.6666666667, ans=0.04949747468305833 2023-10-14 23:22:07,307 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1845064.6666666667, ans=0.125 2023-10-14 23:22:08,339 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1845064.6666666667, ans=0.125 2023-10-14 23:22:08,715 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.79 vs. limit=15.0 2023-10-14 23:22:21,980 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.598e+02 1.854e+02 2.027e+02 2.210e+02 2.850e+02, threshold=4.055e+02, percent-clipped=0.0 2023-10-14 23:22:23,651 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.98 vs. limit=15.0 2023-10-14 23:22:24,367 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1845158.0, ans=0.0 2023-10-14 23:22:29,073 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1845158.0, ans=0.0 2023-10-14 23:22:44,432 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1845204.6666666667, ans=0.0 2023-10-14 23:22:55,510 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1845251.3333333333, ans=0.1 2023-10-14 23:23:00,964 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 23:23:02,762 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1845298.0, ans=0.125 2023-10-14 23:23:12,965 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1845344.6666666667, ans=0.0 2023-10-14 23:23:19,231 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.61 vs. 
limit=15.0 2023-10-14 23:23:34,622 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1845438.0, ans=0.1 2023-10-14 23:24:09,507 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1845578.0, ans=0.125 2023-10-14 23:24:12,024 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1845578.0, ans=0.125 2023-10-14 23:24:15,905 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.489e+02 1.796e+02 1.922e+02 2.112e+02 3.031e+02, threshold=3.844e+02, percent-clipped=0.0 2023-10-14 23:24:17,768 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 23:24:44,720 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1845718.0, ans=0.0 2023-10-14 23:25:14,355 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=1845811.3333333333, ans=0.07 2023-10-14 23:25:38,044 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1845951.3333333333, ans=0.0 2023-10-14 23:25:47,205 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1845951.3333333333, ans=0.125 2023-10-14 23:25:54,170 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1845998.0, ans=0.125 2023-10-14 23:26:02,342 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.30 vs. limit=22.5 2023-10-14 23:26:08,936 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.461e+02 1.773e+02 1.943e+02 2.081e+02 2.778e+02, threshold=3.885e+02, percent-clipped=0.0 2023-10-14 23:26:17,450 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.81 vs. 
limit=10.0 2023-10-14 23:26:21,264 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1846091.3333333333, ans=0.0 2023-10-14 23:26:24,205 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1846138.0, ans=0.2 2023-10-14 23:26:25,912 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1846138.0, ans=0.1 2023-10-14 23:26:34,297 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1846184.6666666667, ans=0.0 2023-10-14 23:26:34,329 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1846184.6666666667, ans=0.0 2023-10-14 23:26:34,343 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1846184.6666666667, ans=0.125 2023-10-14 23:26:43,541 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=1846184.6666666667, ans=0.5 2023-10-14 23:27:01,109 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1846278.0, ans=0.025 2023-10-14 23:27:05,538 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 23:27:17,010 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1846324.6666666667, ans=0.1 2023-10-14 23:27:20,809 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 23:27:25,819 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.23 vs. limit=6.0 2023-10-14 23:27:32,280 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.26 vs. 
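Nearly every scaling.py:199 entry prints a ScheduledFloat: a regularization hyperparameter (dropout_p, skip_rate, prob, min_positive, ...) whose current value ans depends on batch_count instead of being a constant. A piecewise-linear interpolation over (batch_count, value) breakpoints captures the idea; the breakpoints below are placeholders, since the log only shows the resulting values:

def scheduled_float(batch_count, schedule=((0.0, 0.3), (20000.0, 0.1))):
    """Piecewise-linear value of a scheduled hyperparameter
    (sketch of the ScheduledFloat idea; breakpoints are illustrative)."""
    x0, y0 = schedule[0]
    if batch_count <= x0:
        return y0
    for x1, y1 in schedule[1:]:
        if batch_count <= x1:
            t = (batch_count - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)  # interpolate between breakpoints
        x0, y0 = x1, y1
    return y0  # past the last breakpoint: hold the final value
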
limit=22.5 2023-10-14 23:27:34,401 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 23:27:35,388 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1846418.0, ans=0.0 2023-10-14 23:27:56,162 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1846511.3333333333, ans=0.0 2023-10-14 23:27:57,602 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1846511.3333333333, ans=0.125 2023-10-14 23:27:58,961 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.602e+02 1.875e+02 2.033e+02 2.193e+02 3.109e+02, threshold=4.065e+02, percent-clipped=0.0 2023-10-14 23:28:16,466 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1846604.6666666667, ans=0.125 2023-10-14 23:28:18,948 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1846604.6666666667, ans=0.125 2023-10-14 23:28:26,475 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1846651.3333333333, ans=0.0 2023-10-14 23:28:45,715 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1846744.6666666667, ans=0.0 2023-10-14 23:28:45,790 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1846744.6666666667, ans=0.0 2023-10-14 23:29:15,786 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1846838.0, ans=0.2 2023-10-14 23:29:17,938 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.49 vs. limit=22.5 2023-10-14 23:29:18,915 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.45 vs. limit=15.0 2023-10-14 23:29:21,812 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.28 vs. limit=15.0 2023-10-14 23:29:40,258 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1846931.3333333333, ans=0.125 2023-10-14 23:29:52,863 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.447e+02 1.799e+02 1.994e+02 2.284e+02 3.347e+02, threshold=3.989e+02, percent-clipped=0.0 2023-10-14 23:30:05,658 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1847071.3333333333, ans=0.1 2023-10-14 23:30:21,834 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1847118.0, ans=10.0 2023-10-14 23:30:23,737 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1847118.0, ans=0.0 2023-10-14 23:30:27,657 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.59 vs. 
limit=15.0 2023-10-14 23:30:37,357 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1847211.3333333333, ans=0.1 2023-10-14 23:30:39,894 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 23:30:49,498 INFO [train.py:1031] (3/4) Epoch 29, batch 13500, loss[loss=0.2066, simple_loss=0.2815, pruned_loss=0.06585, over 15647.00 frames. ], tot_loss[loss=0.184, simple_loss=0.2768, pruned_loss=0.0456, over 32719314.96 frames. ], batch size: 350, lr: 1.17e-03, grad_scale: 16.0 2023-10-14 23:30:52,900 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1847258.0, ans=0.1 2023-10-14 23:30:55,452 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1847258.0, ans=0.0 2023-10-14 23:31:09,016 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1847304.6666666667, ans=0.125 2023-10-14 23:31:28,839 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1847398.0, ans=0.1 2023-10-14 23:31:45,024 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.643e+02 1.865e+02 2.062e+02 2.247e+02 3.122e+02, threshold=4.124e+02, percent-clipped=0.0 2023-10-14 23:31:52,913 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.39 vs. limit=12.0 2023-10-14 23:32:10,988 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.29 vs. limit=12.0 2023-10-14 23:32:19,750 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1847631.3333333333, ans=0.125 2023-10-14 23:32:47,167 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.32 vs. limit=15.0 2023-10-14 23:32:53,827 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff2.min_abs, batch_count=1847771.3333333333, ans=0.1 2023-10-14 23:33:00,969 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1847818.0, ans=0.1 2023-10-14 23:33:06,176 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1847818.0, ans=0.0 2023-10-14 23:33:14,680 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.52 vs. limit=15.0 2023-10-14 23:33:25,875 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.578e+02 1.908e+02 2.134e+02 2.335e+02 3.342e+02, threshold=4.268e+02, percent-clipped=0.0 2023-10-14 23:34:01,800 INFO [train.py:1031] (3/4) Epoch 30, batch 0, loss[loss=0.1699, simple_loss=0.2663, pruned_loss=0.03681, over 16879.00 frames. ], tot_loss[loss=0.1699, simple_loss=0.2663, pruned_loss=0.03681, over 16879.00 frames. 
], batch size: 87, lr: 1.15e-03, grad_scale: 32.0 2023-10-14 23:34:01,801 INFO [train.py:1054] (3/4) Computing validation loss 2023-10-14 23:34:04,893 INFO [zipformer.py:1853] (3/4) name=encoder.encoders.0.layers.1.self_attn_weights, attn_weights_entropy = tensor([5.6668, 5.0666, 5.3935, 4.9297], device='cuda:3') 2023-10-14 23:34:09,359 INFO [train.py:1063] (3/4) Epoch 30, validation: loss=0.2121, simple_loss=0.2987, pruned_loss=0.06271, over 1020973.00 frames. 2023-10-14 23:34:09,360 INFO [train.py:1064] (3/4) Maximum memory allocated so far is 16953MB 2023-10-14 23:34:15,505 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1847981.3333333333, ans=0.2 2023-10-14 23:34:28,442 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1848028.0, ans=0.125 2023-10-14 23:34:33,135 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1848074.6666666667, ans=0.125 2023-10-14 23:34:49,457 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1848121.3333333333, ans=0.0 2023-10-14 23:35:02,541 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 23:35:08,093 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1848214.6666666667, ans=0.1 2023-10-14 23:35:18,162 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1848261.3333333333, ans=0.125 2023-10-14 23:35:31,658 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1848308.0, ans=0.0 2023-10-14 23:35:34,130 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1848308.0, ans=0.2 2023-10-14 23:35:41,237 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1848354.6666666667, ans=0.125 2023-10-14 23:35:56,200 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.576e+02 1.802e+02 1.972e+02 2.221e+02 2.697e+02, threshold=3.944e+02, percent-clipped=0.0 2023-10-14 23:35:58,643 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.34 vs. limit=15.0 2023-10-14 23:36:09,948 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2.whitening_limit, batch_count=1848448.0, ans=15.0 2023-10-14 23:36:20,216 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.76 vs. 
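The loss components logged by train.py:1031 are internally consistent with the total being half the simple (trivial-joiner) loss plus the pruned transducer loss: for the Epoch 30, batch 0 entry above, 0.5 * 0.2663 + 0.03681 = 0.1700, matching the reported loss=0.1699 up to rounding, and the Epoch 29, batch 13500 entry checks out the same way (0.5 * 0.2815 + 0.06585 = 0.2066). A one-line verification under that assumed weighting:

# Assumed combination, consistent with every loss[...] entry in this stretch:
#   loss = 0.5 * simple_loss + pruned_loss
simple_loss, pruned_loss = 0.2663, 0.03681   # Epoch 30, batch 0
print(0.5 * simple_loss + pruned_loss)       # 0.16996 ~= logged loss=0.1699
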
limit=22.5 2023-10-14 23:36:47,806 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1848634.6666666667, ans=0.125 2023-10-14 23:37:30,890 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1848821.3333333333, ans=0.125 2023-10-14 23:37:45,185 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.418e+02 1.856e+02 1.984e+02 2.208e+02 2.663e+02, threshold=3.968e+02, percent-clipped=0.0 2023-10-14 23:37:48,217 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1848868.0, ans=0.1 2023-10-14 23:37:48,286 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1848868.0, ans=0.125 2023-10-14 23:38:02,653 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.51 vs. limit=12.0 2023-10-14 23:38:05,844 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1848961.3333333333, ans=0.125 2023-10-14 23:38:08,483 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.62 vs. limit=15.0 2023-10-14 23:38:47,381 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1849101.3333333333, ans=0.1 2023-10-14 23:38:49,119 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1849148.0, ans=0.0 2023-10-14 23:39:10,188 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=14.46 vs. limit=22.5 2023-10-14 23:39:12,649 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1849241.3333333333, ans=0.1 2023-10-14 23:39:14,496 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.30 vs. 
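The zipformer.py:1853 diagnostic at 23:34:04 above prints attn_weights_entropy with one value per attention head; values around 5 nats correspond to attention mass spread over on the order of e^5 ~ 150 key positions. A sketch of how such a per-head entropy could be computed from softmaxed attention weights (the exact reduction behind the logged diagnostic is an assumption):

import torch

def attn_weights_entropy(attn_weights: torch.Tensor) -> torch.Tensor:
    """attn_weights: (num_heads, num_queries, num_keys), rows summing to 1.
    Returns the mean entropy per head in nats (sketch; the reduction over
    queries is assumed, not taken from zipformer.py)."""
    eps = 1.0e-20
    ent = -(attn_weights * (attn_weights + eps).log()).sum(dim=-1)
    return ent.mean(dim=-1)  # one value per head, e.g. tensor([5.67, ...])
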
limit=15.0 2023-10-14 23:39:28,745 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 23:39:38,483 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.569e+02 1.861e+02 2.020e+02 2.185e+02 2.894e+02, threshold=4.041e+02, percent-clipped=0.0 2023-10-14 23:39:52,400 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1849381.3333333333, ans=0.2 2023-10-14 23:39:54,063 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1849428.0, ans=0.125 2023-10-14 23:39:56,247 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1849428.0, ans=0.125 2023-10-14 23:40:08,327 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1849474.6666666667, ans=0.125 2023-10-14 23:40:12,724 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1849474.6666666667, ans=0.0 2023-10-14 23:40:29,859 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1849568.0, ans=0.2 2023-10-14 23:40:33,161 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1849568.0, ans=0.125 2023-10-14 23:40:36,593 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1849614.6666666667, ans=0.0 2023-10-14 23:40:52,408 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1849661.3333333333, ans=0.125 2023-10-14 23:41:14,100 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1849754.6666666667, ans=0.0 2023-10-14 23:41:20,298 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.26 vs. limit=12.0 2023-10-14 23:41:24,001 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.574e+02 1.893e+02 2.014e+02 2.285e+02 3.360e+02, threshold=4.028e+02, percent-clipped=0.0 2023-10-14 23:41:26,652 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.81 vs. 
limit=15.0 2023-10-14 23:41:31,362 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1849848.0, ans=0.0 2023-10-14 23:42:11,370 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1849988.0, ans=0.0 2023-10-14 23:42:41,919 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1850128.0, ans=0.125 2023-10-14 23:42:42,735 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1850128.0, ans=0.1 2023-10-14 23:42:45,583 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1850128.0, ans=0.0 2023-10-14 23:43:17,972 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.555e+02 1.887e+02 2.049e+02 2.316e+02 3.250e+02, threshold=4.097e+02, percent-clipped=0.0 2023-10-14 23:43:23,063 INFO [train.py:1031] (3/4) Epoch 30, batch 500, loss[loss=0.2042, simple_loss=0.2877, pruned_loss=0.06036, over 15615.00 frames. ], tot_loss[loss=0.1852, simple_loss=0.2782, pruned_loss=0.04609, over 7317528.11 frames. ], batch size: 35, lr: 1.15e-03, grad_scale: 16.0 2023-10-14 23:43:43,985 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.54 vs. limit=15.0 2023-10-14 23:43:45,155 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1850408.0, ans=0.1 2023-10-14 23:43:48,306 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.18 vs. limit=5.0 2023-10-14 23:44:01,892 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.62 vs. limit=10.0 2023-10-14 23:44:36,564 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1850594.6666666667, ans=0.125 2023-10-14 23:44:44,195 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.34 vs. 
limit=15.0 2023-10-14 23:44:44,755 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1850641.3333333333, ans=0.1 2023-10-14 23:45:00,400 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1850688.0, ans=0.125 2023-10-14 23:45:06,666 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1850734.6666666667, ans=0.0 2023-10-14 23:45:07,701 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1850734.6666666667, ans=0.0 2023-10-14 23:45:08,312 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.614e+02 1.887e+02 2.080e+02 2.271e+02 2.941e+02, threshold=4.160e+02, percent-clipped=0.0 2023-10-14 23:45:09,667 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1850734.6666666667, ans=0.1 2023-10-14 23:45:13,824 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1850781.3333333333, ans=0.0 2023-10-14 23:45:15,691 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.07 vs. limit=15.0 2023-10-14 23:45:22,332 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1850781.3333333333, ans=0.125 2023-10-14 23:45:29,217 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1850828.0, ans=0.0 2023-10-14 23:45:37,695 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.28 vs. 
limit=22.5 2023-10-14 23:45:49,383 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1850921.3333333333, ans=0.04949747468305833 2023-10-14 23:45:55,085 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1850968.0, ans=0.1 2023-10-14 23:46:03,122 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1850968.0, ans=0.015 2023-10-14 23:46:27,697 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1851061.3333333333, ans=0.0 2023-10-14 23:46:42,215 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1851154.6666666667, ans=0.05 2023-10-14 23:46:44,613 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1851154.6666666667, ans=0.0 2023-10-14 23:46:44,770 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1851154.6666666667, ans=0.0 2023-10-14 23:46:48,970 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1851201.3333333333, ans=0.1 2023-10-14 23:46:53,447 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1851201.3333333333, ans=0.1 2023-10-14 23:46:56,547 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.595e+02 1.870e+02 2.052e+02 2.234e+02 3.246e+02, threshold=4.104e+02, percent-clipped=0.0 2023-10-14 23:47:01,481 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1851248.0, ans=0.2 2023-10-14 23:47:35,248 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1851388.0, ans=0.0 2023-10-14 23:47:44,104 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1851434.6666666667, ans=0.125 2023-10-14 23:48:00,609 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1851481.3333333333, ans=0.125 2023-10-14 23:48:05,765 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1851481.3333333333, ans=0.1 2023-10-14 23:48:11,704 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1851528.0, ans=0.125 2023-10-14 23:48:12,488 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=1851528.0, ans=0.5 2023-10-14 23:48:12,595 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1851528.0, ans=0.0 2023-10-14 23:48:15,405 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1851528.0, ans=0.2 2023-10-14 23:48:20,748 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1851574.6666666667, ans=0.0 2023-10-14 23:48:46,251 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.503e+02 1.881e+02 2.075e+02 2.337e+02 2.891e+02, 
threshold=4.151e+02, percent-clipped=0.0 2023-10-14 23:48:51,124 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.67 vs. limit=15.0 2023-10-14 23:48:57,363 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.56 vs. limit=15.0 2023-10-14 23:49:15,472 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.51 vs. limit=22.5 2023-10-14 23:49:18,106 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.95 vs. limit=12.0 2023-10-14 23:50:00,096 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1851948.0, ans=0.125 2023-10-14 23:50:09,680 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=1851994.6666666667, ans=10.0 2023-10-14 23:50:21,901 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1852041.3333333333, ans=0.125 2023-10-14 23:50:33,151 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=1852088.0, ans=0.07 2023-10-14 23:50:46,601 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.601e+02 1.886e+02 2.080e+02 2.343e+02 2.900e+02, threshold=4.161e+02, percent-clipped=0.0 2023-10-14 23:50:49,183 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1852134.6666666667, ans=0.1 2023-10-14 23:50:50,998 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1852181.3333333333, ans=0.2 2023-10-14 23:50:56,382 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1852181.3333333333, ans=0.2 2023-10-14 23:51:04,477 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1852228.0, ans=0.09899494936611666 2023-10-14 23:51:08,983 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1852228.0, ans=0.125 2023-10-14 23:51:10,552 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1852228.0, ans=0.125 2023-10-14 23:51:15,934 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1852274.6666666667, ans=0.125 2023-10-14 23:51:15,957 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1852274.6666666667, ans=0.1 2023-10-14 23:51:27,539 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.12 vs. limit=22.5 2023-10-14 23:51:49,573 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.98 vs. 
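Each Whitening entry compares a per-module metric against a limit (the limit is itself scheduled; see the whitening_limit entries elsewhere in the log, presumably with a penalty applied above it). A natural metric with the logged behaviour equals 1.0 when the module's output covariance is proportional to the identity and grows with the spread of its eigenvalues, so a value like metric=18.51 in the self_attn1.whiten entry above indicates activations that are far from white while still under its limit of 22.5. One standard formulation with that property, assumed here rather than taken from scaling.py:

import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    """x: (num_frames, num_channels). Per group, computes
    d * trace(C @ C) / trace(C)**2 for the covariance C, which is >= 1
    and equals 1 exactly when C is a multiple of the identity (sketch)."""
    n, d = x.shape
    gsize = d // num_groups
    metrics = []
    for g in range(num_groups):
        xg = x[:, g * gsize:(g + 1) * gsize]
        xg = xg - xg.mean(dim=0, keepdim=True)   # zero-mean per channel
        cov = (xg.t() @ xg) / n                  # (gsize, gsize) covariance
        metrics.append(gsize * (cov * cov).sum() / cov.trace() ** 2)
    return float(torch.stack(metrics).mean())
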
limit=15.0 2023-10-14 23:52:11,193 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1852508.0, ans=0.0 2023-10-14 23:52:13,260 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.76 vs. limit=10.0 2023-10-14 23:52:23,746 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.77 vs. limit=15.0 2023-10-14 23:52:41,836 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.547e+02 1.884e+02 2.035e+02 2.238e+02 3.170e+02, threshold=4.071e+02, percent-clipped=0.0 2023-10-14 23:52:45,209 INFO [train.py:1031] (3/4) Epoch 30, batch 1000, loss[loss=0.1763, simple_loss=0.2733, pruned_loss=0.03966, over 16364.00 frames. ], tot_loss[loss=0.1854, simple_loss=0.2781, pruned_loss=0.04635, over 12924776.47 frames. ], batch size: 50, lr: 1.15e-03, grad_scale: 16.0 2023-10-14 23:52:51,810 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.08 vs. limit=10.0 2023-10-14 23:52:52,270 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1852648.0, ans=0.0 2023-10-14 23:52:57,897 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1852694.6666666667, ans=0.1 2023-10-14 23:53:20,436 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1852788.0, ans=0.0 2023-10-14 23:53:34,092 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.64 vs. limit=15.0 2023-10-14 23:53:58,686 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1852974.6666666667, ans=0.1 2023-10-14 23:54:26,190 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.44 vs. limit=6.0 2023-10-14 23:54:27,445 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.545e+02 1.829e+02 2.006e+02 2.205e+02 3.000e+02, threshold=4.011e+02, percent-clipped=0.0 2023-10-14 23:54:34,636 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1853114.6666666667, ans=0.125 2023-10-14 23:54:43,685 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1853161.3333333333, ans=0.0 2023-10-14 23:54:50,059 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.32 vs. 
limit=10.0 2023-10-14 23:54:53,992 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1853161.3333333333, ans=0.09899494936611666 2023-10-14 23:55:10,714 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1853254.6666666667, ans=0.04949747468305833 2023-10-14 23:55:37,067 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1853348.0, ans=0.0 2023-10-14 23:55:48,741 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1853394.6666666667, ans=0.0 2023-10-14 23:55:56,395 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1853441.3333333333, ans=0.125 2023-10-14 23:56:12,177 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1853488.0, ans=0.125 2023-10-14 23:56:14,465 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1853488.0, ans=0.2 2023-10-14 23:56:26,833 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1853534.6666666667, ans=0.1 2023-10-14 23:56:31,069 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.478e+02 1.802e+02 1.947e+02 2.164e+02 4.554e+02, threshold=3.894e+02, percent-clipped=1.0 2023-10-14 23:56:48,485 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1853628.0, ans=0.0 2023-10-14 23:57:05,860 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1853674.6666666667, ans=0.125 2023-10-14 23:57:11,191 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1853721.3333333333, ans=0.125 2023-10-14 23:57:36,065 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1853814.6666666667, ans=0.1 2023-10-14 23:57:38,126 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.89 vs. 
limit=15.0 2023-10-14 23:58:02,531 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1853954.6666666667, ans=0.025 2023-10-14 23:58:23,281 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.495e+02 1.749e+02 1.871e+02 2.037e+02 2.473e+02, threshold=3.741e+02, percent-clipped=0.0 2023-10-14 23:58:24,631 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1854048.0, ans=0.0 2023-10-14 23:58:31,296 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 23:58:40,632 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1854094.6666666667, ans=0.1 2023-10-14 23:58:58,211 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1854188.0, ans=0.125 2023-10-14 23:59:13,879 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1854234.6666666667, ans=0.0 2023-10-14 23:59:14,885 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1854234.6666666667, ans=0.0 2023-10-14 23:59:25,920 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1854281.3333333333, ans=0.1 2023-10-14 23:59:33,635 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=1854328.0, ans=22.5 2023-10-14 23:59:45,123 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1854374.6666666667, ans=0.125 2023-10-14 23:59:48,551 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1854374.6666666667, ans=0.0 2023-10-14 23:59:56,756 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1854421.3333333333, ans=0.125 2023-10-15 00:00:03,331 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1854468.0, ans=0.125 2023-10-15 00:00:12,529 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.582e+02 1.783e+02 1.971e+02 2.180e+02 3.057e+02, threshold=3.942e+02, percent-clipped=0.0 2023-10-15 00:00:20,484 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1854514.6666666667, ans=0.1 2023-10-15 00:00:24,197 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1854561.3333333333, ans=0.1 2023-10-15 00:00:28,294 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1854561.3333333333, ans=0.2 2023-10-15 00:00:31,317 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.47 vs. 
limit=12.0 2023-10-15 00:00:33,431 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1854561.3333333333, ans=0.125 2023-10-15 00:00:36,374 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.28 vs. limit=12.0 2023-10-15 00:00:44,539 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-15 00:00:56,683 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1854654.6666666667, ans=0.0 2023-10-15 00:01:08,148 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1854748.0, ans=0.125 2023-10-15 00:01:09,031 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1854748.0, ans=0.125 2023-10-15 00:01:11,803 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1854748.0, ans=0.1 2023-10-15 00:01:19,972 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1854794.6666666667, ans=0.125 2023-10-15 00:01:31,303 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1854841.3333333333, ans=0.125 2023-10-15 00:01:53,478 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.55 vs. limit=15.0 2023-10-15 00:01:55,022 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1854934.6666666667, ans=0.0 2023-10-15 00:02:00,204 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1854934.6666666667, ans=0.1 2023-10-15 00:02:04,827 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.463e+02 1.828e+02 2.019e+02 2.227e+02 3.813e+02, threshold=4.037e+02, percent-clipped=0.0 2023-10-15 00:02:05,725 INFO [train.py:1031] (3/4) Epoch 30, batch 1500, loss[loss=0.1728, simple_loss=0.2683, pruned_loss=0.03862, over 16863.00 frames. ], tot_loss[loss=0.1838, simple_loss=0.2765, pruned_loss=0.04555, over 17327720.89 frames. ], batch size: 110, lr: 1.15e-03, grad_scale: 8.0 2023-10-15 00:02:16,116 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1855028.0, ans=0.125 2023-10-15 00:02:34,730 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_na.min_abs, batch_count=1855074.6666666667, ans=0.02 2023-10-15 00:02:35,087 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.88 vs. 
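The grad_scale values in the train.py:1031 entries move between 8.0 and 32.0 over these batches (8.0 at batch 1500 above, 16.0 and 32.0 in the neighbouring entries), the usual signature of dynamic fp16 loss scaling: the scale is halved after a step whose gradients overflow and grown back after a run of clean steps. A generic sketch using PyTorch's stock scaler; the growth/backoff settings are illustrative and this run's actual scaler may be customized:

import torch

scaler = torch.cuda.amp.GradScaler(
    init_scale=16.0,      # illustrative; the run's real settings are unknown
    growth_factor=2.0,    # double the scale after growth_interval clean steps
    backoff_factor=0.5,   # halve it when a step produces inf/nan gradients
    growth_interval=2000,
)

def amp_step(model, optimizer, batch):
    """One fp16 training step (sketch)."""
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = model(batch)
    scaler.scale(loss).backward()
    scaler.step(optimizer)   # skipped internally if gradients overflowed
    scaler.update()          # here grad_scale moves between 8/16/32
    return loss.detach()
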
limit=15.0 2023-10-15 00:02:47,162 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1855121.3333333333, ans=0.125 2023-10-15 00:03:04,798 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1855214.6666666667, ans=0.125 2023-10-15 00:03:05,269 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.71 vs. limit=22.5 2023-10-15 00:03:13,326 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.25 vs. limit=12.0 2023-10-15 00:03:24,664 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1855308.0, ans=0.125 2023-10-15 00:03:27,447 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=1855308.0, ans=0.05 2023-10-15 00:03:29,101 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1855308.0, ans=0.0 2023-10-15 00:03:29,113 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1855308.0, ans=0.125 2023-10-15 00:03:38,351 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1855354.6666666667, ans=0.0 2023-10-15 00:03:49,662 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.22 vs. limit=22.5 2023-10-15 00:03:58,736 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.612e+02 1.870e+02 2.076e+02 2.405e+02 3.528e+02, threshold=4.152e+02, percent-clipped=0.0 2023-10-15 00:03:59,001 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1855448.0, ans=0.125 2023-10-15 00:04:20,415 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1855541.3333333333, ans=0.0 2023-10-15 00:04:30,360 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1855541.3333333333, ans=0.015 2023-10-15 00:04:56,249 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1855634.6666666667, ans=0.125 2023-10-15 00:05:07,267 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1855681.3333333333, ans=0.1 2023-10-15 00:05:18,467 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.91 vs. 
limit=22.5 2023-10-15 00:05:25,838 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1855774.6666666667, ans=0.125 2023-10-15 00:05:32,794 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1855821.3333333333, ans=0.125 2023-10-15 00:05:32,917 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1855821.3333333333, ans=0.125 2023-10-15 00:05:38,438 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1855821.3333333333, ans=10.0 2023-10-15 00:05:39,816 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.27 vs. limit=6.0 2023-10-15 00:05:53,243 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.619e+02 1.868e+02 2.043e+02 2.325e+02 3.101e+02, threshold=4.085e+02, percent-clipped=0.0 2023-10-15 00:06:00,165 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1855914.6666666667, ans=0.05 2023-10-15 00:06:01,040 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1855914.6666666667, ans=0.125 2023-10-15 00:06:09,568 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=7.75 vs. limit=15.0 2023-10-15 00:06:20,436 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1856008.0, ans=0.1 2023-10-15 00:06:21,512 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.47 vs. limit=15.0 2023-10-15 00:06:25,964 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-15 00:06:43,745 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1856101.3333333333, ans=0.1 2023-10-15 00:06:47,383 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1856148.0, ans=0.0 2023-10-15 00:06:54,794 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1856194.6666666667, ans=0.125 2023-10-15 00:07:13,139 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1856241.3333333333, ans=0.0 2023-10-15 00:07:43,188 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.461e+02 1.805e+02 1.963e+02 2.142e+02 3.150e+02, threshold=3.925e+02, percent-clipped=0.0 2023-10-15 00:07:54,495 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.50 vs. 
limit=6.0 2023-10-15 00:08:12,000 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1856474.6666666667, ans=0.015 2023-10-15 00:08:13,084 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1856474.6666666667, ans=0.1 2023-10-15 00:08:17,451 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1856521.3333333333, ans=0.125 2023-10-15 00:08:34,693 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1856568.0, ans=0.125 2023-10-15 00:08:54,995 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1856661.3333333333, ans=0.1 2023-10-15 00:08:59,934 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1856708.0, ans=0.0 2023-10-15 00:09:09,334 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1856708.0, ans=0.125 2023-10-15 00:09:10,062 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1856708.0, ans=0.1 2023-10-15 00:09:14,024 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-15 00:09:14,085 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-15 00:09:22,362 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.42 vs. limit=12.0 2023-10-15 00:09:23,101 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1856801.3333333333, ans=0.0 2023-10-15 00:09:31,618 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1856801.3333333333, ans=0.0 2023-10-15 00:09:32,768 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1856848.0, ans=0.125 2023-10-15 00:09:33,272 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.468e+02 1.844e+02 1.984e+02 2.214e+02 3.420e+02, threshold=3.968e+02, percent-clipped=0.0 2023-10-15 00:09:35,836 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1856848.0, ans=0.1 2023-10-15 00:09:43,041 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.16 vs. 
limit=15.0 2023-10-15 00:09:45,377 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1856894.6666666667, ans=0.0 2023-10-15 00:10:02,646 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1856941.3333333333, ans=0.0 2023-10-15 00:10:07,322 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1856988.0, ans=0.125 2023-10-15 00:10:56,627 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1857128.0, ans=0.0 2023-10-15 00:11:02,881 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1857174.6666666667, ans=0.09899494936611666 2023-10-15 00:11:27,210 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1857268.0, ans=0.2 2023-10-15 00:11:32,300 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1857268.0, ans=0.125 2023-10-15 00:11:37,157 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.591e+02 1.808e+02 1.953e+02 2.178e+02 3.476e+02, threshold=3.906e+02, percent-clipped=0.0 2023-10-15 00:11:37,178 INFO [train.py:1031] (3/4) Epoch 30, batch 2000, loss[loss=0.2139, simple_loss=0.2938, pruned_loss=0.06701, over 15655.00 frames. ], tot_loss[loss=0.1844, simple_loss=0.2772, pruned_loss=0.04582, over 20753227.84 frames. ], batch size: 350, lr: 1.15e-03, grad_scale: 16.0 2023-10-15 00:11:49,275 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1857361.3333333333, ans=0.125 2023-10-15 00:11:58,026 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1857361.3333333333, ans=0.125 2023-10-15 00:11:59,198 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1857361.3333333333, ans=0.0 2023-10-15 00:12:05,791 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1857408.0, ans=0.125 2023-10-15 00:13:30,110 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1857734.6666666667, ans=0.0 2023-10-15 00:13:38,339 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.523e+02 1.779e+02 1.970e+02 2.192e+02 3.720e+02, threshold=3.941e+02, percent-clipped=0.0 2023-10-15 00:13:39,498 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.26 vs. limit=15.0 2023-10-15 00:14:39,588 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1857921.3333333333, ans=0.0 2023-10-15 00:14:39,700 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1857921.3333333333, ans=0.125 2023-10-15 00:14:42,490 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.92 vs. 
limit=22.5 2023-10-15 00:15:03,214 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1858014.6666666667, ans=0.125 2023-10-15 00:15:13,718 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.95 vs. limit=10.0 2023-10-15 00:15:37,958 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1858154.6666666667, ans=0.1 2023-10-15 00:15:40,496 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1858201.3333333333, ans=0.125 2023-10-15 00:15:40,878 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.12 vs. limit=22.5 2023-10-15 00:15:42,682 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.76 vs. limit=6.0 2023-10-15 00:15:45,621 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.82 vs. limit=15.0 2023-10-15 00:15:53,245 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.631e+02 1.880e+02 2.060e+02 2.209e+02 3.063e+02, threshold=4.120e+02, percent-clipped=0.0 2023-10-15 00:16:04,316 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1858294.6666666667, ans=0.125 2023-10-15 00:16:08,731 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1858294.6666666667, ans=0.0 2023-10-15 00:16:21,554 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1858341.3333333333, ans=0.1 2023-10-15 00:16:46,041 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-15 00:16:54,865 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1858481.3333333333, ans=0.0 2023-10-15 00:16:57,076 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=4.79 vs. limit=15.0 2023-10-15 00:17:07,079 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1858528.0, ans=0.1 2023-10-15 00:17:11,190 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1858574.6666666667, ans=0.0 2023-10-15 00:17:23,432 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1858621.3333333333, ans=0.2 2023-10-15 00:17:27,909 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1858621.3333333333, ans=0.1 2023-10-15 00:17:34,826 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.74 vs. 
limit=10.0 2023-10-15 00:17:40,442 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.493e+02 1.838e+02 2.043e+02 2.258e+02 3.292e+02, threshold=4.086e+02, percent-clipped=0.0 2023-10-15 00:17:46,017 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1858714.6666666667, ans=0.125 2023-10-15 00:17:47,690 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1858714.6666666667, ans=0.2 2023-10-15 00:17:50,420 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1858761.3333333333, ans=0.125 2023-10-15 00:17:59,572 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1858761.3333333333, ans=0.0 2023-10-15 00:18:16,874 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.21 vs. limit=6.0 2023-10-15 00:18:24,492 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1858901.3333333333, ans=0.035 2023-10-15 00:18:34,527 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=4.50 vs. limit=15.0 2023-10-15 00:18:35,982 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1858948.0, ans=0.125 2023-10-15 00:18:37,021 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1858948.0, ans=0.0 2023-10-15 00:18:46,736 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1858994.6666666667, ans=0.125 2023-10-15 00:19:01,596 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1859041.3333333333, ans=0.1 2023-10-15 00:19:03,325 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1859041.3333333333, ans=0.07 2023-10-15 00:19:08,076 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1859088.0, ans=0.125 2023-10-15 00:19:11,138 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1859088.0, ans=0.0 2023-10-15 00:19:15,442 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1859088.0, ans=0.125 2023-10-15 00:19:19,167 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1859134.6666666667, ans=0.1 2023-10-15 00:19:28,053 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.603e+02 1.919e+02 2.075e+02 2.253e+02 2.869e+02, threshold=4.149e+02, percent-clipped=0.0 2023-10-15 00:19:39,740 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1859228.0, ans=0.0 2023-10-15 00:19:52,131 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1859274.6666666667, ans=0.0 2023-10-15 00:20:03,067 INFO 
[scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1859321.3333333333, ans=0.125 2023-10-15 00:20:19,581 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1859368.0, ans=0.2 2023-10-15 00:20:36,096 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1859461.3333333333, ans=0.125 2023-10-15 00:20:48,936 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-15 00:20:52,569 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1859508.0, ans=0.0 2023-10-15 00:20:53,493 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1859554.6666666667, ans=0.04949747468305833 2023-10-15 00:20:59,734 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff2.min_abs, batch_count=1859554.6666666667, ans=0.1 2023-10-15 00:21:04,631 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1859601.3333333333, ans=0.125 2023-10-15 00:21:07,618 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1859601.3333333333, ans=0.125 2023-10-15 00:21:15,405 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.533e+02 1.817e+02 1.939e+02 2.098e+02 2.744e+02, threshold=3.877e+02, percent-clipped=0.0 2023-10-15 00:21:15,441 INFO [train.py:1031] (3/4) Epoch 30, batch 2500, loss[loss=0.1679, simple_loss=0.269, pruned_loss=0.03341, over 16907.00 frames. ], tot_loss[loss=0.1845, simple_loss=0.2772, pruned_loss=0.04589, over 23414212.89 frames. ], batch size: 93, lr: 1.15e-03, grad_scale: 32.0 2023-10-15 00:21:16,669 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1859648.0, ans=0.0 2023-10-15 00:21:17,822 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.43 vs. limit=15.0 2023-10-15 00:21:22,321 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.47 vs. 
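
The Epoch 30, batch 2500 summary above is internally consistent with the headline loss being a weighted sum of the two pruned-transducer terms, loss ~= 0.5 * simple_loss + pruned_loss. The 0.5 weight is inferred from the logged numbers themselves, not read out of the config; both the per-batch loss[...] tuple and the running tot_loss[...] tuple satisfy it:

    # Numbers copied from the "Epoch 30, batch 2500" record above.
    # Per-batch:   0.5 * 0.269  + 0.03341 = 0.16791 ~ 0.1679
    # Running avg: 0.5 * 0.2772 + 0.04589 = 0.18449 ~ 0.1845
    for simple, pruned, total in [(0.269, 0.03341, 0.1679),
                                  (0.2772, 0.04589, 0.1845)]:
        assert abs(0.5 * simple + pruned - total) < 5e-4
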
limit=12.0 2023-10-15 00:21:24,534 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1859694.6666666667, ans=0.125 2023-10-15 00:21:24,651 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1859694.6666666667, ans=0.125 2023-10-15 00:21:43,666 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1859741.3333333333, ans=0.125 2023-10-15 00:21:48,582 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1859788.0, ans=0.125 2023-10-15 00:21:50,929 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1859788.0, ans=0.09899494936611666 2023-10-15 00:21:57,695 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1859834.6666666667, ans=0.0 2023-10-15 00:21:59,987 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.43 vs. limit=8.0 2023-10-15 00:22:06,209 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.49 vs. limit=15.0 2023-10-15 00:22:14,207 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1859881.3333333333, ans=0.125 2023-10-15 00:22:35,811 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1859974.6666666667, ans=0.05 2023-10-15 00:22:38,394 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1860021.3333333333, ans=0.125 2023-10-15 00:22:43,834 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.23 vs. limit=10.0 2023-10-15 00:22:52,325 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.90 vs. limit=6.0 2023-10-15 00:22:59,505 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.508e+02 1.847e+02 2.029e+02 2.222e+02 3.312e+02, threshold=4.058e+02, percent-clipped=0.0 2023-10-15 00:23:07,303 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.54 vs. 
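
The Whitening lines compare a per-module statistic against a limit (e.g. "encoder_embed.out_whiten ... metric=6.43 vs. limit=8.0" above); while the metric stays under the limit the module's features are considered sufficiently "white". One metric with exactly this shape, and a plausible reading of the logged values, is tr(C^2) * d / tr(C)^2 for the feature covariance C over d channels per group: it is 1.0 for a perfectly isotropic covariance and approaches d when a single direction dominates, which matches the ranges seen here (metrics of a few units against limits like 6.0, 15.0, 22.5). A sketch under that assumed formula, not checked against scaling.py:

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
        """Anisotropy of the feature covariance, in [1, d] per group.
        Assumed formula: tr(C^2) * d / tr(C)^2, averaged over groups."""
        x = x.reshape(-1, x.shape[-1])                 # (frames, channels)
        num_frames, num_channels = x.shape
        d = num_channels // num_groups
        xg = x.reshape(num_frames, num_groups, d).transpose(0, 1)
        cov = xg.transpose(1, 2) @ xg / num_frames     # (groups, d, d)
        tr_c = cov.diagonal(dim1=1, dim2=2).sum(dim=1)
        tr_c2 = (cov * cov).sum(dim=(1, 2))            # C is symmetric
        return (tr_c2 * d / tr_c.pow(2)).mean().item()

    # White noise sits near the lower bound of 1.0, far below a limit
    # like 8.0; strongly correlated channels push the metric toward d.
    print(whitening_metric(torch.randn(10000, 192)))   # ~1.02
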
limit=6.0 2023-10-15 00:23:22,992 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1860208.0, ans=0.0 2023-10-15 00:23:24,103 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1860208.0, ans=0.125 2023-10-15 00:23:32,519 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1860208.0, ans=0.125 2023-10-15 00:23:34,442 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-15 00:24:13,722 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1860394.6666666667, ans=0.125 2023-10-15 00:24:42,106 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1860534.6666666667, ans=0.125 2023-10-15 00:24:42,116 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1860534.6666666667, ans=0.1 2023-10-15 00:24:45,073 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-15 00:24:54,420 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.567e+02 1.867e+02 2.050e+02 2.248e+02 3.055e+02, threshold=4.100e+02, percent-clipped=0.0 2023-10-15 00:25:18,044 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1860674.6666666667, ans=0.09899494936611666 2023-10-15 00:25:19,190 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=15.59 vs. limit=22.5 2023-10-15 00:25:23,721 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1860674.6666666667, ans=0.2 2023-10-15 00:25:33,175 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1860721.3333333333, ans=0.0 2023-10-15 00:25:35,679 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1860721.3333333333, ans=0.0 2023-10-15 00:25:38,606 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1860721.3333333333, ans=0.125 2023-10-15 00:25:56,040 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.56 vs. 
limit=15.0 2023-10-15 00:26:13,114 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1860861.3333333333, ans=0.125 2023-10-15 00:26:14,077 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1860861.3333333333, ans=0.125 2023-10-15 00:26:22,610 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten.whitening_limit, batch_count=1860908.0, ans=15.0 2023-10-15 00:26:23,903 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-15 00:26:31,654 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1860954.6666666667, ans=0.1 2023-10-15 00:26:48,229 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.42 vs. limit=15.0 2023-10-15 00:26:49,219 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.91 vs. limit=15.0 2023-10-15 00:26:53,154 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.42 vs. limit=15.0 2023-10-15 00:26:54,760 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.471e+02 1.817e+02 1.951e+02 2.202e+02 3.582e+02, threshold=3.901e+02, percent-clipped=0.0 2023-10-15 00:26:58,815 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.13 vs. limit=15.0 2023-10-15 00:27:37,362 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.38 vs. limit=12.0 2023-10-15 00:27:49,172 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1861234.6666666667, ans=0.1 2023-10-15 00:28:12,879 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1861328.0, ans=0.1 2023-10-15 00:28:27,829 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1861374.6666666667, ans=0.0 2023-10-15 00:28:56,458 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1861514.6666666667, ans=0.0 2023-10-15 00:28:58,569 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.578e+02 1.819e+02 1.967e+02 2.136e+02 2.828e+02, threshold=3.935e+02, percent-clipped=0.0 2023-10-15 00:29:04,856 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1861514.6666666667, ans=0.1 2023-10-15 00:29:14,915 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1861561.3333333333, ans=0.1 2023-10-15 00:29:38,928 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.58 vs. 
limit=15.0 2023-10-15 00:29:40,422 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1861701.3333333333, ans=0.125 2023-10-15 00:29:56,343 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1861748.0, ans=0.125 2023-10-15 00:29:58,330 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1861748.0, ans=0.0 2023-10-15 00:30:00,134 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1861794.6666666667, ans=0.0 2023-10-15 00:30:07,611 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1861794.6666666667, ans=0.0 2023-10-15 00:30:08,795 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.87 vs. limit=15.0 2023-10-15 00:30:21,802 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1861888.0, ans=0.125 2023-10-15 00:30:31,031 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1861888.0, ans=0.0 2023-10-15 00:30:43,258 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=10.36 vs. limit=15.0 2023-10-15 00:30:44,615 INFO [train.py:1031] (3/4) Epoch 30, batch 3000, loss[loss=0.1845, simple_loss=0.2762, pruned_loss=0.04641, over 16587.00 frames. ], tot_loss[loss=0.1842, simple_loss=0.2766, pruned_loss=0.04591, over 25477210.12 frames. ], batch size: 219, lr: 1.15e-03, grad_scale: 16.0 2023-10-15 00:30:46,451 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.564e+02 1.798e+02 1.977e+02 2.189e+02 4.054e+02, threshold=3.955e+02, percent-clipped=1.0 2023-10-15 00:30:54,861 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1862028.0, ans=0.0 2023-10-15 00:31:06,517 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1862074.6666666667, ans=0.125 2023-10-15 00:31:11,439 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1862074.6666666667, ans=0.09899494936611666 2023-10-15 00:31:12,812 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.56 vs. limit=15.0 2023-10-15 00:31:29,221 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1862168.0, ans=0.125 2023-10-15 00:31:54,748 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1862261.3333333333, ans=0.04949747468305833 2023-10-15 00:31:54,753 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1862261.3333333333, ans=0.0 2023-10-15 00:32:04,490 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.68 vs. 
limit=6.0 2023-10-15 00:32:27,683 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1862401.3333333333, ans=0.0 2023-10-15 00:32:39,092 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.551e+02 1.915e+02 2.077e+02 2.261e+02 4.211e+02, threshold=4.154e+02, percent-clipped=1.0 2023-10-15 00:32:45,754 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1862448.0, ans=0.2 2023-10-15 00:32:55,960 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1862494.6666666667, ans=0.125 2023-10-15 00:33:14,835 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1862588.0, ans=0.125 2023-10-15 00:33:19,831 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1862588.0, ans=0.125 2023-10-15 00:33:22,526 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1862588.0, ans=0.2 2023-10-15 00:33:30,888 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1862634.6666666667, ans=0.1 2023-10-15 00:33:32,680 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1862634.6666666667, ans=0.1 2023-10-15 00:33:49,133 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.29 vs. limit=22.5 2023-10-15 00:34:20,859 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1862868.0, ans=0.125 2023-10-15 00:34:20,921 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1862868.0, ans=0.2 2023-10-15 00:34:26,559 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1862868.0, ans=0.0 2023-10-15 00:34:28,282 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1862914.6666666667, ans=0.2 2023-10-15 00:34:29,780 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.512e+02 1.788e+02 1.983e+02 2.184e+02 2.870e+02, threshold=3.966e+02, percent-clipped=0.0 2023-10-15 00:34:34,303 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.34 vs. limit=15.0 2023-10-15 00:34:34,923 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1862914.6666666667, ans=0.125 2023-10-15 00:34:57,230 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1863008.0, ans=0.0 2023-10-15 00:34:57,372 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.56 vs. 
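
The grad_scale field in the batch summaries (32.0 at batch 2500, 16.0 at batch 3000 above, back to 32.0 later) is the mixed-precision loss scale. With fp16 training the loss is multiplied by this factor before backward so small gradients do not underflow; a dynamic scaler halves it when an overflowing (inf/nan) gradient is detected and grows it again after a run of clean steps, which is the back-and-forth this log shows. A standard usage sketch with torch.cuda.amp, where model, criterion, and batch are placeholders and the constructor arguments are PyTorch's defaults, not values taken from this run:

    import torch

    scaler = torch.cuda.amp.GradScaler(
        init_scale=65536.0,    # halved on overflow ...
        backoff_factor=0.5,
        growth_factor=2.0,     # ... doubled after growth_interval clean steps
        growth_interval=2000,
    )

    def train_step(model, optimizer, batch, criterion):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():      # fp16 forward pass
            loss = criterion(model(batch))
        scaler.scale(loss).backward()        # backward on the scaled loss
        scaler.step(optimizer)               # unscales; skips step on inf/nan
        scaler.update()                      # adjusts the scale dynamically
        return scaler.get_scale()            # the logged grad_scale value
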
limit=22.5 2023-10-15 00:34:58,485 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1863008.0, ans=0.125 2023-10-15 00:35:09,336 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.76 vs. limit=6.0 2023-10-15 00:35:25,127 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1863101.3333333333, ans=0.1 2023-10-15 00:35:26,119 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1863101.3333333333, ans=0.0 2023-10-15 00:35:45,909 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1863194.6666666667, ans=0.125 2023-10-15 00:36:19,085 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1863288.0, ans=0.125 2023-10-15 00:36:21,207 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1863288.0, ans=0.125 2023-10-15 00:36:36,327 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.630e+02 1.829e+02 2.037e+02 2.240e+02 2.885e+02, threshold=4.075e+02, percent-clipped=0.0 2023-10-15 00:36:36,573 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1863381.3333333333, ans=0.0 2023-10-15 00:37:01,066 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-15 00:37:07,747 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=9.82 vs. 
limit=22.5 2023-10-15 00:37:20,896 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1863568.0, ans=0.125 2023-10-15 00:37:30,078 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1863614.6666666667, ans=0.125 2023-10-15 00:37:37,309 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1863661.3333333333, ans=0.125 2023-10-15 00:37:45,211 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1863661.3333333333, ans=0.2 2023-10-15 00:37:45,298 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-15 00:37:57,655 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1863708.0, ans=0.1 2023-10-15 00:38:02,019 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1863754.6666666667, ans=0.125 2023-10-15 00:38:02,147 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1863754.6666666667, ans=0.0 2023-10-15 00:38:18,571 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1863801.3333333333, ans=0.2 2023-10-15 00:38:24,872 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1863848.0, ans=0.0 2023-10-15 00:38:24,879 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1863848.0, ans=0.0 2023-10-15 00:38:27,192 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.578e+02 1.887e+02 2.029e+02 2.203e+02 2.920e+02, threshold=4.057e+02, percent-clipped=0.0 2023-10-15 00:38:45,039 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1863941.3333333333, ans=0.0 2023-10-15 00:38:57,328 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1863988.0, ans=0.125 2023-10-15 00:39:20,142 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.95 vs. 
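
The WithLoss lines (e.g. "name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00" above) report an auxiliary penalty attached to a sub-module's attention weights; a sum of 0.000e+00 means the penalty currently contributes nothing. A generic sketch of the attach-an-aux-loss pattern, where a wrapper returns its input unchanged but accumulates a penalty the training loop can add to the main loss; both the wrapper and the example penalty are illustrative, not the scaling.py mechanism:

    import torch

    class WithAuxLoss(torch.nn.Module):
        """Pass-through module that accumulates penalty_fn(x) so the
        training loop can fold it into the total loss."""

        def __init__(self, penalty_fn):
            super().__init__()
            self.penalty_fn = penalty_fn
            self.loss_sum = torch.zeros(())

        def forward(self, x):
            if self.training:
                self.loss_sum = self.loss_sum + self.penalty_fn(x)
            return x

    # Illustrative penalty: discourage attention weights from collapsing
    # onto a single key (entropy below a floor incurs loss; otherwise 0,
    # consistent with the loss-sum=0.000e+00 records above).
    def attention_entropy_penalty(attn, floor=1.0):
        ent = -(attn * attn.clamp_min(1e-9).log()).sum(-1).mean()
        return (floor - ent).clamp_min(0.0)

    aux = WithAuxLoss(attention_entropy_penalty)
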
limit=10.0 2023-10-15 00:39:24,992 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1864081.3333333333, ans=0.125 2023-10-15 00:39:49,363 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1864221.3333333333, ans=0.0 2023-10-15 00:39:49,394 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1864221.3333333333, ans=0.04949747468305833 2023-10-15 00:39:53,503 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1864221.3333333333, ans=0.125 2023-10-15 00:39:55,381 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1864221.3333333333, ans=0.125 2023-10-15 00:39:56,437 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1864221.3333333333, ans=0.2 2023-10-15 00:39:56,713 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.63 vs. limit=15.0 2023-10-15 00:39:58,828 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1864221.3333333333, ans=0.0 2023-10-15 00:40:03,863 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1864268.0, ans=0.1 2023-10-15 00:40:13,534 INFO [train.py:1031] (3/4) Epoch 30, batch 3500, loss[loss=0.1815, simple_loss=0.2792, pruned_loss=0.04195, over 16690.00 frames. ], tot_loss[loss=0.1843, simple_loss=0.2767, pruned_loss=0.046, over 27096109.79 frames. ], batch size: 202, lr: 1.14e-03, grad_scale: 16.0 2023-10-15 00:40:16,177 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.536e+02 1.842e+02 1.992e+02 2.110e+02 3.078e+02, threshold=3.984e+02, percent-clipped=0.0 2023-10-15 00:40:21,212 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1864314.6666666667, ans=0.125 2023-10-15 00:40:21,940 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1864314.6666666667, ans=0.125 2023-10-15 00:40:26,353 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1864361.3333333333, ans=0.5 2023-10-15 00:41:25,760 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.81 vs. 
limit=10.0 2023-10-15 00:41:36,264 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1864641.3333333333, ans=0.0 2023-10-15 00:42:09,758 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1864734.6666666667, ans=0.125 2023-10-15 00:42:17,195 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.492e+02 1.895e+02 2.070e+02 2.280e+02 3.076e+02, threshold=4.140e+02, percent-clipped=0.0 2023-10-15 00:42:30,678 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1864828.0, ans=0.1 2023-10-15 00:42:58,256 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1864921.3333333333, ans=0.125 2023-10-15 00:43:04,083 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1864968.0, ans=0.125 2023-10-15 00:43:19,555 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1865014.6666666667, ans=0.0 2023-10-15 00:43:23,555 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1865014.6666666667, ans=0.1 2023-10-15 00:43:32,884 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1865061.3333333333, ans=0.125 2023-10-15 00:43:35,753 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1865108.0, ans=0.125 2023-10-15 00:43:42,990 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1865108.0, ans=0.1 2023-10-15 00:43:44,635 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1865108.0, ans=0.125 2023-10-15 00:43:52,988 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=1865154.6666666667, ans=0.05 2023-10-15 00:43:58,387 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.min_positive, batch_count=1865201.3333333333, ans=0.025 2023-10-15 00:44:08,502 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1865201.3333333333, ans=0.125 2023-10-15 00:44:14,639 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.520e+02 1.800e+02 1.894e+02 2.079e+02 3.098e+02, threshold=3.788e+02, percent-clipped=0.0 2023-10-15 00:44:21,414 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1865294.6666666667, ans=0.2 2023-10-15 00:44:30,667 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1865294.6666666667, ans=0.0 2023-10-15 00:44:38,187 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1865341.3333333333, ans=0.0 2023-10-15 00:45:02,354 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1865434.6666666667, ans=0.0 2023-10-15 00:45:16,780 INFO [scaling.py:199] 
(3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1865481.3333333333, ans=0.0 2023-10-15 00:45:35,229 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-15 00:45:48,906 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.65 vs. limit=15.0 2023-10-15 00:45:55,478 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1865621.3333333333, ans=0.125 2023-10-15 00:46:02,405 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1865668.0, ans=0.125 2023-10-15 00:46:04,777 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.59 vs. limit=15.0 2023-10-15 00:46:07,484 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1865668.0, ans=0.125 2023-10-15 00:46:14,016 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1865714.6666666667, ans=0.0 2023-10-15 00:46:14,661 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.574e+02 1.904e+02 2.034e+02 2.215e+02 3.907e+02, threshold=4.067e+02, percent-clipped=2.0 2023-10-15 00:46:40,321 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1865808.0, ans=0.125 2023-10-15 00:46:47,899 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1865854.6666666667, ans=0.125 2023-10-15 00:47:05,881 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1865901.3333333333, ans=0.0 2023-10-15 00:47:35,991 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1866041.3333333333, ans=0.2 2023-10-15 00:47:54,829 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.20 vs. 
limit=15.0 2023-10-15 00:47:57,737 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1866134.6666666667, ans=0.0 2023-10-15 00:48:06,276 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.508e+02 1.798e+02 1.964e+02 2.087e+02 2.752e+02, threshold=3.928e+02, percent-clipped=0.0 2023-10-15 00:48:10,878 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=1866181.3333333333, ans=0.5 2023-10-15 00:48:16,103 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1866228.0, ans=0.125 2023-10-15 00:48:16,849 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1866228.0, ans=0.0 2023-10-15 00:48:16,900 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1866228.0, ans=0.1 2023-10-15 00:48:40,293 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1866321.3333333333, ans=0.2 2023-10-15 00:48:42,240 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1866321.3333333333, ans=0.0 2023-10-15 00:49:07,686 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-15 00:49:12,073 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.57 vs. limit=15.0 2023-10-15 00:49:15,088 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1866461.3333333333, ans=0.125 2023-10-15 00:49:23,091 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1866508.0, ans=0.0 2023-10-15 00:49:29,615 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.61 vs. limit=10.0 2023-10-15 00:49:36,695 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.55 vs. limit=10.0 2023-10-15 00:49:50,493 INFO [train.py:1031] (3/4) Epoch 30, batch 4000, loss[loss=0.1918, simple_loss=0.2951, pruned_loss=0.04425, over 16608.00 frames. ], tot_loss[loss=0.1844, simple_loss=0.2764, pruned_loss=0.0462, over 28331422.68 frames. 
], batch size: 219, lr: 1.14e-03, grad_scale: 32.0 2023-10-15 00:49:59,572 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.516e+02 1.849e+02 1.997e+02 2.108e+02 3.017e+02, threshold=3.994e+02, percent-clipped=0.0 2023-10-15 00:50:09,976 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1866694.6666666667, ans=0.0 2023-10-15 00:50:16,772 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1866694.6666666667, ans=0.125 2023-10-15 00:50:51,338 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1866881.3333333333, ans=0.0 2023-10-15 00:51:09,974 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1866928.0, ans=0.04949747468305833 2023-10-15 00:51:28,392 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.17 vs. limit=10.0 2023-10-15 00:51:35,156 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1867021.3333333333, ans=0.125 2023-10-15 00:51:50,723 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1867114.6666666667, ans=0.125 2023-10-15 00:51:51,187 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.642e+02 1.959e+02 2.186e+02 2.389e+02 3.914e+02, threshold=4.373e+02, percent-clipped=0.0 2023-10-15 00:51:56,778 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1867161.3333333333, ans=0.125 2023-10-15 00:51:57,102 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.41 vs. limit=15.0 2023-10-15 00:52:02,885 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=1867161.3333333333, ans=10.0 2023-10-15 00:52:11,875 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1867208.0, ans=0.05 2023-10-15 00:52:11,908 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1867208.0, ans=0.125 2023-10-15 00:53:06,362 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1867394.6666666667, ans=0.125 2023-10-15 00:53:54,795 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.542e+02 1.849e+02 1.983e+02 2.128e+02 3.216e+02, threshold=3.966e+02, percent-clipped=0.0 2023-10-15 00:53:55,124 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1867581.3333333333, ans=0.125 2023-10-15 00:54:02,968 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1867628.0, ans=0.0 2023-10-15 00:54:07,416 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1867628.0, ans=0.125 2023-10-15 00:54:08,683 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.78 vs. 
limit=15.0 2023-10-15 00:54:14,865 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1867674.6666666667, ans=0.0 2023-10-15 00:55:01,368 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.58 vs. limit=22.5 2023-10-15 00:55:02,728 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1867861.3333333333, ans=0.125 2023-10-15 00:55:07,940 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1867908.0, ans=0.125 2023-10-15 00:55:14,216 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1867908.0, ans=0.0 2023-10-15 00:55:27,806 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1867954.6666666667, ans=0.125 2023-10-15 00:55:35,959 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1868001.3333333333, ans=0.125 2023-10-15 00:55:43,787 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.559e+02 1.871e+02 2.144e+02 2.430e+02 3.741e+02, threshold=4.289e+02, percent-clipped=0.0 2023-10-15 00:56:11,556 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=4.54 vs. limit=15.0 2023-10-15 00:56:24,209 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=3.86 vs. limit=12.0 2023-10-15 00:56:29,984 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.31 vs. limit=12.0 2023-10-15 00:56:36,334 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1868281.3333333333, ans=0.125 2023-10-15 00:56:58,526 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1868374.6666666667, ans=0.0 2023-10-15 00:57:15,480 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1868421.3333333333, ans=0.2 2023-10-15 00:57:22,947 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.58 vs. 
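
Two more recurring name patterns: bypass.scale_min / bypass_mid.scale_min, and the various *_skip_rate entries (attention_skip_rate, conv_skip_rate, ff2_skip_rate, mostly 0.0 this late in training). The first suggests a learned residual bypass whose per-channel mixing weight is clamped to at least scale_min; the second, stochastic skipping of sub-modules during training with the given probability. A sketch of both, with all specifics assumed rather than taken from zipformer.py:

    import torch

    class Bypass(torch.nn.Module):
        """y = x + c * (layer_out - x): mostly the input for c near
        scale_min, the full layer output for c = 1. c is learned per
        channel and clamped to [scale_min, 1.0]; scale_min itself is
        typically scheduled (cf. the scale_min entries, e.g. ans=0.2)."""

        def __init__(self, channels: int, scale_min: float = 0.2):
            super().__init__()
            self.scale = torch.nn.Parameter(torch.full((channels,), 0.5))
            self.scale_min = scale_min

        def forward(self, x, layer_out):
            c = self.scale.clamp(min=self.scale_min, max=1.0)
            return x + c * (layer_out - x)

    def maybe_skip(module, x, skip_rate: float, training: bool):
        """Randomly skip a sub-module while training; skip_rate=0.0
        (as in most *_skip_rate entries above) disables skipping."""
        if training and torch.rand(()).item() < skip_rate:
            return x
        return module(x)
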
limit=15.0 2023-10-15 00:57:36,960 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.559e+02 1.867e+02 1.995e+02 2.226e+02 3.194e+02, threshold=3.991e+02, percent-clipped=0.0 2023-10-15 00:57:45,781 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1868561.3333333333, ans=0.125 2023-10-15 00:57:52,190 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten.whitening_limit, batch_count=1868608.0, ans=15.0 2023-10-15 00:58:37,181 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1868748.0, ans=0.0 2023-10-15 00:58:46,032 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1868748.0, ans=0.0 2023-10-15 00:59:09,401 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.07 vs. limit=15.0 2023-10-15 00:59:11,206 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1868841.3333333333, ans=0.125 2023-10-15 00:59:11,307 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1868841.3333333333, ans=0.2 2023-10-15 00:59:12,316 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.97 vs. limit=15.0 2023-10-15 00:59:27,184 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1868934.6666666667, ans=0.0 2023-10-15 00:59:28,128 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1868934.6666666667, ans=0.0 2023-10-15 00:59:35,804 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-15 00:59:38,037 INFO [train.py:1031] (3/4) Epoch 30, batch 4500, loss[loss=0.1828, simple_loss=0.2774, pruned_loss=0.04414, over 16982.00 frames. ], tot_loss[loss=0.1844, simple_loss=0.2767, pruned_loss=0.04602, over 29338238.56 frames. ], batch size: 117, lr: 1.14e-03, grad_scale: 16.0 2023-10-15 00:59:40,724 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=7.17 vs. limit=15.0 2023-10-15 00:59:44,947 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.487e+02 1.891e+02 2.120e+02 2.274e+02 3.174e+02, threshold=4.239e+02, percent-clipped=0.0 2023-10-15 00:59:55,064 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.43 vs. 
limit=15.0 2023-10-15 01:00:01,976 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1869074.6666666667, ans=0.125 2023-10-15 01:00:28,364 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=1869168.0, ans=10.0 2023-10-15 01:00:33,540 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1869168.0, ans=0.0 2023-10-15 01:00:41,876 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1869214.6666666667, ans=0.125 2023-10-15 01:00:43,589 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1869214.6666666667, ans=0.1 2023-10-15 01:00:53,421 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1869261.3333333333, ans=0.125 2023-10-15 01:01:11,041 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-15 01:01:30,342 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1869448.0, ans=0.125 2023-10-15 01:01:33,456 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1869448.0, ans=0.0 2023-10-15 01:01:34,068 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.580e+02 1.904e+02 2.030e+02 2.206e+02 2.881e+02, threshold=4.060e+02, percent-clipped=0.0 2023-10-15 01:01:47,553 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1869541.3333333333, ans=0.0 2023-10-15 01:02:12,837 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1869634.6666666667, ans=0.125 2023-10-15 01:02:12,903 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1869634.6666666667, ans=0.125 2023-10-15 01:02:39,669 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.98 vs. 
limit=15.0 2023-10-15 01:02:52,929 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1869821.3333333333, ans=0.125 2023-10-15 01:02:53,836 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1869821.3333333333, ans=0.0 2023-10-15 01:02:54,964 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1869821.3333333333, ans=0.1 2023-10-15 01:03:01,529 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1869821.3333333333, ans=0.0 2023-10-15 01:03:04,056 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1869868.0, ans=0.0 2023-10-15 01:03:14,677 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1869868.0, ans=0.125 2023-10-15 01:03:16,033 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1869914.6666666667, ans=0.04949747468305833 2023-10-15 01:03:19,106 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.17 vs. limit=10.0 2023-10-15 01:03:22,442 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.586e+02 1.997e+02 2.196e+02 2.444e+02 3.253e+02, threshold=4.392e+02, percent-clipped=0.0 2023-10-15 01:03:54,426 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1870054.6666666667, ans=0.2 2023-10-15 01:04:07,804 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1870101.3333333333, ans=0.0 2023-10-15 01:04:14,158 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1870148.0, ans=0.1 2023-10-15 01:04:39,756 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1870241.3333333333, ans=0.0 2023-10-15 01:04:42,310 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1870288.0, ans=0.125 2023-10-15 01:04:42,379 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=1870288.0, ans=0.5 2023-10-15 01:04:42,387 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-15 01:04:49,115 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.63 vs. limit=22.5 2023-10-15 01:04:52,448 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.10 vs. 
limit=22.5 2023-10-15 01:04:53,245 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1870334.6666666667, ans=0.125 2023-10-15 01:05:06,853 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1870381.3333333333, ans=0.0 2023-10-15 01:05:12,095 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.549e+02 1.827e+02 1.988e+02 2.163e+02 3.045e+02, threshold=3.977e+02, percent-clipped=0.0 2023-10-15 01:06:00,591 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.73 vs. limit=10.0 2023-10-15 01:06:23,081 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.32 vs. limit=15.0 2023-10-15 01:06:24,778 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1870661.3333333333, ans=0.125 2023-10-15 01:06:33,810 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.86 vs. limit=6.0 2023-10-15 01:06:38,333 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1870708.0, ans=0.0 2023-10-15 01:06:43,873 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1870754.6666666667, ans=0.125 2023-10-15 01:06:49,673 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=3.91 vs. limit=15.0 2023-10-15 01:07:11,078 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.522e+02 1.852e+02 2.029e+02 2.177e+02 2.771e+02, threshold=4.059e+02, percent-clipped=0.0 2023-10-15 01:07:25,226 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1870941.3333333333, ans=0.1 2023-10-15 01:07:38,819 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1870988.0, ans=0.0 2023-10-15 01:07:45,558 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1870988.0, ans=0.125 2023-10-15 01:07:48,207 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1870988.0, ans=0.1 2023-10-15 01:08:05,671 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-15 01:08:05,702 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1871081.3333333333, ans=0.2 2023-10-15 01:08:15,937 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-15 01:08:42,042 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.29 vs. 
limit=15.0 2023-10-15 01:08:51,886 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1871268.0, ans=0.1 2023-10-15 01:08:55,101 INFO [train.py:1031] (3/4) Epoch 30, batch 5000, loss[loss=0.1792, simple_loss=0.2771, pruned_loss=0.04063, over 16812.00 frames. ], tot_loss[loss=0.1842, simple_loss=0.2765, pruned_loss=0.04592, over 30127415.60 frames. ], batch size: 72, lr: 1.14e-03, grad_scale: 32.0 2023-10-15 01:09:01,488 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.576e+02 1.886e+02 2.036e+02 2.222e+02 3.544e+02, threshold=4.073e+02, percent-clipped=0.0 2023-10-15 01:09:09,654 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1871361.3333333333, ans=0.125 2023-10-15 01:09:14,892 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1871361.3333333333, ans=0.1 2023-10-15 01:09:45,556 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1871501.3333333333, ans=0.0 2023-10-15 01:10:24,654 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.73 vs. limit=22.5 2023-10-15 01:10:34,589 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1871734.6666666667, ans=0.0 2023-10-15 01:10:44,568 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1871734.6666666667, ans=0.125 2023-10-15 01:10:55,017 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1871781.3333333333, ans=0.0 2023-10-15 01:10:57,076 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.573e+02 1.965e+02 2.173e+02 2.444e+02 4.363e+02, threshold=4.346e+02, percent-clipped=1.0 2023-10-15 01:11:11,917 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1871874.6666666667, ans=0.125 2023-10-15 01:11:26,397 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1871921.3333333333, ans=0.1 2023-10-15 01:11:30,314 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1871921.3333333333, ans=0.2 2023-10-15 01:12:06,415 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1872108.0, ans=0.2 2023-10-15 01:12:15,957 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1872154.6666666667, ans=0.04949747468305833 2023-10-15 01:12:18,819 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1872154.6666666667, ans=0.125 2023-10-15 01:12:19,684 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1872154.6666666667, ans=0.0 2023-10-15 01:12:25,920 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.57 vs. 
limit=6.0 2023-10-15 01:12:40,118 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1872248.0, ans=0.125 2023-10-15 01:12:45,205 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.558e+02 1.873e+02 1.998e+02 2.198e+02 2.761e+02, threshold=3.997e+02, percent-clipped=0.0 2023-10-15 01:12:52,275 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1872294.6666666667, ans=0.2 2023-10-15 01:12:58,810 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.43 vs. limit=15.0 2023-10-15 01:13:05,803 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1872341.3333333333, ans=0.0 2023-10-15 01:13:16,093 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1872388.0, ans=0.125 2023-10-15 01:13:23,465 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1872434.6666666667, ans=0.1 2023-10-15 01:13:59,817 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1872574.6666666667, ans=0.125 2023-10-15 01:14:07,324 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1872574.6666666667, ans=0.125 2023-10-15 01:14:23,057 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1872668.0, ans=0.0 2023-10-15 01:14:39,778 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.77 vs. limit=15.0 2023-10-15 01:14:40,534 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1872714.6666666667, ans=0.0 2023-10-15 01:14:41,716 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1872714.6666666667, ans=0.125 2023-10-15 01:14:46,787 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.599e+02 1.856e+02 1.995e+02 2.179e+02 3.417e+02, threshold=3.990e+02, percent-clipped=0.0 2023-10-15 01:14:49,657 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.31 vs. limit=22.5 2023-10-15 01:15:13,003 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1872808.0, ans=0.125 2023-10-15 01:15:31,222 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.24 vs. limit=22.5 2023-10-15 01:15:32,042 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1872901.3333333333, ans=0.1 2023-10-15 01:15:52,757 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.09 vs. 
limit=15.0 2023-10-15 01:16:18,429 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1873041.3333333333, ans=0.125 2023-10-15 01:16:39,965 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1873134.6666666667, ans=0.0 2023-10-15 01:16:40,974 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1873134.6666666667, ans=0.125 2023-10-15 01:16:53,389 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1873181.3333333333, ans=0.0 2023-10-15 01:17:02,711 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.434e+02 1.797e+02 1.984e+02 2.156e+02 2.674e+02, threshold=3.968e+02, percent-clipped=0.0 2023-10-15 01:17:11,101 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=14.05 vs. limit=22.5 2023-10-15 01:17:40,301 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1873368.0, ans=0.125 2023-10-15 01:17:52,933 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1873414.6666666667, ans=0.125 2023-10-15 01:17:55,566 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.85 vs. limit=15.0 2023-10-15 01:18:11,276 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1873461.3333333333, ans=0.125 2023-10-15 01:18:51,454 INFO [train.py:1031] (3/4) Epoch 30, batch 5500, loss[loss=0.1744, simple_loss=0.2746, pruned_loss=0.03714, over 16845.00 frames. ], tot_loss[loss=0.1839, simple_loss=0.2764, pruned_loss=0.04571, over 30750067.83 frames. ], batch size: 93, lr: 1.14e-03, grad_scale: 16.0 2023-10-15 01:18:52,155 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1873648.0, ans=0.125 2023-10-15 01:18:54,263 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1873648.0, ans=0.2 2023-10-15 01:19:00,816 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.498e+02 1.856e+02 2.021e+02 2.243e+02 2.879e+02, threshold=4.043e+02, percent-clipped=0.0 2023-10-15 01:19:04,787 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.39 vs. limit=15.0 2023-10-15 01:19:23,878 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1873741.3333333333, ans=0.2 2023-10-15 01:19:26,290 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.15 vs. limit=22.5 2023-10-15 01:19:27,320 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=15.55 vs. 
limit=22.5 2023-10-15 01:19:29,973 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1873788.0, ans=0.0 2023-10-15 01:19:49,999 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1873881.3333333333, ans=0.125 2023-10-15 01:19:50,003 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1873881.3333333333, ans=0.0 2023-10-15 01:19:52,878 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1873881.3333333333, ans=0.0 2023-10-15 01:20:17,734 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1873974.6666666667, ans=0.125 2023-10-15 01:20:18,798 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.69 vs. limit=15.0 2023-10-15 01:20:42,642 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1874068.0, ans=0.125 2023-10-15 01:20:44,676 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1874068.0, ans=0.04949747468305833 2023-10-15 01:21:00,210 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.519e+02 1.817e+02 1.985e+02 2.232e+02 3.003e+02, threshold=3.969e+02, percent-clipped=0.0 2023-10-15 01:21:05,493 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1874161.3333333333, ans=0.2 2023-10-15 01:21:19,393 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1874208.0, ans=0.1 2023-10-15 01:21:27,111 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1874254.6666666667, ans=0.0 2023-10-15 01:21:27,373 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1874254.6666666667, ans=0.2 2023-10-15 01:21:30,522 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1874254.6666666667, ans=0.125 2023-10-15 01:21:36,415 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1874254.6666666667, ans=0.125 2023-10-15 01:21:40,909 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1874301.3333333333, ans=0.1 2023-10-15 01:21:55,521 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.13 vs. limit=12.0 2023-10-15 01:22:05,025 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1874394.6666666667, ans=0.04949747468305833 2023-10-15 01:22:18,971 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1874441.3333333333, ans=0.125 2023-10-15 01:22:33,710 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.82 vs. 
limit=22.5 2023-10-15 01:22:37,521 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1874488.0, ans=0.025 2023-10-15 01:22:39,898 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-15 01:22:42,195 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.41 vs. limit=12.0 2023-10-15 01:23:05,256 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1874581.3333333333, ans=0.1 2023-10-15 01:23:05,870 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.541e+02 1.889e+02 2.052e+02 2.253e+02 2.995e+02, threshold=4.104e+02, percent-clipped=0.0 2023-10-15 01:23:16,256 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1874628.0, ans=0.0 2023-10-15 01:23:24,569 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1874674.6666666667, ans=0.1 2023-10-15 01:23:39,348 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1874721.3333333333, ans=0.0 2023-10-15 01:24:28,564 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1874908.0, ans=0.125 2023-10-15 01:24:49,898 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1875001.3333333333, ans=0.125 2023-10-15 01:25:05,937 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1875048.0, ans=0.2 2023-10-15 01:25:11,461 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.590e+02 1.860e+02 1.996e+02 2.141e+02 2.689e+02, threshold=3.991e+02, percent-clipped=0.0 2023-10-15 01:25:20,766 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.62 vs. limit=15.0 2023-10-15 01:25:52,383 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1875188.0, ans=0.2 2023-10-15 01:26:17,391 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.72 vs. limit=15.0 2023-10-15 01:26:19,107 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1875328.0, ans=0.1 2023-10-15 01:26:35,666 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.92 vs. 
limit=22.5 2023-10-15 01:26:48,085 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1875421.3333333333, ans=0.125 2023-10-15 01:26:57,977 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=1875468.0, ans=0.05 2023-10-15 01:27:11,403 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=1875514.6666666667, ans=15.0 2023-10-15 01:27:19,781 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.498e+02 1.791e+02 1.995e+02 2.136e+02 2.610e+02, threshold=3.990e+02, percent-clipped=0.0 2023-10-15 01:27:22,227 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1875561.3333333333, ans=0.07 2023-10-15 01:27:22,282 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1875561.3333333333, ans=0.125 2023-10-15 01:27:22,562 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.79 vs. limit=15.0 2023-10-15 01:27:35,728 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1875608.0, ans=0.2 2023-10-15 01:27:51,743 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1875654.6666666667, ans=0.125 2023-10-15 01:27:58,095 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1875701.3333333333, ans=0.1 2023-10-15 01:27:58,186 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1875701.3333333333, ans=0.1 2023-10-15 01:28:11,232 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=1875748.0, ans=15.0 2023-10-15 01:28:15,505 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1875748.0, ans=0.125 2023-10-15 01:28:23,266 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1875794.6666666667, ans=0.0 2023-10-15 01:29:11,001 INFO [train.py:1031] (3/4) Epoch 30, batch 6000, loss[loss=0.1646, simple_loss=0.253, pruned_loss=0.03811, over 15492.00 frames. ], tot_loss[loss=0.1845, simple_loss=0.2768, pruned_loss=0.04614, over 31188617.87 frames. ], batch size: 35, lr: 1.14e-03, grad_scale: 32.0 2023-10-15 01:29:11,481 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-15 01:29:17,848 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1875981.3333333333, ans=0.04949747468305833 2023-10-15 01:29:24,273 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.528e+02 1.911e+02 2.122e+02 2.327e+02 3.056e+02, threshold=4.245e+02, percent-clipped=0.0 2023-10-15 01:29:25,008 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.92 vs. 
limit=15.0 2023-10-15 01:29:44,347 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=4.04 vs. limit=12.0 2023-10-15 01:29:58,227 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.85 vs. limit=15.0 2023-10-15 01:30:08,493 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1876168.0, ans=0.1 2023-10-15 01:30:16,393 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.80 vs. limit=22.5 2023-10-15 01:30:22,222 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.68 vs. limit=22.5 2023-10-15 01:30:38,364 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1876308.0, ans=0.0 2023-10-15 01:30:41,129 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.73 vs. limit=22.5 2023-10-15 01:31:23,204 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.41 vs. limit=22.5 2023-10-15 01:31:25,438 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.622e+02 1.915e+02 2.131e+02 2.443e+02 3.249e+02, threshold=4.263e+02, percent-clipped=0.0 2023-10-15 01:31:48,936 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1876588.0, ans=0.125 2023-10-15 01:31:58,918 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1876588.0, ans=0.125 2023-10-15 01:32:05,887 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1876634.6666666667, ans=0.2 2023-10-15 01:32:26,983 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1876728.0, ans=0.0 2023-10-15 01:32:46,772 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1876774.6666666667, ans=0.035 2023-10-15 01:33:03,313 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1876868.0, ans=0.0 2023-10-15 01:33:12,305 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.85 vs. limit=15.0 2023-10-15 01:33:26,798 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.556e+02 1.896e+02 2.061e+02 2.219e+02 4.062e+02, threshold=4.122e+02, percent-clipped=0.0 2023-10-15 01:33:31,548 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1876961.3333333333, ans=0.125 2023-10-15 01:33:32,881 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.59 vs. limit=10.0 2023-10-15 01:33:33,755 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=13.85 vs. 
limit=22.5 2023-10-15 01:33:33,793 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.99 vs. limit=12.0 2023-10-15 01:33:48,034 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1877008.0, ans=0.125 2023-10-15 01:34:17,500 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.57 vs. limit=22.5 2023-10-15 01:34:44,172 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.52 vs. limit=15.0 2023-10-15 01:35:08,783 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1877334.6666666667, ans=0.0 2023-10-15 01:35:31,293 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1877381.3333333333, ans=0.125 2023-10-15 01:35:34,019 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.594e+02 1.867e+02 2.051e+02 2.298e+02 3.159e+02, threshold=4.102e+02, percent-clipped=0.0 2023-10-15 01:35:50,674 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.29 vs. limit=15.0 2023-10-15 01:36:15,154 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1877568.0, ans=0.125 2023-10-15 01:36:21,169 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.40 vs. limit=15.0 2023-10-15 01:36:23,027 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1877614.6666666667, ans=0.125 2023-10-15 01:36:34,801 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1877614.6666666667, ans=0.125 2023-10-15 01:36:45,051 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys.whitening_limit, batch_count=1877661.3333333333, ans=6.0 2023-10-15 01:36:50,031 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.27 vs. limit=15.0 2023-10-15 01:36:53,423 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1877661.3333333333, ans=0.1 2023-10-15 01:37:26,762 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.95 vs. 
limit=15.0 2023-10-15 01:37:42,316 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-15 01:37:50,638 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.423e+02 1.831e+02 1.986e+02 2.130e+02 3.000e+02, threshold=3.972e+02, percent-clipped=0.0 2023-10-15 01:38:02,362 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1877941.3333333333, ans=0.125 2023-10-15 01:38:05,821 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.22 vs. limit=22.5 2023-10-15 01:38:27,396 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1877988.0, ans=0.125 2023-10-15 01:38:27,475 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1877988.0, ans=0.0 2023-10-15 01:38:39,129 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1878034.6666666667, ans=0.125 2023-10-15 01:38:41,386 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1878081.3333333333, ans=0.04949747468305833 2023-10-15 01:38:43,129 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1878081.3333333333, ans=0.2 2023-10-15 01:39:00,181 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1878128.0, ans=0.0 2023-10-15 01:39:00,883 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1878128.0, ans=0.0 2023-10-15 01:39:39,860 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1878268.0, ans=0.1 2023-10-15 01:39:47,743 INFO [train.py:1031] (3/4) Epoch 30, batch 6500, loss[loss=0.185, simple_loss=0.2807, pruned_loss=0.04468, over 16933.00 frames. ], tot_loss[loss=0.185, simple_loss=0.2773, pruned_loss=0.04639, over 31516924.63 frames. ], batch size: 138, lr: 1.14e-03, grad_scale: 16.0 2023-10-15 01:40:02,294 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1878314.6666666667, ans=0.125 2023-10-15 01:40:06,175 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.589e+02 1.962e+02 2.214e+02 2.471e+02 3.623e+02, threshold=4.428e+02, percent-clipped=0.0 2023-10-15 01:40:07,742 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1878361.3333333333, ans=0.125 2023-10-15 01:40:10,568 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1878361.3333333333, ans=0.2 2023-10-15 01:40:56,097 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1878501.3333333333, ans=0.125 2023-10-15 01:41:27,123 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.69 vs. 
limit=15.0 2023-10-15 01:41:36,070 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.98 vs. limit=12.0 2023-10-15 01:41:46,382 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1878688.0, ans=0.025 2023-10-15 01:41:58,497 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1878734.6666666667, ans=0.0 2023-10-15 01:42:20,270 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.533e+02 1.864e+02 1.983e+02 2.198e+02 3.063e+02, threshold=3.967e+02, percent-clipped=0.0 2023-10-15 01:42:47,099 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1878921.3333333333, ans=0.125 2023-10-15 01:43:42,728 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1879154.6666666667, ans=0.5 2023-10-15 01:43:44,084 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.39 vs. limit=15.0 2023-10-15 01:43:53,536 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-15 01:44:04,459 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.64 vs. limit=15.0 2023-10-15 01:44:06,478 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1879248.0, ans=0.1 2023-10-15 01:44:23,327 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.636e+02 1.866e+02 2.013e+02 2.338e+02 3.143e+02, threshold=4.026e+02, percent-clipped=0.0 2023-10-15 01:44:47,977 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1879388.0, ans=0.0 2023-10-15 01:44:52,880 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1879388.0, ans=0.125 2023-10-15 01:44:57,376 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1879388.0, ans=0.2 2023-10-15 01:45:02,237 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1879434.6666666667, ans=0.0 2023-10-15 01:45:29,557 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1879528.0, ans=0.125 2023-10-15 01:45:32,601 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.57 vs. limit=15.0 2023-10-15 01:45:38,279 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1879574.6666666667, ans=0.125 2023-10-15 01:45:41,564 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1879574.6666666667, ans=10.0 2023-10-15 01:45:41,928 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.04 vs. 
limit=15.0 2023-10-15 01:45:57,043 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1879621.3333333333, ans=0.125 2023-10-15 01:46:05,955 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1879668.0, ans=0.1 2023-10-15 01:46:30,795 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-15 01:46:38,732 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.511e+02 1.838e+02 2.004e+02 2.365e+02 4.217e+02, threshold=4.008e+02, percent-clipped=1.0 2023-10-15 01:46:44,942 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1879761.3333333333, ans=0.0 2023-10-15 01:46:49,696 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1879761.3333333333, ans=0.2 2023-10-15 01:47:00,179 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.52 vs. limit=6.0 2023-10-15 01:47:08,617 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1879854.6666666667, ans=0.125 2023-10-15 01:47:31,618 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1879901.3333333333, ans=0.0 2023-10-15 01:47:38,942 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1879948.0, ans=0.125 2023-10-15 01:47:40,077 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1879948.0, ans=0.125 2023-10-15 01:47:56,715 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=4.97 vs. limit=15.0 2023-10-15 01:48:00,403 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1880041.3333333333, ans=0.125 2023-10-15 01:48:05,976 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1880041.3333333333, ans=0.125 2023-10-15 01:48:06,230 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.88 vs. limit=15.0 2023-10-15 01:48:18,945 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1880088.0, ans=0.125 2023-10-15 01:48:23,414 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_ff3.min_abs, batch_count=1880088.0, ans=0.2 2023-10-15 01:48:41,252 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1880181.3333333333, ans=0.125 2023-10-15 01:48:44,567 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.88 vs. 
limit=15.0 2023-10-15 01:48:48,110 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1880181.3333333333, ans=0.125 2023-10-15 01:48:58,571 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.543e+02 1.751e+02 1.872e+02 2.083e+02 3.635e+02, threshold=3.744e+02, percent-clipped=0.0 2023-10-15 01:49:10,607 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1880274.6666666667, ans=0.1 2023-10-15 01:49:12,890 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1880274.6666666667, ans=0.125 2023-10-15 01:49:16,976 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1880274.6666666667, ans=0.125 2023-10-15 01:49:18,626 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.37 vs. limit=12.0 2023-10-15 01:49:22,353 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff2.min_abs, batch_count=1880274.6666666667, ans=0.1 2023-10-15 01:49:39,864 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1880368.0, ans=0.125 2023-10-15 01:49:48,295 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1880368.0, ans=0.1 2023-10-15 01:49:57,675 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1880414.6666666667, ans=0.125 2023-10-15 01:50:06,006 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1880461.3333333333, ans=0.0 2023-10-15 01:50:20,355 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1880508.0, ans=0.125 2023-10-15 01:50:22,535 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.68 vs. limit=10.0 2023-10-15 01:50:29,920 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.77 vs. limit=6.0 2023-10-15 01:50:35,215 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1880554.6666666667, ans=0.125 2023-10-15 01:50:49,673 INFO [train.py:1031] (3/4) Epoch 30, batch 7000, loss[loss=0.1759, simple_loss=0.2716, pruned_loss=0.0401, over 16903.00 frames. ], tot_loss[loss=0.185, simple_loss=0.2775, pruned_loss=0.04623, over 31782540.73 frames. ], batch size: 87, lr: 1.14e-03, grad_scale: 16.0 2023-10-15 01:51:06,298 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.545e+02 1.909e+02 2.083e+02 2.272e+02 2.871e+02, threshold=4.166e+02, percent-clipped=0.0 2023-10-15 01:51:22,434 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1880741.3333333333, ans=0.125 2023-10-15 01:51:24,805 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.15 vs. 
limit=15.0 2023-10-15 01:51:36,299 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1880788.0, ans=0.0 2023-10-15 01:51:47,915 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1880834.6666666667, ans=0.0 2023-10-15 01:51:51,171 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1880881.3333333333, ans=0.125 2023-10-15 01:51:53,116 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1880881.3333333333, ans=0.125 2023-10-15 01:52:08,482 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1880928.0, ans=0.125 2023-10-15 01:52:11,509 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1880928.0, ans=0.0 2023-10-15 01:52:30,211 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1881021.3333333333, ans=0.1 2023-10-15 01:52:36,914 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.12 vs. limit=15.0 2023-10-15 01:52:45,187 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=1881068.0, ans=15.0 2023-10-15 01:52:52,490 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.80 vs. limit=15.0 2023-10-15 01:53:00,053 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1881161.3333333333, ans=0.125 2023-10-15 01:53:02,672 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.564e+02 1.864e+02 1.989e+02 2.255e+02 2.619e+02, threshold=3.977e+02, percent-clipped=0.0 2023-10-15 01:53:03,068 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1881161.3333333333, ans=0.1 2023-10-15 01:53:22,095 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1881208.0, ans=0.2 2023-10-15 01:53:31,205 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1881254.6666666667, ans=0.04949747468305833 2023-10-15 01:53:45,145 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.03 vs. limit=15.0 2023-10-15 01:53:52,627 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.58 vs. 
limit=6.0 2023-10-15 01:53:53,466 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1881301.3333333333, ans=0.125 2023-10-15 01:54:01,879 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1881348.0, ans=0.125 2023-10-15 01:54:29,961 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1881441.3333333333, ans=0.125 2023-10-15 01:54:46,298 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-15 01:55:02,808 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-15 01:55:07,593 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1881581.3333333333, ans=0.2 2023-10-15 01:55:17,920 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1881628.0, ans=0.1 2023-10-15 01:55:23,844 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.565e+02 1.878e+02 2.034e+02 2.237e+02 2.721e+02, threshold=4.069e+02, percent-clipped=0.0 2023-10-15 01:55:56,424 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1881674.6666666667, ans=0.0 2023-10-15 01:56:00,643 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1881721.3333333333, ans=0.125 2023-10-15 01:56:09,153 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1881721.3333333333, ans=0.0 2023-10-15 01:56:32,470 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.95 vs. limit=15.0 2023-10-15 01:56:34,619 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1881814.6666666667, ans=0.2 2023-10-15 01:57:01,750 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1881908.0, ans=0.125 2023-10-15 01:57:06,137 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1881954.6666666667, ans=0.125 2023-10-15 01:57:06,181 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1881954.6666666667, ans=0.125 2023-10-15 01:57:07,090 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1881954.6666666667, ans=0.0 2023-10-15 01:57:17,463 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1882001.3333333333, ans=0.125 2023-10-15 01:57:23,374 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.00 vs. 
limit=12.0 2023-10-15 01:57:38,547 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1882048.0, ans=0.125 2023-10-15 01:57:43,041 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1882094.6666666667, ans=0.125 2023-10-15 01:57:43,589 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.576e+02 1.755e+02 2.001e+02 2.179e+02 2.866e+02, threshold=4.003e+02, percent-clipped=0.0 2023-10-15 01:58:07,000 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1882141.3333333333, ans=0.0 2023-10-15 01:58:07,906 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1882188.0, ans=0.015 2023-10-15 01:58:08,196 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1882188.0, ans=0.125 2023-10-15 01:58:09,067 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1882188.0, ans=0.125 2023-10-15 01:58:27,093 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.57 vs. limit=15.0 2023-10-15 01:58:30,992 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1882234.6666666667, ans=0.0 2023-10-15 01:58:32,207 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1882234.6666666667, ans=0.2 2023-10-15 01:58:58,319 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.51 vs. 
limit=8.0 2023-10-15 01:59:17,643 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1882421.3333333333, ans=0.2 2023-10-15 01:59:34,343 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1882468.0, ans=0.025 2023-10-15 01:59:36,668 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1882468.0, ans=0.025 2023-10-15 01:59:47,208 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1882514.6666666667, ans=0.2 2023-10-15 01:59:57,429 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1882514.6666666667, ans=0.125 2023-10-15 02:00:03,914 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.422e+02 1.910e+02 2.113e+02 2.462e+02 3.424e+02, threshold=4.227e+02, percent-clipped=0.0 2023-10-15 02:00:30,910 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1882654.6666666667, ans=0.0 2023-10-15 02:01:02,113 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1882794.6666666667, ans=0.0 2023-10-15 02:01:13,733 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=1882841.3333333333, ans=15.0 2023-10-15 02:01:20,174 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1882841.3333333333, ans=0.1 2023-10-15 02:01:28,579 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1882888.0, ans=0.0 2023-10-15 02:01:29,897 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.84 vs. limit=15.0 2023-10-15 02:01:47,553 INFO [train.py:1031] (3/4) Epoch 30, batch 7500, loss[loss=0.2082, simple_loss=0.2963, pruned_loss=0.06007, over 15498.00 frames. ], tot_loss[loss=0.1849, simple_loss=0.2773, pruned_loss=0.04623, over 31978667.95 frames. ], batch size: 35, lr: 1.14e-03, grad_scale: 16.0 2023-10-15 02:01:59,382 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.48 vs. 
limit=15.0 2023-10-15 02:02:02,288 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.507e+02 1.922e+02 2.086e+02 2.342e+02 3.697e+02, threshold=4.171e+02, percent-clipped=0.0 2023-10-15 02:02:02,805 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1883028.0, ans=0.05 2023-10-15 02:02:53,266 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1883214.6666666667, ans=0.0 2023-10-15 02:03:06,194 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1883261.3333333333, ans=0.2 2023-10-15 02:03:14,614 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1883308.0, ans=0.125 2023-10-15 02:03:22,831 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1883354.6666666667, ans=0.0 2023-10-15 02:03:24,760 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1883354.6666666667, ans=0.0 2023-10-15 02:03:39,057 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1883401.3333333333, ans=0.05 2023-10-15 02:03:46,424 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1883401.3333333333, ans=0.1 2023-10-15 02:03:47,673 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1883401.3333333333, ans=0.07 2023-10-15 02:03:55,616 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1883448.0, ans=0.125 2023-10-15 02:04:07,135 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.483e+02 1.851e+02 2.004e+02 2.113e+02 3.812e+02, threshold=4.007e+02, percent-clipped=0.0 2023-10-15 02:04:16,175 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1883541.3333333333, ans=0.125 2023-10-15 02:04:59,057 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_abs, batch_count=1883634.6666666667, ans=0.5 2023-10-15 02:04:59,127 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1883634.6666666667, ans=0.0 2023-10-15 02:05:06,790 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1883634.6666666667, ans=0.0 2023-10-15 02:05:25,360 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1883728.0, ans=0.0 2023-10-15 02:05:33,739 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1883728.0, ans=0.0 2023-10-15 02:05:44,006 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1883774.6666666667, ans=0.125 2023-10-15 02:05:48,828 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.90 vs. 
limit=10.0 2023-10-15 02:06:03,137 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1883821.3333333333, ans=0.5 2023-10-15 02:06:08,283 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1883868.0, ans=0.0 2023-10-15 02:06:17,503 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1883914.6666666667, ans=0.0 2023-10-15 02:06:18,942 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.62 vs. limit=15.0 2023-10-15 02:06:23,808 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1883914.6666666667, ans=0.125 2023-10-15 02:06:34,580 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.471e+02 1.871e+02 2.111e+02 2.358e+02 3.300e+02, threshold=4.222e+02, percent-clipped=0.0 2023-10-15 02:06:40,874 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1883961.3333333333, ans=0.2 2023-10-15 02:07:21,703 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1884148.0, ans=0.015 2023-10-15 02:07:26,683 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1884148.0, ans=0.0 2023-10-15 02:07:54,143 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1884288.0, ans=0.125 2023-10-15 02:08:04,481 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1884334.6666666667, ans=0.1 2023-10-15 02:08:05,494 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1884334.6666666667, ans=0.125 2023-10-15 02:08:10,720 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1884334.6666666667, ans=0.1 2023-10-15 02:08:33,384 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.70 vs. 
limit=15.0 2023-10-15 02:08:33,728 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.627e+02 1.966e+02 2.143e+02 2.384e+02 3.213e+02, threshold=4.286e+02, percent-clipped=0.0 2023-10-15 02:08:39,134 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1884428.0, ans=0.05 2023-10-15 02:08:46,809 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1884474.6666666667, ans=0.2 2023-10-15 02:08:47,950 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1884474.6666666667, ans=0.07 2023-10-15 02:08:51,555 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1884474.6666666667, ans=0.125 2023-10-15 02:08:52,509 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1884474.6666666667, ans=0.125 2023-10-15 02:09:08,678 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.00 vs. limit=15.0 2023-10-15 02:09:25,556 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1884568.0, ans=0.125 2023-10-15 02:09:36,504 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1884614.6666666667, ans=0.0 2023-10-15 02:10:08,617 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1884708.0, ans=0.125 2023-10-15 02:10:12,430 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.23 vs. limit=10.0 2023-10-15 02:10:43,643 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1884848.0, ans=0.0 2023-10-15 02:10:54,644 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.571e+02 1.840e+02 1.982e+02 2.150e+02 2.940e+02, threshold=3.963e+02, percent-clipped=0.0 2023-10-15 02:11:11,379 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1884941.3333333333, ans=0.0 2023-10-15 02:11:14,007 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1884941.3333333333, ans=0.0 2023-10-15 02:11:35,061 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.min_positive, batch_count=1885034.6666666667, ans=0.05 2023-10-15 02:11:37,473 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.61 vs. 
limit=15.0 2023-10-15 02:11:40,302 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1885034.6666666667, ans=0.125 2023-10-15 02:11:42,887 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1885081.3333333333, ans=0.07 2023-10-15 02:11:49,291 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1885081.3333333333, ans=0.125 2023-10-15 02:12:14,061 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.74 vs. limit=10.0 2023-10-15 02:12:20,641 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1885174.6666666667, ans=0.2 2023-10-15 02:12:21,033 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.12 vs. limit=10.0 2023-10-15 02:12:24,278 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1885221.3333333333, ans=0.1 2023-10-15 02:12:36,181 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=6.96 vs. limit=15.0 2023-10-15 02:12:41,450 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.86 vs. limit=15.0 2023-10-15 02:12:45,447 INFO [train.py:1031] (3/4) Epoch 30, batch 8000, loss[loss=0.1759, simple_loss=0.2704, pruned_loss=0.04068, over 16835.00 frames. ], tot_loss[loss=0.1842, simple_loss=0.2768, pruned_loss=0.04576, over 32184834.25 frames. ], batch size: 188, lr: 1.14e-03, grad_scale: 32.0 2023-10-15 02:13:05,334 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=5.89 vs. limit=15.0 2023-10-15 02:13:05,818 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1885361.3333333333, ans=0.0 2023-10-15 02:13:07,464 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.542e+02 1.740e+02 1.939e+02 2.204e+02 3.324e+02, threshold=3.878e+02, percent-clipped=0.0 2023-10-15 02:13:12,541 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1885361.3333333333, ans=0.2 2023-10-15 02:13:30,623 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=6.91 vs. limit=15.0 2023-10-15 02:13:33,347 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1885454.6666666667, ans=0.0 2023-10-15 02:13:59,018 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1885548.0, ans=0.2 2023-10-15 02:14:06,382 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.63 vs. 
limit=10.0 2023-10-15 02:14:38,189 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1885688.0, ans=0.125 2023-10-15 02:15:04,010 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1885781.3333333333, ans=0.035 2023-10-15 02:15:11,888 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.538e+02 1.733e+02 1.945e+02 2.110e+02 3.210e+02, threshold=3.890e+02, percent-clipped=0.0 2023-10-15 02:15:22,796 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.54 vs. limit=22.5 2023-10-15 02:15:30,846 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1885874.6666666667, ans=0.125 2023-10-15 02:16:00,017 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=5.11 vs. limit=12.0 2023-10-15 02:16:01,976 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1886014.6666666667, ans=0.125 2023-10-15 02:16:26,367 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1886061.3333333333, ans=0.125 2023-10-15 02:16:40,453 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1886108.0, ans=0.09899494936611666 2023-10-15 02:16:58,093 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1886154.6666666667, ans=0.125 2023-10-15 02:17:11,119 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1886201.3333333333, ans=0.125 2023-10-15 02:17:17,640 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=1886201.3333333333, ans=0.5 2023-10-15 02:17:33,439 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-15 02:17:36,682 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.614e+02 1.837e+02 1.970e+02 2.156e+02 2.956e+02, threshold=3.941e+02, percent-clipped=0.0 2023-10-15 02:18:04,335 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1886388.0, ans=0.04949747468305833 2023-10-15 02:18:04,336 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1886388.0, ans=0.125 2023-10-15 02:18:05,913 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1886388.0, ans=0.0 2023-10-15 02:18:20,618 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1886481.3333333333, ans=0.2 2023-10-15 02:18:38,757 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.23 vs. 
limit=22.5 2023-10-15 02:18:45,070 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1886574.6666666667, ans=0.0 2023-10-15 02:19:06,379 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1886668.0, ans=0.0 2023-10-15 02:19:13,344 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1886668.0, ans=0.125 2023-10-15 02:19:34,174 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1886761.3333333333, ans=0.0 2023-10-15 02:19:38,428 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.566e+02 1.821e+02 1.980e+02 2.148e+02 2.727e+02, threshold=3.959e+02, percent-clipped=0.0 2023-10-15 02:19:55,954 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1.whitening_limit, batch_count=1886808.0, ans=10.0 2023-10-15 02:20:01,528 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1886854.6666666667, ans=0.125 2023-10-15 02:20:06,966 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.46 vs. limit=15.0 2023-10-15 02:20:25,054 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-15 02:20:52,457 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.56 vs. limit=15.0 2023-10-15 02:20:58,108 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1886994.6666666667, ans=0.1 2023-10-15 02:21:05,433 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.81 vs. limit=22.5 2023-10-15 02:21:13,431 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1887088.0, ans=0.0 2023-10-15 02:21:48,710 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1887181.3333333333, ans=0.2 2023-10-15 02:21:50,058 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1887181.3333333333, ans=0.2 2023-10-15 02:22:02,253 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1887228.0, ans=0.0 2023-10-15 02:22:03,743 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.524e+02 1.862e+02 2.020e+02 2.213e+02 2.968e+02, threshold=4.041e+02, percent-clipped=0.0 2023-10-15 02:22:05,553 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1887228.0, ans=0.0 2023-10-15 02:22:26,518 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1887321.3333333333, ans=0.0 2023-10-15 02:22:51,801 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.34 vs. 
limit=10.0 2023-10-15 02:22:52,568 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1887414.6666666667, ans=0.125 2023-10-15 02:22:58,194 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1887414.6666666667, ans=0.0 2023-10-15 02:23:40,161 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1887601.3333333333, ans=0.125 2023-10-15 02:23:54,990 INFO [train.py:1031] (3/4) Epoch 30, batch 8500, loss[loss=0.1724, simple_loss=0.272, pruned_loss=0.03635, over 16906.00 frames. ], tot_loss[loss=0.1843, simple_loss=0.2772, pruned_loss=0.04575, over 32304037.99 frames. ], batch size: 104, lr: 1.14e-03, grad_scale: 32.0 2023-10-15 02:23:57,140 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.83 vs. limit=6.0 2023-10-15 02:24:09,130 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.565e+02 1.846e+02 2.000e+02 2.257e+02 2.817e+02, threshold=4.000e+02, percent-clipped=0.0 2023-10-15 02:24:15,075 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1887694.6666666667, ans=0.05 2023-10-15 02:24:29,951 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1887788.0, ans=0.1 2023-10-15 02:24:40,764 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1887834.6666666667, ans=0.0 2023-10-15 02:24:44,216 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.86 vs. limit=22.5 2023-10-15 02:24:50,628 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.09 vs. limit=15.0 2023-10-15 02:25:04,916 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=1887928.0, ans=0.05 2023-10-15 02:25:15,548 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1887928.0, ans=0.125 2023-10-15 02:25:21,977 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1887974.6666666667, ans=0.125 2023-10-15 02:25:25,964 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1887974.6666666667, ans=0.95 2023-10-15 02:25:27,127 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1887974.6666666667, ans=0.0 2023-10-15 02:25:28,741 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.92 vs. 
limit=12.0 2023-10-15 02:25:47,405 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1888068.0, ans=0.125 2023-10-15 02:26:27,556 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.654e+02 1.900e+02 2.108e+02 2.381e+02 3.088e+02, threshold=4.216e+02, percent-clipped=0.0 2023-10-15 02:26:33,971 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1888208.0, ans=0.0 2023-10-15 02:26:41,967 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1888208.0, ans=0.125 2023-10-15 02:26:44,176 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1888208.0, ans=0.1 2023-10-15 02:26:53,242 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1888254.6666666667, ans=0.1 2023-10-15 02:26:55,575 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.49 vs. limit=12.0 2023-10-15 02:27:02,001 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.49 vs. limit=15.0 2023-10-15 02:27:18,729 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.24 vs. limit=15.0 2023-10-15 02:27:21,846 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1888301.3333333333, ans=0.2 2023-10-15 02:27:59,074 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.02 vs. limit=22.5 2023-10-15 02:28:03,828 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.00 vs. limit=15.0 2023-10-15 02:28:08,557 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1888488.0, ans=0.1 2023-10-15 02:28:14,591 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1888488.0, ans=0.2 2023-10-15 02:28:19,525 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1888488.0, ans=0.2 2023-10-15 02:28:21,145 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.19 vs. 
limit=12.0 2023-10-15 02:28:36,597 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=1888581.3333333333, ans=0.05 2023-10-15 02:28:51,745 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.512e+02 1.807e+02 1.950e+02 2.229e+02 2.917e+02, threshold=3.900e+02, percent-clipped=0.0 2023-10-15 02:28:59,975 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=1888674.6666666667, ans=0.05 2023-10-15 02:29:11,411 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1888721.3333333333, ans=0.125 2023-10-15 02:29:15,275 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1888721.3333333333, ans=0.1 2023-10-15 02:29:42,847 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.60 vs. limit=15.0 2023-10-15 02:29:56,682 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1888861.3333333333, ans=0.125 2023-10-15 02:30:06,060 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1888908.0, ans=0.125 2023-10-15 02:30:11,827 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1888908.0, ans=0.125 2023-10-15 02:30:16,234 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-15 02:30:28,348 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1888954.6666666667, ans=0.0 2023-10-15 02:30:52,620 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1889048.0, ans=0.125 2023-10-15 02:30:57,296 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1889094.6666666667, ans=0.1 2023-10-15 02:30:59,610 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1889094.6666666667, ans=0.0 2023-10-15 02:31:00,153 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.540e+02 1.771e+02 1.898e+02 2.168e+02 3.539e+02, threshold=3.796e+02, percent-clipped=0.0 2023-10-15 02:31:01,641 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1889094.6666666667, ans=0.2 2023-10-15 02:31:15,831 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.89 vs. 
limit=6.0 2023-10-15 02:31:24,828 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1889188.0, ans=0.0 2023-10-15 02:32:00,070 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1889281.3333333333, ans=0.07 2023-10-15 02:33:04,380 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-15 02:33:05,561 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1889514.6666666667, ans=0.0 2023-10-15 02:33:06,607 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1889514.6666666667, ans=0.125 2023-10-15 02:33:06,714 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1889514.6666666667, ans=0.1 2023-10-15 02:33:22,242 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.624e+02 1.849e+02 1.997e+02 2.248e+02 2.743e+02, threshold=3.994e+02, percent-clipped=0.0 2023-10-15 02:33:29,452 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1889608.0, ans=0.04949747468305833 2023-10-15 02:33:58,116 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.55 vs. limit=15.0 2023-10-15 02:34:03,877 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.81 vs. limit=22.5 2023-10-15 02:34:05,558 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1889748.0, ans=0.125 2023-10-15 02:34:15,196 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1889748.0, ans=0.125 2023-10-15 02:35:03,786 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1889934.6666666667, ans=0.09899494936611666 2023-10-15 02:35:06,127 INFO [train.py:1031] (3/4) Epoch 30, batch 9000, loss[loss=0.1613, simple_loss=0.2654, pruned_loss=0.02854, over 16864.00 frames. ], tot_loss[loss=0.1841, simple_loss=0.2767, pruned_loss=0.04568, over 32399296.94 frames. ], batch size: 104, lr: 1.14e-03, grad_scale: 16.0 2023-10-15 02:35:11,749 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1889981.3333333333, ans=0.125 2023-10-15 02:35:24,839 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.603e+02 1.885e+02 2.044e+02 2.275e+02 2.810e+02, threshold=4.087e+02, percent-clipped=0.0 2023-10-15 02:35:36,780 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1890074.6666666667, ans=0.125 2023-10-15 02:35:54,878 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.50 vs. 
limit=15.0 2023-10-15 02:36:19,268 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1890261.3333333333, ans=0.0 2023-10-15 02:36:33,341 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1890308.0, ans=0.125 2023-10-15 02:36:39,391 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1890354.6666666667, ans=0.125 2023-10-15 02:37:13,222 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.06 vs. limit=15.0 2023-10-15 02:37:16,695 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1890494.6666666667, ans=0.125 2023-10-15 02:37:21,850 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.525e+02 1.801e+02 1.952e+02 2.112e+02 2.853e+02, threshold=3.903e+02, percent-clipped=0.0 2023-10-15 02:37:31,878 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.60 vs. limit=15.0 2023-10-15 02:37:47,091 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1890588.0, ans=0.125 2023-10-15 02:37:49,445 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1890588.0, ans=0.125 2023-10-15 02:37:58,874 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.43 vs. limit=15.0 2023-10-15 02:38:04,018 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1890681.3333333333, ans=0.125 2023-10-15 02:38:05,651 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.21 vs. limit=15.0 2023-10-15 02:38:10,876 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.94 vs. limit=12.0 2023-10-15 02:38:12,740 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1890681.3333333333, ans=0.0 2023-10-15 02:38:24,107 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1890728.0, ans=0.1 2023-10-15 02:38:25,992 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1890728.0, ans=0.125 2023-10-15 02:38:30,716 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-15 02:38:36,794 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.55 vs. 
limit=10.0 2023-10-15 02:38:39,836 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-15 02:39:27,492 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1890961.3333333333, ans=0.2 2023-10-15 02:39:28,615 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1890961.3333333333, ans=0.2 2023-10-15 02:39:28,964 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.74 vs. limit=15.0 2023-10-15 02:39:29,319 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.532e+02 1.899e+02 2.058e+02 2.341e+02 3.080e+02, threshold=4.117e+02, percent-clipped=0.0 2023-10-15 02:39:34,947 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1891008.0, ans=0.1 2023-10-15 02:39:39,080 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.02 vs. limit=10.0 2023-10-15 02:39:52,940 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1891054.6666666667, ans=0.125 2023-10-15 02:40:03,156 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.86 vs. limit=15.0 2023-10-15 02:40:07,866 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1891101.3333333333, ans=0.125 2023-10-15 02:40:29,252 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1891194.6666666667, ans=0.125 2023-10-15 02:41:08,775 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1891381.3333333333, ans=0.07 2023-10-15 02:41:11,658 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1891381.3333333333, ans=0.125 2023-10-15 02:41:20,587 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1891428.0, ans=0.1 2023-10-15 02:41:25,773 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1891428.0, ans=0.125 2023-10-15 02:41:26,392 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.626e+02 1.951e+02 2.110e+02 2.368e+02 3.117e+02, threshold=4.220e+02, percent-clipped=0.0 2023-10-15 02:41:39,244 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1891474.6666666667, ans=0.0 2023-10-15 02:41:58,082 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.43 vs. 
limit=15.0 2023-10-15 02:41:59,084 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-15 02:42:10,988 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1891614.6666666667, ans=0.125 2023-10-15 02:42:14,280 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.90 vs. limit=22.5 2023-10-15 02:42:22,924 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1891614.6666666667, ans=0.125 2023-10-15 02:42:36,679 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1891661.3333333333, ans=0.2 2023-10-15 02:42:53,806 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1891754.6666666667, ans=0.0 2023-10-15 02:43:06,421 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1891801.3333333333, ans=0.2 2023-10-15 02:43:16,513 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1891801.3333333333, ans=0.2 2023-10-15 02:43:21,989 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1891848.0, ans=0.2 2023-10-15 02:43:44,938 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.462e+02 1.861e+02 2.112e+02 2.458e+02 2.977e+02, threshold=4.224e+02, percent-clipped=0.0 2023-10-15 02:44:12,842 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1891988.0, ans=0.2 2023-10-15 02:44:18,532 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1892034.6666666667, ans=0.125 2023-10-15 02:44:32,407 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1892034.6666666667, ans=0.125 2023-10-15 02:44:35,660 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1892081.3333333333, ans=0.125 2023-10-15 02:44:37,760 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.60 vs. limit=15.0 2023-10-15 02:44:38,533 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1892081.3333333333, ans=0.1 2023-10-15 02:44:38,706 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1892081.3333333333, ans=0.2 2023-10-15 02:44:42,994 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1892081.3333333333, ans=0.125 2023-10-15 02:44:51,086 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1892128.0, ans=0.0 2023-10-15 02:44:51,485 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.00 vs. 
limit=15.0 2023-10-15 02:45:08,220 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1892174.6666666667, ans=0.0 2023-10-15 02:45:08,275 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1892174.6666666667, ans=0.0 2023-10-15 02:45:41,611 INFO [train.py:1031] (3/4) Epoch 30, batch 9500, loss[loss=0.1931, simple_loss=0.2819, pruned_loss=0.05221, over 16883.00 frames. ], tot_loss[loss=0.1847, simple_loss=0.2775, pruned_loss=0.04596, over 32481880.51 frames. ], batch size: 110, lr: 1.14e-03, grad_scale: 16.0 2023-10-15 02:45:44,131 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1892314.6666666667, ans=0.125 2023-10-15 02:46:00,724 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.605e+02 1.892e+02 2.036e+02 2.285e+02 3.400e+02, threshold=4.072e+02, percent-clipped=0.0 2023-10-15 02:46:03,219 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1892361.3333333333, ans=0.0 2023-10-15 02:46:09,125 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.92 vs. limit=15.0 2023-10-15 02:46:13,662 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-15 02:46:13,748 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-15 02:46:15,914 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1892454.6666666667, ans=0.125 2023-10-15 02:46:24,816 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.63 vs. limit=15.0 2023-10-15 02:46:37,638 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.94 vs. limit=15.0 2023-10-15 02:46:44,926 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1892548.0, ans=0.125 2023-10-15 02:46:48,790 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-15 02:47:14,583 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.80 vs. limit=10.0 2023-10-15 02:47:15,837 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1892641.3333333333, ans=0.1 2023-10-15 02:47:39,309 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1892734.6666666667, ans=0.125 2023-10-15 02:48:02,077 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.589e+02 1.922e+02 2.150e+02 2.355e+02 3.542e+02, threshold=4.300e+02, percent-clipped=0.0 2023-10-15 02:48:04,026 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.42 vs. 
limit=15.0 2023-10-15 02:48:20,721 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1892921.3333333333, ans=0.1 2023-10-15 02:48:24,046 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1892921.3333333333, ans=0.0 2023-10-15 02:48:41,692 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1892968.0, ans=0.125 2023-10-15 02:48:51,704 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-15 02:49:13,597 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.73 vs. limit=15.0 2023-10-15 02:49:14,391 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1893108.0, ans=0.0 2023-10-15 02:49:17,689 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1893108.0, ans=0.0 2023-10-15 02:49:20,315 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1893108.0, ans=0.07 2023-10-15 02:49:37,872 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-15 02:50:11,128 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.533e+02 1.827e+02 1.965e+02 2.130e+02 3.478e+02, threshold=3.930e+02, percent-clipped=0.0 2023-10-15 02:50:11,525 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1893294.6666666667, ans=0.125 2023-10-15 02:50:20,233 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1893341.3333333333, ans=0.1 2023-10-15 02:50:21,239 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1893341.3333333333, ans=0.0 2023-10-15 02:50:26,835 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1893341.3333333333, ans=0.2 2023-10-15 02:50:54,893 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1893481.3333333333, ans=0.125 2023-10-15 02:51:14,637 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1893528.0, ans=0.125 2023-10-15 02:51:14,904 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.49 vs. limit=15.0 2023-10-15 02:51:22,617 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1893574.6666666667, ans=0.125 2023-10-15 02:51:26,905 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.55 vs. 
limit=22.5 2023-10-15 02:51:32,832 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1893621.3333333333, ans=0.0 2023-10-15 02:51:48,813 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1893668.0, ans=0.125 2023-10-15 02:52:06,206 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1893714.6666666667, ans=0.09899494936611666 2023-10-15 02:52:13,134 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1893761.3333333333, ans=0.125 2023-10-15 02:52:19,739 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.421e+02 1.794e+02 1.956e+02 2.062e+02 2.960e+02, threshold=3.913e+02, percent-clipped=0.0 2023-10-15 02:52:35,347 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1893854.6666666667, ans=0.5 2023-10-15 02:52:53,050 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1893901.3333333333, ans=0.2 2023-10-15 02:52:57,582 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1893901.3333333333, ans=0.05 2023-10-15 02:53:03,129 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1893948.0, ans=0.0 2023-10-15 02:53:08,932 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1893948.0, ans=0.1 2023-10-15 02:53:16,176 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1893994.6666666667, ans=0.035 2023-10-15 02:53:38,499 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1894088.0, ans=0.2 2023-10-15 02:53:41,947 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1894088.0, ans=0.125 2023-10-15 02:53:43,274 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1894088.0, ans=0.125 2023-10-15 02:54:14,512 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=11.81 vs. limit=22.5 2023-10-15 02:54:18,359 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.601e+02 1.800e+02 1.923e+02 2.098e+02 2.819e+02, threshold=3.847e+02, percent-clipped=0.0 2023-10-15 02:54:38,026 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.62 vs. limit=6.0 2023-10-15 02:54:46,998 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.56 vs. 
limit=12.0 2023-10-15 02:55:07,204 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1894414.6666666667, ans=0.125 2023-10-15 02:55:14,837 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1894461.3333333333, ans=0.0 2023-10-15 02:55:24,117 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1894508.0, ans=0.125 2023-10-15 02:55:39,706 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-15 02:55:41,098 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1894554.6666666667, ans=0.125 2023-10-15 02:55:42,037 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=1894554.6666666667, ans=0.5 2023-10-15 02:55:55,361 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1894601.3333333333, ans=0.125 2023-10-15 02:56:01,217 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1894648.0, ans=0.125 2023-10-15 02:56:01,986 INFO [train.py:1031] (3/4) Epoch 30, batch 10000, loss[loss=0.1971, simple_loss=0.2917, pruned_loss=0.05123, over 16899.00 frames. ], tot_loss[loss=0.184, simple_loss=0.2767, pruned_loss=0.04571, over 32519297.73 frames. ], batch size: 138, lr: 1.14e-03, grad_scale: 32.0 2023-10-15 02:56:21,767 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.461e+02 1.877e+02 2.028e+02 2.256e+02 3.121e+02, threshold=4.055e+02, percent-clipped=0.0 2023-10-15 02:57:23,965 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.18 vs. 
limit=22.5 2023-10-15 02:57:41,342 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1894974.6666666667, ans=0.0 2023-10-15 02:57:42,994 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1894974.6666666667, ans=0.125 2023-10-15 02:58:00,731 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1895068.0, ans=0.0 2023-10-15 02:58:14,801 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1895114.6666666667, ans=0.1 2023-10-15 02:58:38,101 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.493e+02 1.908e+02 2.115e+02 2.307e+02 3.069e+02, threshold=4.230e+02, percent-clipped=0.0 2023-10-15 02:59:13,092 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1895301.3333333333, ans=0.0 2023-10-15 02:59:22,061 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1895348.0, ans=0.2 2023-10-15 02:59:42,001 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1895441.3333333333, ans=0.125 2023-10-15 02:59:49,394 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1895488.0, ans=0.125 2023-10-15 03:00:37,356 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1895628.0, ans=0.0 2023-10-15 03:00:38,000 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.475e+02 1.850e+02 2.085e+02 2.326e+02 2.900e+02, threshold=4.171e+02, percent-clipped=0.0 2023-10-15 03:00:44,637 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1895674.6666666667, ans=0.125 2023-10-15 03:00:52,355 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.50 vs. 
limit=15.0 2023-10-15 03:01:02,254 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1895721.3333333333, ans=0.0 2023-10-15 03:01:25,938 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1895814.6666666667, ans=0.0 2023-10-15 03:01:29,842 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1895814.6666666667, ans=0.125 2023-10-15 03:01:42,863 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-15 03:02:02,744 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1895908.0, ans=0.125 2023-10-15 03:02:08,311 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1895954.6666666667, ans=0.125 2023-10-15 03:03:02,883 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.606e+02 1.904e+02 2.059e+02 2.288e+02 3.525e+02, threshold=4.118e+02, percent-clipped=0.0 2023-10-15 03:03:06,619 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.53 vs. limit=22.5 2023-10-15 03:03:07,753 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1896141.3333333333, ans=0.125 2023-10-15 03:03:25,136 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1896188.0, ans=0.125 2023-10-15 03:03:44,435 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.10 vs. limit=15.0 2023-10-15 03:03:50,993 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1896281.3333333333, ans=0.1 2023-10-15 03:04:05,992 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.41 vs. 
limit=6.0 2023-10-15 03:04:07,655 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1896328.0, ans=0.2 2023-10-15 03:04:12,838 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1896374.6666666667, ans=0.2 2023-10-15 03:04:27,688 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1896421.3333333333, ans=0.125 2023-10-15 03:04:28,680 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-15 03:04:33,907 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1896421.3333333333, ans=0.125 2023-10-15 03:04:42,355 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1896468.0, ans=0.125 2023-10-15 03:05:02,906 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1896561.3333333333, ans=0.0 2023-10-15 03:05:04,871 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1896561.3333333333, ans=0.2 2023-10-15 03:05:11,150 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.482e+02 1.868e+02 2.044e+02 2.228e+02 3.238e+02, threshold=4.087e+02, percent-clipped=0.0 2023-10-15 03:05:40,871 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1896701.3333333333, ans=0.125 2023-10-15 03:06:15,724 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1896794.6666666667, ans=0.0 2023-10-15 03:06:16,710 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-15 03:06:23,774 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1896841.3333333333, ans=0.125 2023-10-15 03:06:36,173 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=4.84 vs. limit=15.0 2023-10-15 03:06:49,801 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1896934.6666666667, ans=0.2 2023-10-15 03:06:50,732 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1896981.3333333333, ans=0.125 2023-10-15 03:06:51,506 INFO [train.py:1031] (3/4) Epoch 30, batch 10500, loss[loss=0.1742, simple_loss=0.2729, pruned_loss=0.03775, over 16922.00 frames. ], tot_loss[loss=0.1843, simple_loss=0.2772, pruned_loss=0.0457, over 32600174.69 frames. 
], batch size: 82, lr: 1.13e-03, grad_scale: 16.0 2023-10-15 03:07:04,576 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1897028.0, ans=0.5 2023-10-15 03:07:07,365 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1897028.0, ans=0.2 2023-10-15 03:07:12,732 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.648e+02 1.831e+02 1.990e+02 2.167e+02 2.921e+02, threshold=3.980e+02, percent-clipped=0.0 2023-10-15 03:07:21,299 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1897074.6666666667, ans=0.125 2023-10-15 03:07:27,266 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1897121.3333333333, ans=0.125 2023-10-15 03:07:33,076 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1897121.3333333333, ans=0.125 2023-10-15 03:07:57,424 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1897214.6666666667, ans=0.0 2023-10-15 03:08:14,640 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1897261.3333333333, ans=0.0 2023-10-15 03:08:17,700 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1897261.3333333333, ans=0.1 2023-10-15 03:08:30,971 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.65 vs. 
limit=15.0 2023-10-15 03:09:41,278 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.580e+02 1.973e+02 2.109e+02 2.347e+02 3.033e+02, threshold=4.218e+02, percent-clipped=0.0 2023-10-15 03:10:05,725 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1897588.0, ans=0.1 2023-10-15 03:10:33,286 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1897681.3333333333, ans=0.1 2023-10-15 03:10:37,495 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1897681.3333333333, ans=0.125 2023-10-15 03:11:37,464 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1897868.0, ans=0.0 2023-10-15 03:11:45,879 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1897914.6666666667, ans=0.125 2023-10-15 03:11:56,565 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-15 03:12:01,227 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.557e+02 1.896e+02 2.026e+02 2.214e+02 3.195e+02, threshold=4.052e+02, percent-clipped=0.0 2023-10-15 03:13:02,990 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1898194.6666666667, ans=0.0 2023-10-15 03:13:10,755 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1898241.3333333333, ans=0.125 2023-10-15 03:13:13,923 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-15 03:13:16,112 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1898241.3333333333, ans=0.125 2023-10-15 03:13:31,059 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.73 vs. limit=15.0 2023-10-15 03:14:11,430 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.627e+02 1.913e+02 2.120e+02 2.407e+02 3.969e+02, threshold=4.240e+02, percent-clipped=0.0 2023-10-15 03:14:17,295 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1898474.6666666667, ans=0.025 2023-10-15 03:14:27,288 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.35 vs. 
limit=15.0 2023-10-15 03:14:30,918 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1898521.3333333333, ans=0.125 2023-10-15 03:15:12,275 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1898614.6666666667, ans=0.1 2023-10-15 03:15:17,902 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1898661.3333333333, ans=0.125 2023-10-15 03:15:33,674 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1898708.0, ans=0.0 2023-10-15 03:15:38,375 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1898708.0, ans=0.05 2023-10-15 03:15:41,795 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.24 vs. limit=22.5 2023-10-15 03:16:12,291 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1898801.3333333333, ans=0.0 2023-10-15 03:16:19,511 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1898848.0, ans=0.125 2023-10-15 03:16:45,301 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.471e+02 1.759e+02 1.949e+02 2.162e+02 3.759e+02, threshold=3.899e+02, percent-clipped=0.0 2023-10-15 03:16:54,230 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1898941.3333333333, ans=0.125 2023-10-15 03:16:54,371 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.97 vs. limit=15.0 2023-10-15 03:17:23,131 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1899034.6666666667, ans=0.1 2023-10-15 03:17:24,209 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1899034.6666666667, ans=0.1 2023-10-15 03:17:29,877 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.06 vs. limit=10.0 2023-10-15 03:17:35,699 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.51 vs. limit=22.5 2023-10-15 03:17:35,821 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.66 vs. limit=15.0 2023-10-15 03:17:36,586 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-15 03:18:03,542 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.74 vs. 
limit=15.0 2023-10-15 03:18:35,806 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1899268.0, ans=0.1 2023-10-15 03:18:41,978 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1899268.0, ans=0.0 2023-10-15 03:18:42,001 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1899268.0, ans=0.125 2023-10-15 03:18:42,854 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1899268.0, ans=0.125 2023-10-15 03:18:44,203 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.76 vs. limit=6.0 2023-10-15 03:18:45,992 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1899314.6666666667, ans=0.1 2023-10-15 03:18:47,020 INFO [train.py:1031] (3/4) Epoch 30, batch 11000, loss[loss=0.1849, simple_loss=0.2766, pruned_loss=0.04658, over 16898.00 frames. ], tot_loss[loss=0.1842, simple_loss=0.2771, pruned_loss=0.04569, over 32639405.25 frames. ], batch size: 77, lr: 1.13e-03, grad_scale: 16.0 2023-10-15 03:19:00,320 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1899361.3333333333, ans=0.125 2023-10-15 03:19:03,321 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1899361.3333333333, ans=0.0 2023-10-15 03:19:14,248 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.547e+02 1.888e+02 1.999e+02 2.288e+02 2.873e+02, threshold=3.997e+02, percent-clipped=0.0 2023-10-15 03:19:45,354 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-15 03:19:45,462 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1899501.3333333333, ans=0.1 2023-10-15 03:19:46,978 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1899501.3333333333, ans=0.025 2023-10-15 03:20:04,519 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1899548.0, ans=0.125 2023-10-15 03:20:54,996 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1899734.6666666667, ans=0.1 2023-10-15 03:20:56,708 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1899734.6666666667, ans=0.125 2023-10-15 03:21:02,518 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.52 vs. 
limit=22.5 2023-10-15 03:21:17,927 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1899781.3333333333, ans=0.0 2023-10-15 03:21:30,144 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1899828.0, ans=0.125 2023-10-15 03:21:46,319 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.409e+02 1.901e+02 2.059e+02 2.325e+02 3.331e+02, threshold=4.118e+02, percent-clipped=0.0 2023-10-15 03:21:46,643 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1899874.6666666667, ans=0.125 2023-10-15 03:21:58,623 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1899874.6666666667, ans=0.09899494936611666 2023-10-15 03:22:09,870 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1899921.3333333333, ans=0.0 2023-10-15 03:22:19,026 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1899921.3333333333, ans=0.125 2023-10-15 03:23:03,884 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.57 vs. limit=12.0 2023-10-15 03:23:41,743 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=3.81 vs. limit=10.0 2023-10-15 03:23:47,169 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1900154.6666666667, ans=0.125 2023-10-15 03:24:02,636 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1900201.3333333333, ans=0.125 2023-10-15 03:24:04,894 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-15 03:24:09,062 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.58 vs. 
limit=22.5 2023-10-15 03:24:14,570 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1900248.0, ans=0.125 2023-10-15 03:24:19,973 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1900294.6666666667, ans=0.0 2023-10-15 03:24:28,750 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1900294.6666666667, ans=0.09899494936611666 2023-10-15 03:24:33,837 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.534e+02 1.743e+02 1.887e+02 2.069e+02 2.808e+02, threshold=3.773e+02, percent-clipped=0.0 2023-10-15 03:24:50,756 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1900341.3333333333, ans=0.1 2023-10-15 03:25:04,373 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1900388.0, ans=0.0 2023-10-15 03:25:08,973 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1900434.6666666667, ans=0.1 2023-10-15 03:25:13,293 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1900434.6666666667, ans=0.0 2023-10-15 03:25:29,133 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1900481.3333333333, ans=0.0 2023-10-15 03:25:35,747 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1900528.0, ans=0.0 2023-10-15 03:25:35,780 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1900528.0, ans=0.125 2023-10-15 03:25:45,287 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1900528.0, ans=0.125 2023-10-15 03:25:46,744 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1900528.0, ans=0.125 2023-10-15 03:26:03,928 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1900621.3333333333, ans=0.0 2023-10-15 03:26:32,142 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1900714.6666666667, ans=0.0 2023-10-15 03:26:44,665 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1900714.6666666667, ans=0.125 2023-10-15 03:27:01,773 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.29 vs. 
limit=15.0 2023-10-15 03:27:04,306 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.583e+02 1.846e+02 2.049e+02 2.325e+02 2.789e+02, threshold=4.098e+02, percent-clipped=0.0 2023-10-15 03:27:35,239 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1900901.3333333333, ans=0.125 2023-10-15 03:28:06,549 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1900994.6666666667, ans=0.2 2023-10-15 03:28:37,499 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1901088.0, ans=0.2 2023-10-15 03:28:43,426 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1901088.0, ans=0.125 2023-10-15 03:28:48,216 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.33 vs. limit=22.5 2023-10-15 03:29:12,711 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.12 vs. limit=12.0 2023-10-15 03:29:16,615 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1901181.3333333333, ans=0.125 2023-10-15 03:29:25,964 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.73 vs. limit=15.0 2023-10-15 03:29:44,419 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1901274.6666666667, ans=0.125 2023-10-15 03:29:44,968 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.607e+02 1.880e+02 2.006e+02 2.240e+02 3.317e+02, threshold=4.011e+02, percent-clipped=0.0 2023-10-15 03:30:06,716 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1901321.3333333333, ans=0.125 2023-10-15 03:30:09,860 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1901321.3333333333, ans=0.125 2023-10-15 03:30:54,506 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1901414.6666666667, ans=0.125 2023-10-15 03:30:57,811 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1901461.3333333333, ans=0.125 2023-10-15 03:31:00,560 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1901461.3333333333, ans=0.125 2023-10-15 03:31:42,232 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1901601.3333333333, ans=0.07 2023-10-15 03:31:46,594 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.92 vs. limit=6.0 2023-10-15 03:31:57,154 INFO [train.py:1031] (3/4) Epoch 30, batch 11500, loss[loss=0.1814, simple_loss=0.2807, pruned_loss=0.04103, over 16886.00 frames. ], tot_loss[loss=0.1841, simple_loss=0.2769, pruned_loss=0.04567, over 32647989.87 frames. 
], batch size: 165, lr: 1.13e-03, grad_scale: 16.0 2023-10-15 03:32:27,294 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1901694.6666666667, ans=0.125 2023-10-15 03:32:30,343 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.732e+02 1.992e+02 2.212e+02 2.432e+02 3.211e+02, threshold=4.424e+02, percent-clipped=0.0 2023-10-15 03:32:39,430 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1901741.3333333333, ans=0.125 2023-10-15 03:32:49,823 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1901788.0, ans=0.0 2023-10-15 03:33:06,387 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1901834.6666666667, ans=0.0 2023-10-15 03:33:10,787 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.96 vs. limit=6.0 2023-10-15 03:33:29,983 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.28 vs. limit=15.0 2023-10-15 03:33:36,643 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1901928.0, ans=0.125 2023-10-15 03:33:40,476 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.46 vs. limit=12.0 2023-10-15 03:33:42,433 INFO [scaling.py:979] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.54 vs. limit=5.0 2023-10-15 03:33:45,678 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1901974.6666666667, ans=0.2 2023-10-15 03:33:49,065 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1901974.6666666667, ans=0.125 2023-10-15 03:33:55,770 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1901974.6666666667, ans=0.125 2023-10-15 03:34:01,927 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1902021.3333333333, ans=0.2 2023-10-15 03:34:17,602 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.22 vs. 
limit=15.0 2023-10-15 03:34:18,797 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1902068.0, ans=0.125 2023-10-15 03:34:44,518 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1902114.6666666667, ans=0.2 2023-10-15 03:34:52,707 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1902161.3333333333, ans=0.0 2023-10-15 03:34:53,643 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1902161.3333333333, ans=0.09899494936611666 2023-10-15 03:35:03,038 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.560e+02 1.839e+02 2.030e+02 2.267e+02 3.028e+02, threshold=4.060e+02, percent-clipped=0.0 2023-10-15 03:35:07,373 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1902208.0, ans=0.0 2023-10-15 03:35:47,580 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1902301.3333333333, ans=0.125 2023-10-15 03:35:53,657 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1902348.0, ans=0.2 2023-10-15 03:35:57,627 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1902348.0, ans=0.125 2023-10-15 03:36:43,463 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1902441.3333333333, ans=0.125 2023-10-15 03:36:47,272 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1902488.0, ans=0.09899494936611666 2023-10-15 03:36:48,518 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1902488.0, ans=0.0 2023-10-15 03:37:22,067 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1902534.6666666667, ans=0.125 2023-10-15 03:37:31,638 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1902581.3333333333, ans=0.125 2023-10-15 03:37:41,120 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1902581.3333333333, ans=0.125 2023-10-15 03:37:50,986 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1902628.0, ans=0.0 2023-10-15 03:37:53,322 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-15 03:38:04,806 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.558e+02 1.804e+02 1.942e+02 2.177e+02 3.091e+02, threshold=3.884e+02, percent-clipped=0.0 2023-10-15 03:38:05,332 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1902674.6666666667, ans=0.0 2023-10-15 03:38:22,940 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1902721.3333333333, ans=0.125 2023-10-15 03:38:30,693 INFO [scaling.py:199] (3/4) ScheduledFloat: 
name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1902721.3333333333, ans=0.125 2023-10-15 03:39:35,336 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1902861.3333333333, ans=0.2 2023-10-15 03:39:50,256 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1902861.3333333333, ans=0.0 2023-10-15 03:39:50,608 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=3.78 vs. limit=10.0 2023-10-15 03:39:54,012 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1902908.0, ans=0.2 2023-10-15 03:40:25,965 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1902954.6666666667, ans=0.0 2023-10-15 03:40:34,980 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1902954.6666666667, ans=0.2 2023-10-15 03:41:07,422 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1903001.3333333333, ans=0.0 2023-10-15 03:41:49,842 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1903048.0, ans=0.125 2023-10-15 03:42:30,294 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.611e+02 1.876e+02 2.081e+02 2.298e+02 3.208e+02, threshold=4.162e+02, percent-clipped=0.0 2023-10-15 03:42:49,805 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.82 vs. 
limit=6.0 2023-10-15 03:43:25,556 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1903188.0, ans=0.0 2023-10-15 03:43:44,071 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1903234.6666666667, ans=0.2 2023-10-15 03:43:48,802 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=1903234.6666666667, ans=10.0 2023-10-15 03:44:01,089 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1903281.3333333333, ans=0.125 2023-10-15 03:44:05,323 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1903281.3333333333, ans=0.125 2023-10-15 03:45:22,182 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1903421.3333333333, ans=0.2 2023-10-15 03:45:23,512 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1903421.3333333333, ans=0.125 2023-10-15 03:45:34,280 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1903468.0, ans=0.07 2023-10-15 03:45:38,468 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1903468.0, ans=0.125 2023-10-15 03:46:16,977 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1903514.6666666667, ans=0.125 2023-10-15 03:46:16,983 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1903514.6666666667, ans=0.0 2023-10-15 03:46:49,360 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1903561.3333333333, ans=0.125 2023-10-15 03:46:53,357 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-15 03:47:01,155 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.551e+02 1.893e+02 2.098e+02 2.406e+02 4.118e+02, threshold=4.196e+02, percent-clipped=0.0 2023-10-15 03:47:03,443 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1903608.0, ans=0.2 2023-10-15 03:47:14,172 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten.whitening_limit, batch_count=1903608.0, ans=15.0 2023-10-15 03:47:48,303 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1903701.3333333333, ans=0.125 2023-10-15 03:48:03,257 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1903701.3333333333, ans=0.0 2023-10-15 03:48:25,600 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1903748.0, ans=0.2 2023-10-15 03:48:28,192 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.73 vs. 
limit=12.0 2023-10-15 03:48:36,252 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.57 vs. limit=10.0 2023-10-15 03:49:04,093 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1903841.3333333333, ans=0.125 2023-10-15 03:50:11,126 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1903934.6666666667, ans=0.125 2023-10-15 03:50:25,103 INFO [train.py:1031] (3/4) Epoch 30, batch 12000, loss[loss=0.1775, simple_loss=0.2807, pruned_loss=0.03712, over 16841.00 frames. ], tot_loss[loss=0.1839, simple_loss=0.2768, pruned_loss=0.04543, over 32679873.75 frames. ], batch size: 98, lr: 1.13e-03, grad_scale: 32.0 2023-10-15 03:50:29,342 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1903981.3333333333, ans=0.95 2023-10-15 03:51:19,563 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.626e+02 1.815e+02 2.019e+02 2.264e+02 3.625e+02, threshold=4.038e+02, percent-clipped=0.0 2023-10-15 03:51:31,232 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.83 vs. limit=15.0 2023-10-15 03:51:58,412 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1904121.3333333333, ans=0.1 2023-10-15 03:52:36,667 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1904168.0, ans=0.125 2023-10-15 03:52:51,468 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1904214.6666666667, ans=0.125 2023-10-15 03:52:55,324 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1904214.6666666667, ans=0.0 2023-10-15 03:53:09,064 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1904261.3333333333, ans=0.1 2023-10-15 03:53:29,630 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.28 vs. limit=15.0 2023-10-15 03:53:34,915 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1904308.0, ans=0.0 2023-10-15 03:53:40,657 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1904308.0, ans=0.2 2023-10-15 03:53:51,086 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1904354.6666666667, ans=0.125 2023-10-15 03:54:05,240 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=7.48 vs. limit=15.0 2023-10-15 03:54:05,410 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.41 vs. 
limit=10.0 2023-10-15 03:54:35,589 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1904494.6666666667, ans=0.125 2023-10-15 03:54:45,657 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.502e+02 1.793e+02 1.980e+02 2.129e+02 2.949e+02, threshold=3.961e+02, percent-clipped=0.0 2023-10-15 03:54:51,975 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1904541.3333333333, ans=0.04949747468305833 2023-10-15 03:54:57,874 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1904588.0, ans=0.125 2023-10-15 03:55:19,950 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1904681.3333333333, ans=0.125 2023-10-15 03:55:21,456 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1904681.3333333333, ans=0.015 2023-10-15 03:55:27,706 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.55 vs. limit=6.0 2023-10-15 03:55:47,567 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1904774.6666666667, ans=0.125 2023-10-15 03:55:56,005 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.69 vs. limit=6.0 2023-10-15 03:56:11,148 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1904868.0, ans=0.0 2023-10-15 03:56:13,796 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1904868.0, ans=0.125 2023-10-15 03:56:13,852 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1904868.0, ans=0.09899494936611666 2023-10-15 03:56:13,904 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1904868.0, ans=0.0 2023-10-15 03:56:14,894 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1904914.6666666667, ans=0.125 2023-10-15 03:56:23,294 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1904914.6666666667, ans=0.0 2023-10-15 03:56:23,638 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.06 vs. 
limit=15.0 2023-10-15 03:56:41,622 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.625e+02 1.921e+02 2.102e+02 2.274e+02 3.124e+02, threshold=4.204e+02, percent-clipped=0.0 2023-10-15 03:56:47,404 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1905008.0, ans=0.07 2023-10-15 03:56:50,311 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1905054.6666666667, ans=0.125 2023-10-15 03:56:52,109 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1905054.6666666667, ans=0.125 2023-10-15 03:56:52,985 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1905054.6666666667, ans=0.125 2023-10-15 03:57:00,063 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1905054.6666666667, ans=0.125 2023-10-15 03:57:17,692 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1905148.0, ans=0.0 2023-10-15 03:57:38,929 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.45 vs. limit=22.5 2023-10-15 03:57:42,553 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-15 03:57:44,288 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-15 03:57:45,444 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1905241.3333333333, ans=0.125 2023-10-15 03:57:45,654 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=9.69 vs. limit=22.5 2023-10-15 03:57:47,478 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.75 vs. limit=15.0 2023-10-15 03:58:03,184 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1905334.6666666667, ans=0.125 2023-10-15 03:58:08,140 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1905334.6666666667, ans=0.125 2023-10-15 03:58:11,331 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.32 vs. limit=15.0 2023-10-15 03:58:20,853 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1905428.0, ans=0.125 2023-10-15 03:58:34,636 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.636e+02 1.929e+02 2.061e+02 2.264e+02 5.478e+02, threshold=4.122e+02, percent-clipped=1.0 2023-10-15 03:58:47,638 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.45 vs. 
limit=6.0 2023-10-15 03:58:53,443 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1905521.3333333333, ans=0.05 2023-10-15 03:59:23,334 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.81 vs. limit=6.0 2023-10-15 03:59:30,427 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1905661.3333333333, ans=0.1 2023-10-15 03:59:30,459 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1905661.3333333333, ans=0.125 2023-10-15 03:59:36,940 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1905708.0, ans=0.125 2023-10-15 03:59:36,994 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1905708.0, ans=0.1 2023-10-15 03:59:48,778 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.31 vs. limit=15.0 2023-10-15 03:59:50,520 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1905754.6666666667, ans=0.125 2023-10-15 04:00:35,425 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1905894.6666666667, ans=0.07 2023-10-15 04:00:40,104 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1905894.6666666667, ans=0.1 2023-10-15 04:00:44,277 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1905941.3333333333, ans=0.125 2023-10-15 04:00:44,929 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.572e+02 1.888e+02 2.057e+02 2.197e+02 2.884e+02, threshold=4.114e+02, percent-clipped=0.0 2023-10-15 04:00:57,565 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.80 vs. limit=15.0 2023-10-15 04:00:59,937 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1905988.0, ans=0.0 2023-10-15 04:01:03,271 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-15 04:01:34,372 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1906128.0, ans=0.2 2023-10-15 04:01:35,728 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1906128.0, ans=0.125 2023-10-15 04:01:50,924 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=5.60 vs. limit=15.0 2023-10-15 04:02:20,607 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1906314.6666666667, ans=0.0 2023-10-15 04:02:22,051 INFO [train.py:1031] (3/4) Epoch 30, batch 12500, loss[loss=0.1985, simple_loss=0.2955, pruned_loss=0.05071, over 16735.00 frames. ], tot_loss[loss=0.184, simple_loss=0.2768, pruned_loss=0.04565, over 32684766.71 frames. 
], batch size: 202, lr: 1.13e-03, grad_scale: 32.0 2023-10-15 04:02:35,517 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1906361.3333333333, ans=0.0 2023-10-15 04:02:36,530 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1906361.3333333333, ans=0.2 2023-10-15 04:02:37,480 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1906361.3333333333, ans=0.0 2023-10-15 04:02:37,482 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1906361.3333333333, ans=0.2 2023-10-15 04:02:40,378 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1906361.3333333333, ans=0.0 2023-10-15 04:02:45,827 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.472e+02 1.864e+02 1.980e+02 2.123e+02 2.771e+02, threshold=3.960e+02, percent-clipped=0.0 2023-10-15 04:03:24,420 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.35 vs. limit=22.5 2023-10-15 04:03:45,918 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1906641.3333333333, ans=0.2 2023-10-15 04:04:01,909 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1906734.6666666667, ans=0.1 2023-10-15 04:04:07,853 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1906734.6666666667, ans=0.125 2023-10-15 04:04:13,457 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.43 vs. limit=6.0 2023-10-15 04:04:19,205 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.26 vs. limit=15.0 2023-10-15 04:04:22,653 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1906781.3333333333, ans=0.1 2023-10-15 04:04:28,944 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1906828.0, ans=0.1 2023-10-15 04:04:38,809 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.580e+02 1.822e+02 2.011e+02 2.232e+02 2.990e+02, threshold=4.023e+02, percent-clipped=0.0 2023-10-15 04:04:56,608 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=11.34 vs. 
limit=15.0 2023-10-15 04:04:59,321 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1906968.0, ans=0.125 2023-10-15 04:05:08,617 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1906968.0, ans=0.015 2023-10-15 04:05:14,906 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1907014.6666666667, ans=0.125 2023-10-15 04:05:33,393 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1907061.3333333333, ans=0.125 2023-10-15 04:05:39,937 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1907108.0, ans=0.125 2023-10-15 04:05:45,430 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1907108.0, ans=0.125 2023-10-15 04:06:17,636 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1907248.0, ans=0.125 2023-10-15 04:06:21,029 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.52 vs. limit=15.0 2023-10-15 04:06:32,610 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1907294.6666666667, ans=0.1 2023-10-15 04:06:38,714 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.629e+02 1.845e+02 1.981e+02 2.125e+02 2.952e+02, threshold=3.963e+02, percent-clipped=0.0 2023-10-15 04:06:59,094 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1907434.6666666667, ans=0.0 2023-10-15 04:07:03,744 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.72 vs. limit=15.0 2023-10-15 04:07:14,600 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1907481.3333333333, ans=0.125 2023-10-15 04:08:07,010 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.75 vs. limit=22.5 2023-10-15 04:08:07,487 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1907668.0, ans=0.125 2023-10-15 04:08:17,631 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1907714.6666666667, ans=0.125 2023-10-15 04:08:20,384 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1907714.6666666667, ans=0.125 2023-10-15 04:08:27,165 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=3.82 vs. 
limit=15.0 2023-10-15 04:08:38,289 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.552e+02 1.853e+02 2.074e+02 2.290e+02 3.022e+02, threshold=4.147e+02, percent-clipped=0.0 2023-10-15 04:08:51,852 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1907854.6666666667, ans=0.125 2023-10-15 04:09:17,180 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1907948.0, ans=0.2 2023-10-15 04:09:20,975 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1907948.0, ans=0.0 2023-10-15 04:09:39,086 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1908041.3333333333, ans=0.125 2023-10-15 04:09:53,603 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1908088.0, ans=0.0 2023-10-15 04:09:53,691 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1908088.0, ans=0.125 2023-10-15 04:09:53,743 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1908088.0, ans=0.0 2023-10-15 04:09:58,651 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.52 vs. limit=15.0 2023-10-15 04:10:13,244 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1908181.3333333333, ans=0.125 2023-10-15 04:10:21,505 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1908181.3333333333, ans=0.07 2023-10-15 04:10:24,538 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1908228.0, ans=0.125 2023-10-15 04:10:30,049 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.57 vs. 
limit=15.0 2023-10-15 04:10:40,280 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.512e+02 1.822e+02 2.003e+02 2.172e+02 3.224e+02, threshold=4.006e+02, percent-clipped=0.0 2023-10-15 04:10:40,735 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1908274.6666666667, ans=0.1 2023-10-15 04:10:46,392 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1908321.3333333333, ans=0.125 2023-10-15 04:11:01,354 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1908368.0, ans=0.125 2023-10-15 04:11:04,704 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1908368.0, ans=0.125 2023-10-15 04:11:06,008 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1908368.0, ans=0.1 2023-10-15 04:11:39,044 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1908508.0, ans=0.125 2023-10-15 04:11:53,425 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1908554.6666666667, ans=0.09899494936611666 2023-10-15 04:11:58,194 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=1908601.3333333333, ans=0.025 2023-10-15 04:12:08,858 INFO [train.py:1031] (3/4) Epoch 30, batch 13000, loss[loss=0.1905, simple_loss=0.2871, pruned_loss=0.04691, over 16833.00 frames. ], tot_loss[loss=0.1845, simple_loss=0.2775, pruned_loss=0.04578, over 32733877.74 frames. ], batch size: 155, lr: 1.13e-03, grad_scale: 16.0 2023-10-15 04:12:16,536 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.17 vs. 
limit=6.0 2023-10-15 04:12:40,729 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1908741.3333333333, ans=0.0 2023-10-15 04:12:41,514 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.588e+02 1.881e+02 2.009e+02 2.323e+02 3.405e+02, threshold=4.018e+02, percent-clipped=0.0 2023-10-15 04:12:46,496 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1908741.3333333333, ans=0.0 2023-10-15 04:13:15,126 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1908834.6666666667, ans=0.125 2023-10-15 04:13:21,279 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1908881.3333333333, ans=0.125 2023-10-15 04:13:26,731 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1908881.3333333333, ans=0.0 2023-10-15 04:14:02,778 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1909021.3333333333, ans=0.5 2023-10-15 04:14:04,599 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1909021.3333333333, ans=0.125 2023-10-15 04:14:09,190 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.70 vs. limit=10.0 2023-10-15 04:14:24,468 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=3.82 vs. limit=10.0 2023-10-15 04:14:30,234 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1909114.6666666667, ans=0.1 2023-10-15 04:14:34,517 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1909114.6666666667, ans=0.125 2023-10-15 04:14:37,488 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1909161.3333333333, ans=0.125 2023-10-15 04:14:41,699 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.45 vs. limit=22.5 2023-10-15 04:14:47,600 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.61 vs. limit=15.0 2023-10-15 04:14:56,469 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.587e+02 1.904e+02 2.113e+02 2.350e+02 3.214e+02, threshold=4.226e+02, percent-clipped=0.0 2023-10-15 04:15:08,998 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1909254.6666666667, ans=0.2 2023-10-15 04:15:12,198 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1909254.6666666667, ans=0.0 2023-10-15 04:15:35,097 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.51 vs. 
limit=10.0 2023-10-15 04:15:39,335 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1909394.6666666667, ans=0.1 2023-10-15 04:15:56,011 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1909441.3333333333, ans=0.125 2023-10-15 04:16:00,899 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.89 vs. limit=12.0 2023-10-15 04:16:04,363 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1909441.3333333333, ans=0.2 2023-10-15 04:16:19,550 INFO [scaling.py:1069] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-15 04:16:20,631 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1909488.0, ans=0.125 2023-10-15 04:16:23,435 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.51 vs. limit=15.0 2023-10-15 04:17:01,886 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.478e+02 1.822e+02 1.959e+02 2.153e+02 3.011e+02, threshold=3.917e+02, percent-clipped=0.0 2023-10-15 04:17:07,763 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.min_positive, batch_count=1909721.3333333333, ans=0.025 2023-10-15 04:17:16,045 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1909721.3333333333, ans=0.125 2023-10-15 04:17:29,920 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1909768.0, ans=0.2 2023-10-15 04:17:30,876 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1909768.0, ans=0.2 2023-10-15 04:17:30,917 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1909768.0, ans=0.125 2023-10-15 04:17:48,059 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.38 vs. limit=15.0 2023-10-15 04:17:58,348 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1909908.0, ans=0.1 2023-10-15 04:18:07,955 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=14.31 vs. 
limit=22.5 2023-10-15 04:18:14,422 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1909954.6666666667, ans=0.1 2023-10-15 04:18:47,525 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1910094.6666666667, ans=0.125 2023-10-15 04:19:06,581 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.641e+02 1.916e+02 2.061e+02 2.277e+02 3.457e+02, threshold=4.122e+02, percent-clipped=0.0 2023-10-15 04:19:07,922 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1910141.3333333333, ans=0.125 2023-10-15 04:19:09,427 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.60 vs. limit=6.0 2023-10-15 04:19:14,950 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1910188.0, ans=0.125 2023-10-15 04:19:25,201 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten.whitening_limit, batch_count=1910234.6666666667, ans=15.0 2023-10-15 04:19:25,379 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.19 vs. limit=22.5 2023-10-15 04:19:38,938 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1910281.3333333333, ans=0.125 2023-10-15 04:20:06,496 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1910374.6666666667, ans=0.0 2023-10-15 04:20:06,696 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.78 vs. limit=10.0 2023-10-15 04:20:12,601 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.42 vs. 
limit=15.0 2023-10-15 04:20:15,854 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1910421.3333333333, ans=0.0 2023-10-15 04:20:24,943 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1910421.3333333333, ans=0.125 2023-10-15 04:20:39,054 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1910514.6666666667, ans=0.125 2023-10-15 04:20:50,867 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1910514.6666666667, ans=0.125 2023-10-15 04:20:53,139 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1910561.3333333333, ans=0.1 2023-10-15 04:20:58,122 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1910561.3333333333, ans=0.2 2023-10-15 04:21:04,256 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1910608.0, ans=0.125 2023-10-15 04:21:11,518 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.536e+02 1.854e+02 2.015e+02 2.235e+02 2.698e+02, threshold=4.029e+02, percent-clipped=0.0 2023-10-15 04:21:20,026 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1910654.6666666667, ans=0.04949747468305833 2023-10-15 04:21:20,065 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1910654.6666666667, ans=0.125 2023-10-15 04:21:23,506 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.56 vs. limit=15.0 2023-10-15 04:22:09,690 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1910841.3333333333, ans=0.0 2023-10-15 04:22:12,191 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.16 vs. limit=22.5 2023-10-15 04:22:17,961 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1910888.0, ans=0.0 2023-10-15 04:22:29,957 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1910934.6666666667, ans=0.1 2023-10-15 04:22:41,649 INFO [train.py:1031] (3/4) Epoch 30, batch 13500, loss[loss=0.1987, simple_loss=0.2876, pruned_loss=0.05485, over 16829.00 frames. ], tot_loss[loss=0.184, simple_loss=0.2769, pruned_loss=0.04554, over 32766834.34 frames. ], batch size: 67, lr: 1.13e-03, grad_scale: 16.0 2023-10-15 04:22:41,922 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1910981.3333333333, ans=0.125 2023-10-15 04:22:43,048 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1910981.3333333333, ans=0.0 2023-10-15 04:22:44,234 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.90 vs. 
limit=15.0 2023-10-15 04:22:45,142 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1910981.3333333333, ans=0.125 2023-10-15 04:23:02,973 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1911028.0, ans=0.2 2023-10-15 04:23:11,645 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.569e+02 1.864e+02 2.000e+02 2.170e+02 3.073e+02, threshold=4.000e+02, percent-clipped=0.0 2023-10-15 04:23:41,774 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1911214.6666666667, ans=0.1 2023-10-15 04:23:48,368 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1911214.6666666667, ans=0.125 2023-10-15 04:24:03,836 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.75 vs. limit=15.0 2023-10-15 04:24:13,749 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1911308.0, ans=0.0 2023-10-15 04:24:28,590 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1911401.3333333333, ans=0.125 2023-10-15 04:25:07,244 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.552e+02 1.849e+02 2.027e+02 2.303e+02 3.665e+02, threshold=4.053e+02, percent-clipped=0.0 2023-10-15 04:25:07,491 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1911541.3333333333, ans=0.1 2023-10-15 04:25:13,405 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1911588.0, ans=0.125 2023-10-15 04:25:14,112 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1911588.0, ans=0.125 2023-10-15 04:25:19,840 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1911588.0, ans=0.05 2023-10-15 04:25:25,129 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1911634.6666666667, ans=0.125 2023-10-15 04:25:36,180 INFO [scaling.py:199] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1911681.3333333333, ans=0.1 2023-10-15 04:25:37,315 INFO [scaling.py:979] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.93 vs. limit=22.5 2023-10-15 04:25:40,281 INFO [train.py:1246] (3/4) Done!
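A note on the recurring [optim.py:471] records above: each "Clipping_scale=2.0, grad-norm quartiles a b c d e, threshold=t, percent-clipped=p" entry reports a five-number summary of recently observed gradient norms (by their ordering, apparently min, 25%, median, 75%, and max), and in every record in this section the logged threshold is twice the logged middle value, i.e. Clipping_scale times the running median (the first such record logs a median of 1.999e+02 with threshold=3.997e+02, and the single record with percent-clipped=1.0 logs a median of 2.061e+02 with threshold=4.122e+02, its max of 5.478e+02 being the only value above that threshold). The following is a minimal sketch of that bookkeeping under stated assumptions, not the actual icefall optim.py implementation; GradNormClipper, window, and summary are hypothetical names, and the resetting of the clip counter between reports is assumed.

    import statistics
    from collections import deque

    class GradNormClipper:
        """Sketch of the bookkeeping behind the 'grad-norm quartiles ...
        threshold=...' records: keep a window of recent gradient norms and
        derive the clipping threshold as clipping_scale * running median.
        Illustrative only; not the actual icefall optim.py code."""

        def __init__(self, clipping_scale=2.0, window=128):
            self.clipping_scale = clipping_scale
            self.norms = deque(maxlen=window)  # recent per-batch grad norms
            self.clipped = 0                   # norms that exceeded the threshold
            self.seen = 0

        def update(self, grad_norm):
            # Threshold tracks the running median, matching the log, where
            # every record has threshold == Clipping_scale * median.
            self.norms.append(grad_norm)
            self.seen += 1
            threshold = self.clipping_scale * statistics.median(self.norms)
            if grad_norm > threshold:
                self.clipped += 1
            return threshold

        def summary(self):
            # Five-number summary in the logged order (min, 25%, median,
            # 75%, max) plus percent-clipped; the counter reset between
            # reports is an assumption about the logging behaviour.
            q1, med, q3 = statistics.quantiles(self.norms, n=4)
            pct = 100.0 * self.clipped / max(self.seen, 1)
            self.clipped = 0
            self.seen = 0
            return (min(self.norms), q1, med, q3, max(self.norms), pct)

    # Example with hypothetical norms near the magnitudes logged above:
    clipper = GradNormClipper(clipping_scale=2.0)
    for norm in [154.7, 188.8, 199.9, 228.8, 287.3]:
        clipper.update(norm)
    print(clipper.summary())

Under these assumptions, norms of the magnitude logged in this section (roughly 1.5e+02 to 3e+02) all stay below twice the running median, which is consistent with percent-clipped remaining at 0.0 in every record here except the one noted above.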